Hello,
I've extended the ContentHandlerDecorator following the example given here:
https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ContentHandlerExample.java
(please look at the example at the end of this page).
In my characters() method, I read the characters and write them to an
XML file that I then post to Solr.
public void characters(char[] ch, int start, int length)
{
}
I extended the ContentHandlerDecorator because the file I'm reading from
is too large, so I want to break it up into chunks and put them in
different Solr document fields.
Some of the files I'm reading from have special characters. Before I
write to the file, I create a String from the char[] array. I want to make
sure the newly created strings are all in UTF-8 format (since Solr only
accepts XML in UTF-8). I tried converting the char[] to a byte[] and
finally to a string in UTF-8 format, but that did not work.
Charset.forName("UTF-8").encode(myString)
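A minimal sketch of that round trip, in case it helps to see it spelled
out. The class and method names are hypothetical. As far as I can tell it
just hands back the same string: Java strings are UTF-16 internally, so
the encoding only matters at the point where bytes are actually written.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTripSketch {

    // Hypothetical helper: char[] slice -> String -> UTF-8 bytes -> String.
    public static String roundTrip(char[] ch, int start, int length) {
        String s = new String(ch, start, length);
        // Encode the string to UTF-8 bytes.
        ByteBuffer utf8Bytes = StandardCharsets.UTF_8.encode(s);
        // Decode the same bytes back; the result equals the input string.
        return StandardCharsets.UTF_8.decode(utf8Bytes).toString();
    }

    public static void main(String[] args) {
        char[] text = "caf\u00e9 R&O report".toCharArray();
        String result = roundTrip(text, 0, text.length);
        // The ampersand is still a bare '&' after the round trip.
        System.out.println(result.equals(new String(text)));
    }
}
```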
I also tried writing the file with a FileOutputStream in UTF-8:
new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("outfilename"), "UTF-8"));
When the XML gets posted to Solr, I see this error:
[com.ctc.wstx.exc.WstxLazyException] Unexpected character ' ' (code 32);
expected a semi-colon after the reference for entity 'O' at [row,col
{unknown-source}]: [1,5332]
It seems to be complaining about an ampersand in the text extracted from
a PDF file. How do I get past this issue?
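My current guess is that the text needs XML escaping before it is written,
since a bare '&' starts an entity reference and the parser then expects a
semi-colon. A minimal sketch of what I mean, assuming a hypothetical
helper named escapeXml (my own name, not part of Tika or Solr) called on
the same arguments that characters() receives:

```java
public class XmlEscapeSketch {

    // Hypothetical helper: escapes the five predefined XML entities
    // in the given slice of a char[] and returns the result as a String.
    public static String escapeXml(char[] ch, int start, int length) {
        StringBuilder sb = new StringBuilder(length + 16);
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] text = "R&O report <draft>".toCharArray();
        // prints: R&amp;O report &lt;draft&gt;
        System.out.println(escapeXml(text, 0, text.length));
    }
}
```

This only handles the escaping; it does not deal with characters that are
not allowed in XML at all.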
Any hints/pointers are appreciated.
--
View this message in context:
http://lucene.472066.n3.nabble.com/TIKA-extending-ContentHandlerDecorator-how-to-write-string-in-UTF-8-Format-tp4157974.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.