TIKA: extending ContentHandlerDecorator - how to write string in UTF-8 Format?

ruby Wed, 10 Sep 2014 11:17:02 -0700

Hello,

I've extended the ContentHandlerDecorator following the example given here:
https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ContentHandlerExample.java
 
(please look at the example at the end of this page).


In my characters() method, I read the characters and write
them to an XML file that is then posted to Solr.
 
public void characters(char[] ch, int start, int length)
{
}

The reason I extended ContentHandlerDecorator is that the file I'm
reading from is too large, so I want to break it up into chunks and put
them in different Solr document fields.
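
For illustration, the chunking idea could look roughly like the sketch below. It is only an assumption about the approach, not the actual handler: the class name ChunkingHandler, the maxChars parameter, and getChunks() are hypothetical, and it extends the JDK's DefaultHandler so that it compiles without Tika. With Tika on the classpath, the same characters() override would go into a ContentHandlerDecorator subclass.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers character data and cuts it into chunks of
// at most maxChars chars, one chunk per intended Solr field.
// Limitation: a chunk boundary may fall between the two halves of a
// surrogate pair; a production version would need to avoid that.
class ChunkingHandler extends DefaultHandler {
    private final int maxChars;
    private final List<String> chunks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    public ChunkingHandler(int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int end = start + length;
        int pos = start;
        while (pos < end) {
            // Copy as much as still fits into the current chunk.
            int room = maxChars - current.length();
            int n = Math.min(room, end - pos);
            current.append(ch, pos, n);
            pos += n;
            if (current.length() == maxChars) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        // Emit whatever is left as the final, possibly shorter, chunk.
        flush();
    }

    private void flush() {
        if (current.length() > 0) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }

    public List<String> getChunks() {
        return chunks;
    }
}
```

Each element of getChunks() would then be written to its own field in the Solr document.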


Some of the files I'm reading from contain special characters. Before I
write to the file, I create a string from the char[] array. I want to make
sure the newly created strings are all in UTF-8 (since Solr only accepts
XML in UTF-8). I tried converting the char[] to a byte[] and finally to a
string in UTF-8, but that did not work:
Charset.forName("UTF-8").encode(myString)

I also tried writing the file through a FileOutputStream in UTF-8:
new BufferedWriter(new OutputStreamWriter(
    new FileOutputStream("outfilename"), "UTF-8"));
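
As a side note on the encoding step: a Java String is held as UTF-16 in memory, so there is no such thing as a "UTF-8 String"; the encoding is chosen at the point where chars become bytes. The sketch below shows that step in isolation. The class Utf8WriteDemo and the method toUtf8 are hypothetical names, and a ByteArrayOutputStream stands in for the FileOutputStream so the result can be inspected.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8WriteDemo {
    // Builds a String from the slice handed to characters() and encodes it
    // as UTF-8 while writing. Replace the ByteArrayOutputStream with a
    // FileOutputStream to write to disk instead.
    public static byte[] toUtf8(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Written this way the output bytes are valid UTF-8, which suggests the encoding itself may not be what the parser rejects.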


When the XML gets posted to Solr, I see this error:
[com.ctc.wstx.exc.WstxLazyException] Unexpected character ' ' (code 32);
expected a semi-colon after the reference for entity 'O' at [row,col
{unknown-source}]: [1,5332]

It's complaining about an ampersand in a PDF file. How do I get past this
issue?
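
For context, the error message is consistent with a raw '&' followed by 'O' and a space being written into the XML, which the parser reads as the start of an entity reference. That points at XML escaping rather than character encoding. A minimal sketch of escaping text before it goes into the file follows; XmlEscape and escape are hypothetical names, and it assumes the XML is being assembled by hand as plain strings.

```java
class XmlEscape {
    // Replaces the five characters that have predefined entities in XML.
    // Escaping '&' is what avoids the "expected a semi-colon" parser error.
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Calling escape() on each string built in characters() before writing it would be one option. Another is to let an XML writer such as javax.xml.stream.XMLStreamWriter produce the file, since its writeCharacters() escapes '&', '<' and '>' itself.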

Any hints/pointers are appreciated.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/TIKA-extending-ContentHandlerDecorator-how-to-write-string-in-UTF-8-Format-tp4157974.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
