tika-user  

Remove headers from the parser

Florent André
Mon, 25 Jan 2010 06:50:36 -0800

Hello, 

I use the method AutoDetectParser.parse(java.io.InputStream stream,
org.xml.sax.ContentHandler handler, Metadata metadata).

I call parse() many times with the same ContentHandler.

My problem is:
- on each parse, Tika sends the "XML header definition"
(<?xml version="1.0" encoding="UTF-8"?>) to the ContentHandler.

This is a problem for me because the repeated header prevents me from
processing the ContentHandler output with a SAX component (a Cocoon transformer).

For example, after using Tika, my output is:
<root>
<documentparse id="1" <?xml version="1.0" encoding="UTF-8"?>>
<html>
... content from tika
</html>
<documentparse id="2" <?xml version="1.0" encoding="UTF-8"?>>
<html>
... content from tika
</html>
</documentparse>

Is there a way to disable sending the XML header?
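
One possible direction, as a rough sketch only and not confirmed against Tika: it assumes the header is written by the downstream serializer each time it receives a startDocument() event, and that each parse() call fires startDocument()/endDocument(). Under that assumption, a wrapper that swallows those two events would keep the shared handler from seeing them. The sketch uses JDK SAX classes only; BodyOnlyHandler, Recorder and fakeParse are hypothetical names, and fakeParse merely stands in for a real parse() call.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SkipDocumentEventsDemo {

    /** Hypothetical wrapper: forwards every SAX event except startDocument/endDocument. */
    static class BodyOnlyHandler extends XMLFilterImpl {
        BodyOnlyHandler(ContentHandler target) {
            setContentHandler(target);
        }

        @Override
        public void startDocument() {
            // Swallowed, so the target never sees a new document start.
        }

        @Override
        public void endDocument() {
            // Swallowed for the same reason.
        }
    }

    /** Stand-in for the shared downstream handler; it only records what it receives. */
    static class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            events.add("end:" + qName);
        }
    }

    /** Simulates the events of one parse (an assumption; this is not Tika itself). */
    static void fakeParse(ContentHandler handler) throws SAXException {
        handler.startDocument();
        handler.startElement("", "html", "html", new AttributesImpl());
        handler.endElement("", "html", "html");
        handler.endDocument();
    }

    /** Runs two simulated parses through the wrapper into one shared handler. */
    static List<String> run() throws SAXException {
        Recorder shared = new Recorder();
        fakeParse(new BodyOnlyHandler(shared));
        fakeParse(new BodyOnlyHandler(shared));
        return shared.events;
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(run());
    }
}
```

With the real parser, the idea would be to pass new BodyOnlyHandler(handler) as the second argument of parse(stream, handler, metadata) instead of the handler itself. Whether this actually removes the header depends on the assumption above holding for the serializer in use.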

Thanks in advance,
++