tika and plain text -- bug or feature?

qubit Wed, 10 Nov 2010 12:35:30 -0800

Greetings.
I have been sifting through the code in TextParser.java and the various 
content handlers it invokes, and I have some questions.


First, it appears that the code in TextParser.java assumes it is dealing 
with a plain-text file (isn't that the same as text/plain?).
However, the content is output as XHTML with very little processing.  I 
think I mentioned before that characters like '<' should be translated to 
'&lt;' and '&' should become '&amp;'.
I noticed the header and footer elements you output for the file.  But this 
translation, and probably other insertions, needs to be applied to the text 
within the file itself, not just the header/footer.  Otherwise the rendered 
XHTML will be wrong.
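
To make the translation concrete, here is a minimal sketch of what I mean.  
It is plain Java with no Tika classes; the class name XhtmlEscape and the 
method escape are names I made up for illustration, and I am assuming only 
the three characters '&', '<' and '>' need handling in element text.

```java
public class XhtmlEscape {

    /** Replaces the characters that are markup-significant in XHTML text. */
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;");  break;
                case '>': sb.append("&gt;");  break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: if (a &lt; b &amp;&amp; c &gt; d)
        System.out.println(escape("if (a < b && c > d)"));
    }
}
```

Each input character is examined exactly once, so an '&' that belongs to a 
replacement such as '&lt;' is never escaped a second time.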

I have been trying to make this patch myself by looking at your code, which 
has taken me into the SAX content handlers, and I have one question:
Is this code considered complete? I find the XHTMLContentHandler code, which 
calls the SafeContentHandler code.
But I gather these methods have a different purpose from the one I have in 
mind.
I thought of creating a subclass of SafeContentHandler, called 
SafeTextContentHandler, that overrides the write function and provides the 
necessary replacement strings.
Or I could just hand-code an extra check inside the existing methods, but I 
think this would endanger other code that depends on these classes.
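
Roughly, the subclass approach would look like the sketch below.  Note that 
PlainWriterHandler is a made-up stand-in built on the standard SAX 
DefaultHandler, not Tika's SafeContentHandler, and its write(char) method is 
my assumption about the kind of hook I would override; I have not 
reproduced the real Tika signatures here.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class EscapingHandlerSketch {

    /** Hypothetical stand-in for an existing handler; NOT a Tika class. */
    public static class PlainWriterHandler extends DefaultHandler {
        protected final Writer out;

        public PlainWriterHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                for (int i = start; i < start + length; i++) {
                    write(ch[i]);
                }
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }

        /** Writes one character unchanged; subclasses may override this. */
        protected void write(char c) throws IOException {
            out.write(c);
        }
    }

    /** Overrides only write(), so the base class and its users are untouched. */
    public static class EscapingWriterHandler extends PlainWriterHandler {
        public EscapingWriterHandler(Writer out) {
            super(out);
        }

        @Override
        protected void write(char c) throws IOException {
            switch (c) {
                case '&': out.write("&amp;"); break;
                case '<': out.write("&lt;");  break;
                case '>': out.write("&gt;");  break;
                default:  out.write(c);
            }
        }
    }

    /** Feeds text to the handler as one characters() event and returns the output. */
    public static String render(DefaultHandler handler, StringWriter out, String text)
            throws SAXException {
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        return out.toString();
    }

    public static void main(String[] args) throws SAXException {
        StringWriter plain = new StringWriter();
        // prints: a < b & c
        System.out.println(render(new PlainWriterHandler(plain), plain, "a < b & c"));

        StringWriter escaped = new StringWriter();
        // prints: a &lt; b &amp; c
        System.out.println(render(new EscapingWriterHandler(escaped), escaped, "a < b & c"));
    }
}
```

The point of the subclass is that existing callers of the base handler keep 
their current behaviour, and only code that explicitly asks for the escaping 
variant gets the replacements.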

Anyway, please comment if you think I'm not approaching it the right way.
I wanted to do this myself rather than just report a bug because I want some 
experience with this source code, partly because I am new to Java and partly 
because I may be using Tika in a project I'm working on.

So any comments are welcome.
Thank you.
--le
