Greetings. I have been sifting through the code in TextParser.java and the various content handlers it invokes, and I have some questions.
First, it appears that the code in TextParser.java assumes it is dealing with a plain-text file (isn't that the same as text/plain?). However, the text is output as XHTML with very little processing. I think I mentioned before that characters like '<' should be translated to '&lt;' and '&' should become '&amp;'. I noticed the header and footer elements you output for the file, but this translation, and probably other insertions, needs to be applied to the text within the file itself, not just the header/footer. Otherwise the rendered XHTML will be wrong.

I have been trying to make this patch myself by looking at your code, which has taken me into the SAX content handlers, and I have one question: is this code considered complete? I find the XHTMLContentHandler code, which calls the SafeContentHandler code, but I gather these methods have a different purpose than what I am looking at.

I thought I would create a subclass of SafeContentHandler, called SafeTextContentHandler, that overrides the write function and provides the necessary replacement strings. Alternatively, I could hand-code an extra check inside the existing methods, but I think this would endanger other code that depends on these classes.

Anyway, please comment if you think I'm not approaching it the right way. I wanted to do this myself rather than just report a bug because I want some experience with this source code, partly because I am new to Java and partly because I may be using Tika in a project I'm working on. So any comments are welcome. Thank you.

--le
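
P.S. To make the translation I have in mind concrete, here is a minimal standalone sketch. The class name XhtmlTextEscaper and the method escape are hypothetical names of my own, not anything in Tika, and I have not wired this into the content handlers; it only shows the replacement strings I would expect to be applied to the body text.

```java
// Hypothetical helper, not part of Tika: illustrates the character
// replacements described above for text placed inside XHTML.
final class XhtmlTextEscaper {

    private XhtmlTextEscaper() {
        // static utility, no instances
    }

    // Returns a copy of the input with '&', '<' and '>' replaced by
    // their predefined XML entity references.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

Handling each character in a single pass means the '&' in an emitted '&lt;' is never escaped a second time. If the existing handlers already do this somewhere downstream, then my subclass idea is unnecessary, which is part of what I am asking.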
