Hi,

Please avoid cross-posting between dev@ and u...@. Responding only on dev@, as this is mostly related to Tika internals.

From: qubit [mailto:[email protected]]
> First, it appears that the code in TextParser.java thinks it is
> dealing with a file in plain text (isn't that the same as text/plain?)

Correct. (Also, text/plain = plain text).

> However it is output as xhtml with very little processing. I think I
> mentioned before that things like '<' should be translated to '&lt;'
> and '&' should become '&amp;'.

Escaping happens only when a SAX event stream is serialized to a character or a byte stream. The character SAX events produced by a parser aren't supposed to be escaped.

> I noticed the header and footer elements you output for the file.
> But this translation, and probably other insertions, need to be
> made to the text within the file itself, not the header/footer.
> Otherwise the rendered xhtml will be wrong.

I'm not sure what you're referring to here. Can you elaborate?

> I have been trying to make this patch myself by looking at your
> code, which has taken me into the SAX content handlers, and I
> have one question: Is this code considered complete?

Pretty much so. I think the basic SAX event handling machinery in Tika is already quite stable and there aren't any major open design issues in that part of our codebase.

> I find XHTMLContentHandler code, which calls SAFEContentHandler code.
> But I gather these methods have a different purpose than what I am
> looking at. I thought to create a subclass SafeTextContentHandler of
> SafeContentHandler to override the write function and provide the
> necessary replacement strings.

You're talking about entity escaping? There's no need to do this, as the functionality is already there in the Transformer part of JAXP.

More generally it's usually better to use a decorator than a subclass when you want to customize the behavior of a SAX ContentHandler.

BR,

Jukka Zitting
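
To illustrate the two points above, here is a minimal sketch that uses only the JDK's JAXP and SAX classes, not Tika itself. The class name EscapingSketch, the method serialize and the UpperCaseDecorator handler are hypothetical names made up for this example. It feeds unescaped character events into a JAXP TransformerHandler, which does the escaping when it serializes, and it optionally wraps that handler in a decorator (built on XMLFilterImpl) instead of subclassing it.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

class EscapingSketch {

    /**
     * Hypothetical decorator: upper-cases character data and forwards
     * every event to the wrapped handler. XMLFilterImpl already
     * delegates all ContentHandler methods, so only one is overridden.
     */
    static class UpperCaseDecorator extends XMLFilterImpl {
        UpperCaseDecorator(ContentHandler wrapped) {
            setContentHandler(wrapped);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            char[] upper =
                new String(ch, start, length).toUpperCase().toCharArray();
            super.characters(upper, 0, upper.length);
        }
    }

    /**
     * Sends raw (unescaped) text as SAX character events wrapped in a
     * single <p> element and returns the serialized result.
     */
    static String serialize(String text, boolean decorate) throws Exception {
        // Assumes the default TransformerFactory supports SAX handlers,
        // which is the case for the implementation bundled with the JDK.
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = factory.newTransformerHandler();
        serializer.getTransformer()
            .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        // Decorate the serializer rather than subclassing it.
        ContentHandler handler =
            decorate ? new UpperCaseDecorator(serializer) : serializer;

        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length); // no manual escaping
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }
}
```

Calling EscapingSketch.serialize("a < b & c", false) should yield markup in which the text appears as a &lt; b &amp; c, even though the character events themselves carried the raw '<' and '&'. Passing true routes the same events through the decorator first, and the escaping still happens only at the serialization step.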
