Greetings and thanks for your reply. I'll reply to excerpted fragments.

<<-
Please avoid cross-posting between dev@ and u...@. Responding only on dev@, as this is mostly related to Tika internals.
->>
Sorry about that. I will send the rest of the mail on this thread only to dev.

<<-
> However it is output as xhtml with very little processing. I think I
> mentioned before that things like '<' should be translated to '&lt;'
> and '&' should become '&amp;'.

Escaping happens only when a SAX event stream is serialized to a character or a byte stream. The character SAX events produced by a parser aren't supposed to be escaped.
->>

Then perhaps I am in the wrong place in the code... or I am still not understanding all that SAX is doing.

The translation needs to be done, however, because you are essentially outputting a plain text file as if it were XHTML, with only header and footer elements slapped on the ends. This doesn't work: suppose your plain text file is a tutorial on HTML and contains sample fragments of HTML code. If you output the text without translating certain characters, the fragments will be interpreted as markup, and their source will not appear in the document that the end user sees. A short example:

---- example ----
This is how you write a link in html: <a href="#here">hi there</a>
---- end example ----

If you slap an XHTML header and footer onto this plain text and output it as XHTML, then the end user will see only the link "hi there" and not the markup that produces it. To prevent this, all < symbols should be translated to &lt;. Furthermore, since the ampersand & also prefixes special character codes, Tika should also translate & to &amp;. I do not know if it is necessary to convert the > symbols or the #, " or =. I believe only the less-than sign and the ampersand are essential to translate.

Does this answer your question?

<<-
> I noticed the header and footer elements you output for the file.
> But this translation, and probably other insertions, need to be
> made to the text within the file itself, not the header/footer.
> Otherwise the rendered xhtml will be wrong.

I'm not sure what you're referring to here. Can you elaborate?
->>

See above.
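To make the quoted point about serialization concrete, here is a minimal sketch that uses only the JDK's JAXP classes and no Tika code. The class name EscapeOnSerialize, the method serialize, and the <p> wrapper element are made up for illustration. It passes the raw, unescaped sample text as a SAX characters event to an identity TransformerHandler; any escaping then happens when the handler writes the stream out, which is what the quoted reply describes. This is an assumption-laden demonstration, not a statement of how Tika wires its handlers.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; not part of Tika.
class EscapeOnSerialize {

    /** Sends raw text as one SAX characters event and returns the serialized XML. */
    static String serialize(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            // An identity handler: it only serializes the events it receives.
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            // The text is passed unescaped, as a parser would deliver it.
            char[] chars = text.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the serialized form; '<' and '&' in the text come out as entities.
        System.out.println(serialize("<a href=\"#here\">hi there</a> & more"));
    }
}
```

If the serialized output shows the entities but the final document still renders the link, the text is probably being written by something other than an XML serializer somewhere along the way.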
<<-
> I find XHTMLContentHandler code, which calls SafeContentHandler code.
> But I gather these methods have a different purpose than what I am
> looking at. I thought to create a subclass SafeTextContentHandler of
> SafeContentHandler to override the write function and provide the
> necessary replacement strings.

You're talking about entity escaping? There's no need to do this, as the functionality is already there in the Transformer part of JAXP. More generally it's usually better to use a decorator than a subclass when you want to customize the behavior of a SAX ContentHandler.
->>

OK, this shows how new I am to this code. In your view, is this the right place to make a change? Or am I misunderstanding the purpose of the content handler code?

I appreciate any comments.

--le
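For reference, the decorator approach suggested in the quoted reply could look roughly like the sketch below. It relies only on the JDK's org.xml.sax.helpers.XMLFilterImpl, which forwards every ContentHandler event to a wrapped handler, so only the event of interest needs overriding. The names DecoratorDemo, CountingHandler and countThrough are hypothetical, and counting characters is just a stand-in for whatever customization is actually wanted; Tika's own handler classes are deliberately not used here.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical demo class; not part of Tika.
class DecoratorDemo {

    /** Decorator: counts characters, then forwards the event unchanged. */
    static class CountingHandler extends XMLFilterImpl {
        private long count;

        CountingHandler(ContentHandler delegate) {
            // XMLFilterImpl forwards all ContentHandler events to this delegate.
            setContentHandler(delegate);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            count += length;
            super.characters(ch, start, length);
        }

        long getCount() {
            return count;
        }
    }

    /** Pushes text through the decorator into sink and returns the count. */
    static long countThrough(String text, StringBuilder sink) {
        // The wrapped handler just collects whatever text reaches it.
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sink.append(ch, start, length);
            }
        };
        CountingHandler counting = new CountingHandler(collector);
        try {
            char[] chars = text.toCharArray();
            counting.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return counting.getCount();
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        long n = countThrough("a<b & c", sink);
        System.out.println(n + " characters forwarded: " + sink);
    }
}
```

The wrapped handler is untouched, so the same decorator can sit in front of any ContentHandler without subclassing it.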
