Greetings and thanks for your reply.
I'll reply to excerpted fragments.

<<- Please avoid cross-posting between dev@ and u...@. Responding only on 
dev@, as this is mostly related to Tika internals. ->>

Sorry about that.  I will send the rest of the mail on this thread only to 
dev.

<<-
> However it is output as xhtml with very little processing. I think I
> mentioned before that things like '<' should be translated to '&lt;'
> and '&' should become '&amp;'.

Escaping happens only when a SAX event stream is serialized to a character 
or a byte stream. The character SAX events produced by a parser aren't 
supposed to be escaped.
->>
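
To check my understanding of the point above, here is a minimal sketch of 
serialization-time escaping using the JAXP identity transformer. The class 
and method names (SerializeSketch, serialize) are my own and hypothetical, 
not Tika code; it only assumes the default JDK TransformerFactory supports 
SAX, which I believe it does.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch, not Tika code: feeds raw (unescaped) character
// events into a JAXP TransformerHandler and lets the serializer escape them.
final class SerializeSketch {

    static String serialize(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            // With no stylesheet, the handler acts as an identity transform.
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(
                    OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // The SAX events carry the text exactly as it appears in the file.
            char[] chars = text.toCharArray();
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("<a href=\"#here\">hi there</a> & more"));
    }
}
```

If this is right, the '<' and '&' in the character events only become 
'&lt;' and '&amp;' in the serialized output, not in the events themselves.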

Then perhaps I am in the wrong place in the code, or I still do not 
understand everything that SAX is doing.  The translation does need to be 
done somewhere, however, because you are essentially outputting a plain 
text file as if it were XHTML, with only header and footer elements slapped 
on the ends.  This doesn't work: suppose your plain text file is a tutorial 
on HTML and contains sample fragments of HTML code.  If you output the text 
without translating certain characters, the fragments will be interpreted 
as markup rather than shown as text, and will not appear in the document 
that the end user sees.  A short example:
---- example ----
This is how you write a link in html: <a href="#here">hi there</a>
---- end example ----
If you slap an XHTML header and footer onto this plain text and output it 
as XHTML, then the end user will see only the link "hi there" and not the 
HTML source of the link.
To prevent this, all < symbols should be translated to &lt;.  Furthermore, 
since the ampersand & also prefixes character entities, Tika should also 
translate & to &amp;.
I do not know whether it is necessary to convert the > symbols or the #, " 
or = characters.  I believe only the less-than sign and the ampersand are 
essential to translate.
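
The translation I have in mind could be sketched roughly as follows. This 
is only an illustration of the character mapping, with hypothetical names 
(EscapeSketch, escape); it is not a claim about where in Tika such code 
belongs. It also converts '>', which as far as I know is only strictly 
required in the sequence "]]>" but is harmless elsewhere.

```java
// Hypothetical sketch of the character translation described above.
final class EscapeSketch {

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':
                    out.append("&lt;");   // would otherwise open a tag
                    break;
                case '&':
                    out.append("&amp;");  // would otherwise start an entity
                    break;
                case '>':
                    out.append("&gt;");   // optional except in "]]>"
                    break;
                default:
                    out.append(c);        // '#', '"' and '=' pass through
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(
                "This is how you write a link in html: "
                + "<a href=\"#here\">hi there</a>"));
    }
}
```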

Does this answer your question?

<<- > I noticed the header and footer elements you output for the file.
> But this translation, and probably other insertions, need to be
> made to the text within the file itself, not the header/footer.
> Otherwise the rendered xhtml will be wrong.

I'm not sure what you're referring to here. Can you elaborate?
->>

See above.

<<- > I find XHTMLContentHandler code, which calls SafeContentHandler code.
> But I gather these methods have a different purpose than what I am
> looking at. I thought to create a subclass SafeTextContentHandler of
> SafeContentHandler to override the write function and provide the
> necessary replacement strings.

You're talking about entity escaping? There's no need to do this, as the 
functionality is already there in the Transformer part of JAXP.

More generally it's usually better to use a decorator than a subclass when 
you want to customize the behavior of a SAX ContentHandler.
->>

OK, this shows how new I am to this code.  In your view, is this the right 
place to make a change, or am I misunderstanding the purpose of the content 
handler code?
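
If I follow the decorator suggestion quoted above, I imagine it would look 
something like the sketch below. It uses only the standard 
org.xml.sax.helpers.XMLFilterImpl to wrap another ContentHandler and 
forward events to it. The class names and the tab-to-space customization 
are hypothetical, chosen only to show the shape of the pattern.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch of decorating a ContentHandler instead of
// subclassing the handler whose behavior is being customized.
final class DecoratorSketch {

    // Wraps any ContentHandler; all events are forwarded to the delegate,
    // and only characters() is customized (tabs become spaces).
    static final class TabToSpaceHandler extends XMLFilterImpl {
        TabToSpaceHandler(ContentHandler delegate) {
            setContentHandler(delegate);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            char[] copy = new char[length];
            for (int i = 0; i < length; i++) {
                char c = ch[start + i];
                copy[i] = (c == '\t') ? ' ' : c;
            }
            super.characters(copy, 0, length); // forward to the delegate
        }
    }

    // Stand-in for the wrapped handler: collects the text it receives.
    static final class Collector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static String run(String input) {
        try {
            Collector collector = new Collector();
            ContentHandler handler = new TabToSpaceHandler(collector);
            char[] chars = input.toCharArray();
            handler.startDocument();
            handler.characters(chars, 0, chars.length);
            handler.endDocument();
            return collector.text.toString();
        } catch (SAXException e) {
            throw new IllegalStateException("SAX processing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("a\tb < c"));
    }
}
```

Note that the '<' passes through the decorator untouched, which matches the 
statement that character events are not supposed to be escaped.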

I appreciate any comments.
--le
