Hi,
From: qubit
> Then perhaps I am in the wrong place in the code... or I am still not
> understanding all that sax is doing. The translation needs to be done
> however because you are essentially outputting a plain text file as if
> it were xhtml with only header and footer elements slapped on the ends.
Actually we aren't. The relevant part of the code in TXTParser is:
    XHTMLContentHandler xhtml = ...;
    xhtml.startDocument();
    xhtml.startElement("p");
    xhtml.characters(...);
    xhtml.endElement("p");
    xhtml.endDocument();
(The XHTMLContentHandler class hides some of the complexities, but the basic
idea is the same as with a plain ContentHandler instance.)
The startElement() and endElement() calls above are handled differently from
the characters() call. If we were simply outputting text as you assume, we
could rewrite part of the above as:
    xhtml.characters("<p>");
    xhtml.characters(...);
    xhtml.characters("</p>");
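That wouldn't work: a serializing ContentHandler treats everything passed to
characters() as text and escapes it, so the output would contain the literal
text "&lt;p&gt;" instead of a p element. It's the task of the ContentHandler
instance that serializes these SAX events to properly output the start and end
tags triggered by startElement()/endElement() calls, and to correctly escape
the character data given in characters() events.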
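To make the difference concrete, here is a rough sketch of a toy serializing
handler. TinySerializer and its helper methods are hypothetical names made up
for this example; this is not Tika code, and a real serializer does a lot more
(attributes, namespaces, encodings). It only shows that tags come from element
events while everything passed to characters() gets escaped:

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical toy serializer, for illustration only; not part of Tika.
class TinySerializer extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Markup is produced only from element events.
        // (Attributes are ignored to keep the sketch short.)
        out.append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data is always escaped, so it can never turn into markup.
        for (int i = start; i < start + length; i++) {
            switch (ch[i]) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(ch[i]);
            }
        }
    }

    String result() {
        return out.toString();
    }

    private void text(String s) {
        characters(s.toCharArray(), 0, s.length());
    }

    // The paragraph is marked up with start/endElement() events.
    static String withElementEvents(String text) {
        TinySerializer h = new TinySerializer();
        h.startElement("", "p", "p", new AttributesImpl());
        h.text(text);
        h.endElement("", "p", "p");
        return h.result();
    }

    // The "tags" are passed as plain character data instead.
    static String withCharactersOnly(String text) {
        TinySerializer h = new TinySerializer();
        h.text("<p>");
        h.text(text);
        h.text("</p>");
        return h.result();
    }

    public static void main(String[] args) {
        System.out.println(withElementEvents("a < b"));  // <p>a &lt; b</p>
        System.out.println(withCharactersOnly("a < b")); // &lt;p&gt;a &lt; b&lt;/p&gt;
    }
}
```

In the second case the angle brackets end up as visible text in the output,
which is why the translation you describe doesn't belong in the parser.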
See the JAXP documentation for more background on how SAX parsing and
serialization is designed to work.
> Ok, this bespeaks my newness looking at this code. In your view, is
> this the right place to make a change? or am I misunderstanding the
> purpose of the content handler code?
I believe you've slightly misunderstood the code. I'm sorry about not making
the intended design more apparent; we probably should document that part a bit
better.
To better understand how and where escaping actually happens, take a look at
the getTransformerHandler() method in the TikaCLI class (part of the tika-app
component). There we use the Transformer functionality in JAXP to automatically
handle the conversion from abstract SAX events (start/end elements, character
data, etc.) to the corresponding character and byte sequences.
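As a rough sketch of that approach using only standard JAXP classes (the class
and method names below are made up for this example, and this is not a copy of
the actual TikaCLI code), a TransformerHandler can be set up to serialize SAX
events like this:

```java
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical example class; the names are assumptions, not Tika API.
class JaxpSerializerSketch {

    // Creates a ContentHandler that writes incoming SAX events as XML text.
    static TransformerHandler newHandler(Writer out)
            throws TransformerConfigurationException {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        // With no stylesheet this is an identity transform: events in, markup out.
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Sends the same kind of events as the TXTParser snippet, using the
    // plain ContentHandler method signatures.
    static String serializeParagraph(String text)
            throws TransformerConfigurationException, SAXException {
        StringWriter out = new StringWriter();
        TransformerHandler handler = newHandler(out);
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The '<' and '&' in the text should come out as entity references.
        System.out.println(serializeParagraph("a < b & c"));
    }
}
```

The code that emits the events never builds or escapes any markup itself; the
handler decides how the events are written out, so switching the output method
(for example to "html" or "text") needs no change on the event-producing side.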
BR,
Jukka Zitting