RE: tika and plain text -- bug or feature?

Jukka Zitting Wed, 10 Nov 2010 16:22:03 -0800

Hi,

From: qubit
> Then perhaps I am in the wrong place in the code... or I am still not
> understanding all that sax is doing.  The translation needs to be done
> however because you are essentially outputting a plain text file as if
> it were xhtml with only header and footer elements slapped on the ends.


Actually we aren't. The relevant part of the code in TXTParser is:

    XHTMLContentHandler xhtml = ...;
    xhtml.startDocument();

    xhtml.startElement("p");
    xhtml.characters(...);
    xhtml.endElement("p");

    xhtml.endDocument();

(The XHTMLContentHandler class hides some of the complexities, but the basic 
idea is the same as with a plain ContentHandler instance.)

The startElement() and endElement() calls above are handled differently from 
the characters() call. If we were simply outputting text as you assume, we 
could rewrite part of the above as:

    xhtml.characters("<p>");
    xhtml.characters(...);
    xhtml.characters("</p>");

That wouldn't work. The ContentHandler instance that serializes these SAX 
events is responsible for writing out the start and end tags triggered by 
startElement() and endElement() calls, and for escaping the character data 
passed in characters() events. Markup passed through characters() would 
therefore come out escaped (as &lt;p&gt;) rather than as real tags.
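
To make the difference concrete, here is a minimal sketch that uses only the 
plain JAXP classes rather than Tika's XHTMLContentHandler. The class and 
method names (SaxEscapingDemo, withElementEvents, withMarkupAsCharacters) and 
the wrapping "body" element are made up for illustration; they are not part of 
Tika.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class SaxEscapingDemo {

    // Creates a JAXP handler that serializes incoming SAX events as XML.
    static TransformerHandler newSerializer(StringWriter out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // The paragraph is announced with element events; the text is character data.
    static String withElementEvents(String text) throws Exception {
        StringWriter out = new StringWriter();
        TransformerHandler h = newSerializer(out);
        h.startDocument();
        h.startElement("", "p", "p", new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", "p", "p");
        h.endDocument();
        return out.toString();
    }

    // The "tags" are passed as character data, so the serializer escapes them.
    // A hypothetical "body" root element keeps the document well-formed.
    static String withMarkupAsCharacters(String text) throws Exception {
        StringWriter out = new StringWriter();
        TransformerHandler h = newSerializer(out);
        String fake = "<p>" + text + "</p>";
        h.startDocument();
        h.startElement("", "body", "body", new AttributesImpl());
        h.characters(fake.toCharArray(), 0, fake.length());
        h.endElement("", "body", "body");
        h.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(withElementEvents("a < b & c"));
        System.out.println(withMarkupAsCharacters("a < b & c"));
    }
}
```

In the first case the serializer writes a real p element and escapes only the 
text inside it. In the second case the angle brackets of the would-be tags are 
escaped along with everything else, so no p element appears in the output.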

See the JAXP documentation for more background on how SAX parsing and 
serialization is designed to work.

> Ok, this bespeaks my newness looking at this code. In your view, is
> this the right place to make a change? or am I misunderstanding the
> purpose of the content handler code?

I believe you've slightly misunderstood the code. I'm sorry about not making 
the intended design more apparent; we probably should document that part a bit 
better.

To better understand how and where escaping actually happens, take a look at 
the getTransformerHandler() method in the TikaCLI class (part of the tika-app 
component). There we use the Transformer functionality in JAXP to automatically 
handle the conversion from abstract SAX events (start/end elements, character 
data, etc.) to the corresponding character and byte sequences.
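
As a rough illustration of that idea (not the actual TikaCLI code), the sketch 
below builds a JAXP TransformerHandler for a given output method and encoding 
and feeds it the same SAX events twice. The names SerializerSketch, 
serializerFor and render are hypothetical; the JAXP calls themselves are 
standard.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class SerializerSketch {

    // Returns a handler that turns SAX events into bytes on the given stream.
    // The method argument is a standard JAXP output method: "xml", "html" or "text".
    static TransformerHandler serializerFor(
            OutputStream out, String method, String encoding) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, method);
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Sends one paragraph as SAX events and returns whatever was serialized.
    static String render(String method, String text) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        TransformerHandler h = serializerFor(bytes, method, "UTF-8");
        h.startDocument();
        h.startElement("", "p", "p", new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", "p", "p");
        h.endDocument();
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("xml", "a < b & c"));   // tags written, text escaped
        System.out.println(render("text", "a < b & c"));  // character data only
    }
}
```

The events sent by the caller are identical in both runs; only the serializer 
configuration decides whether tags are written and whether the character data 
is escaped.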

BR,

Jukka Zitting
