Hi Jukka,

I am splitting the thread. Thanks to your explanation and some playing with the code, I now understand better how it works: basically it uses a SAX builder, and it then depends on the handler plugged into it whether the XHTML markup is added or not. BodyContentHandler does not add the markup -> plain text; ToXMLContentHandler adds the markup -> XHTML.
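
To make sure I have the mechanism right, here is a minimal sketch using only the JDK SAX classes, not Tika itself. The names HandlerSketch, emitDocument, TextOnlyHandler and MarkupHandler are made up for illustration; they only mimic the roles that the parser, BodyContentHandler and ToXMLContentHandler play, and the markup handler does no escaping.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerSketch {

    /** Hypothetical stand-in for a parser: it always emits the same XHTML-like SAX events. */
    static void emitDocument(ContentHandler handler) throws SAXException {
        Attributes none = new AttributesImpl();
        char[] text = "Hello PDF".toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", none);
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
    }

    /** Keeps character data only and ignores element events, so the result is plain text. */
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Writes element events back out as tags, so the markup survives (no escaping done). */
    static class MarkupHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Same event stream, text-only handler. */
    static String asText() {
        TextOnlyHandler handler = new TextOnlyHandler();
        try {
            emitDocument(handler);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    /** Same event stream, markup-preserving handler. */
    static String asMarkup() {
        MarkupHandler handler = new MarkupHandler();
        try {
            emitDocument(handler);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(asText());   // prints "Hello PDF"
        System.out.println(asMarkup()); // prints "<p>Hello PDF</p>"
    }
}
```

The producer side is identical in both calls; only the handler decides whether the tags reach the output, which is how I read your explanation.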

That being the case, the name PDF2XHTML is misleading, isn't it? Would you be OK with renaming it to PDF2Text (as in text/plain or text/html)? It's a package-private class, so changing the name should not be an issue.

Ste

On Fri, Mar 28, 2014 at 3:42 PM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> On Fri, Mar 28, 2014 at 5:32 AM, Stefano Fornari
> <[email protected]> wrote:
> > On #1 I am still wondering why for indexing we need structure information.
> > Is there any particular reason? Wouldn't it make more sense to get just the
> > text by default and only optionally get the structure?
>
> The trouble is that then each parser would need to have code for
> producing both text and XHTML. Since the overhead of producing XHTML
> instead of just text is pretty low, and since it's very easy for
> clients that only care about the text output to just strip out the
> markup, it made more sense to design the system to always produce
> XHTML.
>
> The same applies for document metadata. All parsers produce as much
> metadata as they can, but most clients will just ignore most or all of
> the returned metadata fields. However, since the overhead of producing
> all the information is lower than that of adding explicit options to
> control which metadata needs to be extracted and returned, it makes
> sense to just let clients filter out those bits that they don't
> care about.
