Hi Jukka,
I am splitting the thread.

Thanks to your explanation and to playing with the code, I now understand
better how it works: basically it drives a SAX content handler, and it is
then up to the handler whether or not to add the XHTML markup.
BodyContentHandler does not add the markup -> plain text;
ToXMLContentHandler adds the markup -> XHTML.
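
To check that I have the mechanism right, here is a minimal sketch that
uses only the JDK SAX classes, not Tika's actual code. The names
emitDocument, TextOnlyHandler and MarkupHandler are hypothetical stand-ins
for the parser, BodyContentHandler and ToXMLContentHandler respectively,
and the markup handler skips escaping and attributes for brevity.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SaxOutputSketch {

    // Stand-in for a parser: it always emits the same XHTML-style SAX events,
    // regardless of which handler receives them.
    static void emitDocument(ContentHandler handler) throws SAXException {
        Attributes none = new AttributesImpl();
        char[] text = "Hello world".toCharArray();
        handler.startDocument();
        handler.startElement("", "p", "p", none);
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
    }

    // Keeps character data only and ignores element events -> plain text.
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    // Writes element events back out as tags -> markup.
    // Simplification: no character escaping and no attributes.
    static class MarkupHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    static String asText() throws SAXException {
        TextOnlyHandler handler = new TextOnlyHandler();
        emitDocument(handler);
        return handler.out.toString();
    }

    static String asMarkup() throws SAXException {
        MarkupHandler handler = new MarkupHandler();
        emitDocument(handler);
        return handler.out.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(asText());   // Hello world
        System.out.println(asMarkup()); // <p>Hello world</p>
    }
}
```

The producer is identical in both calls; only the handler passed in decides
whether the result is text or markup.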


That being the case, the name PDF2XHTML is misleading, isn't it? Would you
be OK with renaming it to PDF2Text (as per text/plain or text/html)? It is a
package-private class, so changing the name should not be an issue.


Ste


On Fri, Mar 28, 2014 at 3:42 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Fri, Mar 28, 2014 at 5:32 AM, Stefano Fornari
> <[email protected]> wrote:
> > On #1 I am still wondering why we need structure information for
> > indexing. Is there any particular reason? Wouldn't it make more sense
> > to get just the text by default and only optionally get the structure?
>
> The trouble is that then each parser would need to have code for
> producing both text and XHTML. Since the overhead of producing XHTML
> instead of just text is pretty low, and since it's very easy for
> clients that only care about the text output to just strip out the
> markup, it made more sense to design the system to always produce
> XHTML.
>
> The same applies for document metadata. All parsers produce as much
> metadata as they can, but most clients will just ignore most or all of
> the returned metadata fields. However, since the overhead of producing
> all the information is lower than that of adding explicit options to
> control which metadata needs to be extracted and returned, it makes
> sense to just let clients filter out those bits that they don't
> care about.
>
>
