Hi Jukka,
Thanks a lot for your reply.

On #1, I am still wondering why we need structure information for
indexing. Is there any particular reason? Wouldn't it make more sense to
get just the text by default, and only optionally get the structure?

On #2, I expected the code you presented would not work. And in fact the
pattern is quite odd, isn't it? What is the reason for throwing the
exception if limiting the text read is a legal use case? (I am asking
just to understand the background.)

Ste


On Thu, Mar 27, 2014 at 11:55 PM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari
> <[email protected]> wrote:
> > 1. is the use of PDF2XHTML necessary? why is the pdf turned into an
> > XHTML? for the purpose of indexing, wouldn't just the text be enough?
>
> The XHTML output allows us to annotate the extracted text with
> structural information (like "this is a heading", "here's a
> hyperlink", etc.) that would be difficult to express with text-only
> output. A client that needs just the text content can get it easily
> with the BodyContentHandler class.
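>
> For example, a minimal text-only extraction could look like this (just
> a sketch; parser and stream are assumed to be set up already):
>
>     // BodyContentHandler keeps only the character content of the
>     // XHTML <body> element, so the result is plain text
>     BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
>     parser.parse(stream, handler, new Metadata(), new ParseContext());
>     String text = handler.toString();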
>
> > 2. I need to limit the index of the content to files whose size is
> > below a certain threshold; I was wondering if this could be a parser
> > configuration option and thus if you would accept this change.
>
> Do you want to entirely exclude too large files, or just index the
> first few pages of such files (which is more common in many indexing
> use cases)?
>
> The latter use case can be implemented with the writeLimit parameter of
> the WriteOutContentHandler class, like this:
>
>     // Extract up to 100k characters from a given document
>     WriteOutContentHandler out = new WriteOutContentHandler(100_000);
>     try {
>         parser.parse(..., new BodyContentHandler(out), ...);
>     } catch (SAXException e) {
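>         // the write limit is enforced by throwing a SAXException from
>         // the handler; isWriteLimitReached() distinguishes that case
>         // from a real parsing error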
>         if (!out.isWriteLimitReached(e)) {
>             throw e;
>         }
>     }
>     String content = out.toString();
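>
> The former case (skipping oversized files entirely) could be handled on
> the caller's side before invoking the parser; for file input a plain
> size check is enough (a sketch, MAX_BYTES is a made-up constant):
>
>     if (file.length() <= MAX_BYTES) {
>         parser.parse(...); // index as usual
>     } // else skip the file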
>
> BR,
>
> Jukka Zitting
>
