Re: PDF parser (two more questions)

Konstantin Gribov Fri, 28 Mar 2014 03:04:29 -0700

The exception is rethrown only if the write limit has not been reached. So if
the exception occurs within the first 100k chars, it affects the result. If it
is thrown after that, it is suppressed.
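
To make the rule concrete, here is a minimal sketch that uses only the JDK.
LimitedSink, LimitReachedException and extract() are hypothetical stand-ins
made up for illustration. They are not Tika classes and are not meant to show
how WriteOutContentHandler is implemented; they only mirror the try/catch
pattern from Jukka's snippet. The sketch assumes the sink itself signals the
limit by throwing, so an error that would come later in the input is never
seen, because processing has already stopped.

```java
import java.io.IOException;

class WriteLimitSketch {

    /** Hypothetical marker exception: thrown only when a sink hits its limit. */
    static class LimitReachedException extends IOException {
        final Object source; // the sink that raised this exception

        LimitReachedException(Object source, int limit) {
            super("write limit of " + limit + " chars reached");
            this.source = source;
        }
    }

    /** Hypothetical sink that keeps at most 'limit' characters. */
    static class LimitedSink {
        private final StringBuilder buffer = new StringBuilder();
        private final int limit;

        LimitedSink(int limit) {
            this.limit = limit;
        }

        void write(String text) throws LimitReachedException {
            int room = limit - buffer.length();
            if (text.length() > room) {
                buffer.append(text, 0, room); // keep what still fits
                throw new LimitReachedException(this, limit);
            }
            buffer.append(text);
        }

        /** True only if 't' is the limit signal raised by this very sink. */
        boolean isWriteLimitReached(Throwable t) {
            return t instanceof LimitReachedException
                    && ((LimitReachedException) t).source == this;
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    /**
     * Feeds chunks into a limited sink. If failAt >= 0, a simulated parse
     * error is raised just before the chunk with that index is written.
     */
    static String extract(String[] chunks, int limit, int failAt) throws IOException {
        LimitedSink out = new LimitedSink(limit);
        try {
            for (int i = 0; i < chunks.length; i++) {
                if (i == failAt) {
                    throw new IOException("simulated parse error at chunk " + i);
                }
                out.write(chunks[i]);
            }
        } catch (IOException e) {
            if (!out.isWriteLimitReached(e)) {
                throw e; // a real failure before the limit: propagate it
            }
            // limit reached: stop quietly and keep the collected text
        }
        return out.toString();
    }
}
```

With chunks "abcd", "efgh", "ijkl" and a limit of 6, a simulated error at
chunk 2 never surfaces and extract() returns the truncated text. The same
error at chunk 0 happens before the limit and is rethrown to the caller.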


-- 
Best regards,
Konstantin Gribov.
On 28.03.2014 at 13:32, "Stefano Fornari" <[email protected]> wrote:

> Hi Jukka,
> thanks a lot for your reply.
>
> On #1 I am still wondering why we need structure information for indexing.
> Is there any particular reason? Wouldn't it make more sense to get just the
> text by default and only optionally get the structure?
>
> On #2, I expected the code you presented would not work. And in fact the
> pattern is quite odd, isn't it? What is the reason for throwing an
> exception if limiting the text read is a legal use case? (I am asking just
> to understand the background).
>
> Ste
>
>
> On Thu, Mar 27, 2014 at 11:55 PM, Jukka Zitting <[email protected]> wrote:
>
> > Hi,
> >
> > On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari
> > <[email protected]> wrote:
> > > 1. Is the use of PDF2XHTML necessary? Why is the PDF turned into XHTML?
> > > For the purpose of indexing, wouldn't just the text be enough?
> >
> > The XHTML output allows us to annotate the extracted text with
> > structural information (like "this is a heading", "here's a
> > hyperlink", etc.) that would be difficult to express with text-only
> > output. A client that needs just the text content can get it easily
> > with the BodyContentHandler class.
> >
> > > 2. I need to limit the indexing of content to files whose size is below
> > > a certain threshold; I was wondering if this could be a parser
> > > configuration option and, if so, whether you would accept this change.
> >
> > Do you want to entirely exclude files that are too large, or just index
> > the first few pages of such files (which is more common in many indexing
> > use cases)?
> >
> > The latter use case can be implemented with the writeLimit parameter of
> > the WriteOutContentHandler class, like this:
> >
> >     // Extract up to 100k characters from a given document
> >     WriteOutContentHandler out = new WriteOutContentHandler(100_000);
> >     try {
> >         parser.parse(..., new BodyContentHandler(out), ...);
> >     } catch (SAXException e) {
> >         if (!out.isWriteLimitReached(e)) {
> >             throw e;
> >         }
> >     }
> >     String content = out.toString();
> >
> > BR,
> >
> > Jukka Zitting
> >
>
