Re: PDF parser (two more questions)

Jukka Zitting Thu, 27 Mar 2014 15:56:31 -0700

Hi,

On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari wrote:
> 1. is the use of PDF2XHTML necessary? why is the pdf turned into an XHTML?
> for the purpose of indexing, wouldn't just the text be enough?


The XHTML output allows us to annotate the extracted text with
structural information (like "this is a heading", "here's a
hyperlink", etc.) that would be difficult to express with text-only
output. A client that needs just the text content can get it easily
with the BodyContentHandler class.

> 2. I need to limit the index of the content to files whose size is below to
> a certain threshold; I was wondering if this could be a parser
> configuration option and thus if you would accept this change.

Do you want to entirely exclude too large files, or just index the
first few pages of such files (which is more common in many indexing
use cases)?

The latter use case can be implemented with the writeLimit parameter of
the WriteOutContentHandler class, like this:

    // Extract up to 100k characters from a given document
    WriteOutContentHandler out = new WriteOutContentHandler(100_000);
    try {
        parser.parse(..., new BodyContentHandler(out), ...);
    } catch (SAXException e) {
        if (!out.isWriteLimitReached(e)) {
            throw e;
        }
    }
    String content = out.toString();
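
If you instead want to exclude oversized files entirely, a size check
before the parser is invoked may be all that is needed. Here is a minimal
sketch, assuming the documents are regular files on disk. SizeFilter and
isWithinLimit are hypothetical names used for illustration and are not
part of Tika:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a Tika class): decides whether a file is
// small enough to be handed to the parser at all.
final class SizeFilter {
    private SizeFilter() {}

    // Returns true when the file's size in bytes is at or below maxBytes.
    static boolean isWithinLimit(Path file, long maxBytes) throws IOException {
        return Files.size(file) <= maxBytes;
    }
}
```

The caller would simply skip the parse call for any file where
isWithinLimit returns false. Note that this limits by bytes on disk,
whereas writeLimit above counts extracted characters.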

BR,

Jukka Zitting
