[PDFParser] - read limited number of characters

Stefano Fornari Sat, 29 Mar 2014 07:32:33 -0700

Hi Jukka, given we agree the pattern is not very nice, would you be ok to
hide it to client classes? I digged a bit more in the code and I found all
we need was already there. This is what I would propose:


1 promote WriteLimitReachedException to a public class
2 move the awkward trick into PDFParser as follows:

    metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
    extractMetadata(pdfDocument, metadata);
    try {
         PDF2XText.process(pdfDocument, handler, context, metadata,
localConfig);
    } catch (WriteLimitReachedException x) {
          //
          // This is a valid condition; just ignoring the exception
          //
    }

In this way the only think client classes should do is to use a limiting
BodyContentHandler:

@Test
    public void testLimitTextToParse() throws Exception {
        ContentHandler handler = new BodyContentHandler();

        new PDFParser().parse(
            getResourceAsStream("/test-documents/testPDF.pdf"),
            handler,
            new Metadata(),
            new ParseContext()
        );

        assertEquals(1067, handler.toString().length());

        handler = new BodyContentHandler(500);

        new PDFParser().parse(
            getResourceAsStream("/test-documents/testPDF.pdf"),
            handler,
            new Metadata(),
            new ParseContext()
        );

        assertEquals(500, handler.toString().length());
    }


One additional thing I would do is to change WriteOutContentHandler as per
the below:

/**
     * Writes the given characters to the given character stream.
     */
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (writeLimit == -1 || writeCount + length <= writeLimit) {
            super.characters(ch, start, length);
            writeCount += length;
        } else {
            super.characters(ch, start, writeLimit - writeCount);
            writeCount = writeLimit;
            writeLimitReached = true;
            throw new WriteLimitReachedException(
                    "Your document contained more than " + writeLimit
                    + " characters, and so your requested limit has been"
                    + " reached. To receive the full text of the document,"
                    + " increase your limit. (Text up to the limit is"
                    + " however available).", tag);
        }
    }

    /**
     * Checks whether the given exception (or any of it's root causes) was
     * thrown by this handler as a signal of reaching the write limit.
     *
     * @since Apache Tika 0.7
     * @param t throwable
     * @return <code>true</code> if the write limit was reached,
     *         <code>false</code> otherwise
     *
     * Deprecated in Tika 1.6, use isWriteLimitReached(); the current
     * implementation ignores the given Throwable and is equivalent to
     * isWriteLimitReached()
     *
     */
    @Deprecated
    public boolean isWriteLimitReached(Throwable t) {
        return isWriteLimitReached();
    }

    /**
     * Returns true if the limit has been reached, false otherwise.
     *
     * @since Apache Tika 1.6
     * @return <code>true</code> if the write limit was reached,
     *         <code>false</code> otherwise
     */
    public boolean isWriteLimitReached() {
        return writeLimitReached;
    }


If you are ok with the changes for #1 and #2 I will be happy to provide a
patch.

Ste



> > On #2, I expected the code you presented would not work. And in fact the
> > pattern is quite odd, isn't it? What is the reason of throwing the
> > exception if limiting the text read is a legal use case? (I am asking
> just
> > to understand the background).
>
> Yes, the pattern is a bit awkward and generally shouldn't be
> recommended as it uses an exception to control the flow of the
> program. However, in this case we considered it worth doing as the
> alternative would have been far more complicated.
>
> Basically we wanted to avoid having to modify each parser
> implementation (even those implemented outside Tika...) to keep track
> of how much content has already been extracted and instead do that
> just once in the WriteOutContentHandler class. However, the only way
> for the WriteOutContentHandler to signal that parsing should be
> stopped is by throwing a SAXException, which is what we're doing here.
> By catching the exception and inspecting it with isWriteLimitReached()
> the client can determine whether this is what happened.
>
> BR,
>
> Jukka Zitting
>

[PDFParser] - read limited number of characters

Reply via email to