Hi Jukka, given we agree the pattern is not very nice, would you be ok to
hide it to client classes? I digged a bit more in the code and I found all
we need was already there. This is what I would propose:
1 promote WriteLimitReachedException to a public class
2 move the awkward trick into PDFParser as follows:
metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
extractMetadata(pdfDocument, metadata);
try {
PDF2XText.process(pdfDocument, handler, context, metadata,
localConfig);
} catch (WriteLimitReachedException x) {
//
// This is a valid condition; just ignoring the exception
//
}
In this way the only think client classes should do is to use a limiting
BodyContentHandler:
@Test
public void testLimitTextToParse() throws Exception {
ContentHandler handler = new BodyContentHandler();
new PDFParser().parse(
getResourceAsStream("/test-documents/testPDF.pdf"),
handler,
new Metadata(),
new ParseContext()
);
assertEquals(1067, handler.toString().length());
handler = new BodyContentHandler(500);
new PDFParser().parse(
getResourceAsStream("/test-documents/testPDF.pdf"),
handler,
new Metadata(),
new ParseContext()
);
assertEquals(500, handler.toString().length());
}
One additional thing I would do is to change WriteOutContentHandler as per
the below:
/**
* Writes the given characters to the given character stream.
*/
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
if (writeLimit == -1 || writeCount + length <= writeLimit) {
super.characters(ch, start, length);
writeCount += length;
} else {
super.characters(ch, start, writeLimit - writeCount);
writeCount = writeLimit;
writeLimitReached = true;
throw new WriteLimitReachedException(
"Your document contained more than " + writeLimit
+ " characters, and so your requested limit has been"
+ " reached. To receive the full text of the document,"
+ " increase your limit. (Text up to the limit is"
+ " however available).", tag);
}
}
/**
* Checks whether the given exception (or any of it's root causes) was
* thrown by this handler as a signal of reaching the write limit.
*
* @since Apache Tika 0.7
* @param t throwable
* @return <code>true</code> if the write limit was reached,
* <code>false</code> otherwise
*
* Deprecated in Tika 1.6, use isWriteLimitReached(); the current
* implementation ignores the given Throwable and is equivalent to
* isWriteLimitReached()
*
*/
@Deprecated
public boolean isWriteLimitReached(Throwable t) {
return isWriteLimitReached();
}
/**
* Returns true if the limit has been reached, false otherwise.
*
* @since Apache Tika 1.6
* @return <code>true</code> if the write limit was reached,
* <code>false</code> otherwise
*/
public boolean isWriteLimitReached() {
return writeLimitReached;
}
If you are ok with the changes for #1 and #2 I will be happy to provide a
patch.
Ste
> > On #2, I expected the code you presented would not work. And in fact the
> > pattern is quite odd, isn't it? What is the reason of throwing the
> > exception if limiting the text read is a legal use case? (I am asking
> just
> > to understand the background).
>
> Yes, the pattern is a bit awkward and generally shouldn't be
> recommended as it uses an exception to control the flow of the
> program. However, in this case we considered it worth doing as the
> alternative would have been far more complicated.
>
> Basically we wanted to avoid having to modify each parser
> implementation (even those implemented outside Tika...) to keep track
> of how much content has already been extracted and instead do that
> just once in the WriteOutContentHandler class. However, the only way
> for the WriteOutContentHandler to signal that parsing should be
> stopped is by throwing a SAXException, which is what we're doing here.
> By catching the exception and inspecting it with isWriteLimitReached()
> the client can determine whether this is what happened.
>
> BR,
>
> Jukka Zitting
>