Hi Jukka, any feedbacks? Ste
On Sat, Mar 29, 2014 at 3:31 PM, Stefano Fornari <[email protected]>wrote: > Hi Jukka, given we agree the pattern is not very nice, would you be ok to > hide it to client classes? I digged a bit more in the code and I found all > we need was already there. This is what I would propose: > > 1 promote WriteLimitReachedException to a public class > 2 move the awkward trick into PDFParser as follows: > > metadata.set(Metadata.CONTENT_TYPE, "application/pdf"); > extractMetadata(pdfDocument, metadata); > try { > PDF2XText.process(pdfDocument, handler, context, metadata, > localConfig); > } catch (WriteLimitReachedException x) { > // > // This is a valid condition; just ignoring the exception > // > } > > In this way the only think client classes should do is to use a limiting > BodyContentHandler: > > @Test > public void testLimitTextToParse() throws Exception { > ContentHandler handler = new BodyContentHandler(); > > new PDFParser().parse( > getResourceAsStream("/test-documents/testPDF.pdf"), > handler, > new Metadata(), > new ParseContext() > ); > > assertEquals(1067, handler.toString().length()); > > handler = new BodyContentHandler(500); > > new PDFParser().parse( > getResourceAsStream("/test-documents/testPDF.pdf"), > handler, > new Metadata(), > new ParseContext() > ); > > assertEquals(500, handler.toString().length()); > } > > > One additional thing I would do is to change WriteOutContentHandler as per > the below: > > /** > * Writes the given characters to the given character stream. > */ > @Override > public void characters(char[] ch, int start, int length) > throws SAXException { > if (writeLimit == -1 || writeCount + length <= writeLimit) { > super.characters(ch, start, length); > writeCount += length; > } else { > super.characters(ch, start, writeLimit - writeCount); > writeCount = writeLimit; > writeLimitReached = true; > throw new WriteLimitReachedException( > "Your document contained more than " + writeLimit > + " characters, and so your requested limit has been" > + " reached. To receive the full text of the document," > + " increase your limit. (Text up to the limit is" > + " however available).", tag); > } > } > > /** > * Checks whether the given exception (or any of it's root causes) was > * thrown by this handler as a signal of reaching the write limit. > * > * @since Apache Tika 0.7 > * @param t throwable > * @return <code>true</code> if the write limit was reached, > * <code>false</code> otherwise > * > * Deprecated in Tika 1.6, use isWriteLimitReached(); the current > * implementation ignores the given Throwable and is equivalent to > * isWriteLimitReached() > * > */ > @Deprecated > public boolean isWriteLimitReached(Throwable t) { > return isWriteLimitReached(); > } > > /** > * Returns true if the limit has been reached, false otherwise. > * > * @since Apache Tika 1.6 > * @return <code>true</code> if the write limit was reached, > * <code>false</code> otherwise > */ > public boolean isWriteLimitReached() { > return writeLimitReached; > } > > > If you are ok with the changes for #1 and #2 I will be happy to provide a > patch. > > Ste > > > >> > On #2, I expected the code you presented would not work. And in fact the >> > pattern is quite odd, isn't it? What is the reason of throwing the >> > exception if limiting the text read is a legal use case? (I am asking >> just >> > to understand the background). >> >> Yes, the pattern is a bit awkward and generally shouldn't be >> recommended as it uses an exception to control the flow of the >> program. However, in this case we considered it worth doing as the >> alternative would have been far more complicated. >> >> Basically we wanted to avoid having to modify each parser >> implementation (even those implemented outside Tika...) to keep track >> of how much content has already been extracted and instead do that >> just once in the WriteOutContentHandler class. However, the only way >> for the WriteOutContentHandler to signal that parsing should be >> stopped is by throwing a SAXException, which is what we're doing here. >> By catching the exception and inspecting it with isWriteLimitReached() >> the client can determine whether this is what happened. >> >> BR, >> >> Jukka Zitting >> > >
