[ https://issues.apache.org/jira/browse/TIKA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729706#comment-16729706 ]
Caleb Ott commented on TIKA-2787:
---------------------------------

[~dgoldenberg123] I agree with what you are saying here. I have had similar issues, noted in this ticket: https://issues.apache.org/jira/browse/TIKA-2627.

A slightly nicer workaround is to use the "isWriteLimitReached" method on WriteOutContentHandler. See the updated code.
{code:java}
WriteOutContentHandler writer = new WriteOutContentHandler(limit); // <-- e.g. set to 1000000
ContentHandler handler = new BodyContentHandler(writer);
try {
    parser.parse(dataStream, handler, metadata, parseCtx);
} catch (Exception ex) {
    // Write limit exception could be wrapped in a TikaException
    if (!writer.isWriteLimitReached(ex)) {
        throw ex;
    } else {
        log.warn("TE limit reached on file {}.", filePath);
    }
}
// Keep the extracted text regardless of WriteLimitReachedException
String text = handler.toString();
{code}

> Make WriteLimitReachedException public and not subclass of SAXException
> -----------------------------------------------------------------------
>
>                 Key: TIKA-2787
>                 URL: https://issues.apache.org/jira/browse/TIKA-2787
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.19.1
>            Reporter: Dmitry Goldenberg
>            Priority: Major
>
> The idea behind being able to set a limit on text extraction is to be able
> to get up to N characters extracted back. We just got tripped up by the fact
> that Tika throws an exception once the limit has been reached.
> This, in and of itself, is not a major hindrance, especially since the error
> message itself clearly states that the extracted text is, "however,
> available".
> OK, but why is WriteLimitReachedException private? Why not public, so it can
> be explicitly caught when the parse() method is called? And why not add it to
> the signature of the parse method? I don't think it should extend
> SAXException, either; just cleanly throw it as is.
> Right now, our code makes this cumbersome adjustment around the condition:
> {code:java}
> ContentHandler handler = new BodyContentHandler(limit); // <-- e.g. set to 1000000
> try {
>     parser.parse(dataStream, handler, metadata, parseCtx);
> } catch (IOException | TikaException ex) {
>     throw ex;
> } catch (SAXException ex) {
>     String message = (ex.getMessage() == null) ? "" : ex.getMessage();
>     if (!message.contains("Your document contained more than")) {
>         throw new TikaException("Tika error has occurred.", ex);
>     } else {
>         log.warn("TE limit reached on file {}.", filePath);
>     }
> }
> // Keep the extracted text regardless of WriteLimitReachedException
> String text = handler.toString();
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
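
A side note on why checking the cause chain is more robust than matching the message text: the limit exception may arrive wrapped in another exception, so the check has to walk getCause() until it finds the marker. The sketch below illustrates that idea using only the JDK's org.xml.sax classes, so it runs without Tika on the classpath. It is not Tika's implementation: LimitReachedException, LimitedTextHandler, isLimitReached and extract are hypothetical names made up for this illustration, and the RuntimeException wrapper merely stands in for a parser that wraps the handler's exception.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    /** Hypothetical marker exception; a stand-in for Tika's private WriteLimitReachedException. */
    static final class LimitReachedException extends SAXException {
        LimitReachedException(int limit) {
            super("Write limit of " + limit + " characters reached");
        }
    }

    /** Hypothetical handler that keeps at most `limit` characters, then signals the limit. */
    static final class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                text.append(ch, start, room); // keep the part that still fits
                throw new LimitReachedException(limit);
            }
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Walks the cause chain, so the marker is found even when it has been wrapped. */
    static boolean isLimitReached(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Feeds the chunks to a handler and returns the text kept, truncated or not. */
    static String extract(String[] chunks, int limit) {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            for (String chunk : chunks) {
                try {
                    handler.characters(chunk.toCharArray(), 0, chunk.length());
                } catch (SAXException e) {
                    // Simulate a parser that wraps whatever the handler throws.
                    throw new RuntimeException("parse failed", e);
                }
            }
        } catch (RuntimeException e) {
            if (!isLimitReached(e)) {
                throw e; // unrelated failures still propagate
            }
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(new String[] {"hello ", "world"}, 8)); // prints "hello wo"
    }
}
```

With a check like this, the caller never has to inspect the exception message, and the truncated text stays available from the handler after the limit is hit.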