[
https://issues.apache.org/jira/browse/TIKA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729706#comment-16729706
]
Caleb Ott commented on TIKA-2787:
---------------------------------
[~dgoldenberg123] I agree with what you are saying here. I ran into similar
issues, which are noted in this ticket:
https://issues.apache.org/jira/browse/TIKA-2627
A slightly nicer workaround is to use the "isWriteLimitReached" method on
WriteOutContentHandler instead of matching on the exception message. See the
updated code.
{code:java}
WriteOutContentHandler writer = new WriteOutContentHandler(limit); // <-- e.g. set to 1000000
ContentHandler handler = new BodyContentHandler(writer);
try {
    parser.parse(dataStream, handler, metadata, parseCtx);
} catch (Exception ex) {
    // Write limit exception could be wrapped in a TikaException
    if (!writer.isWriteLimitReached(ex)) {
        throw ex;
    } else {
        log.warn("TE limit reached on file {}.", filePath);
    }
}
// Keep the extracted text regardless of WriteLimitReachedException
String text = handler.toString();
{code}
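For anyone reading along: the reason a helper like this is handy is that the limit exception may arrive wrapped in another exception, so a plain instanceof check on the caught exception is not enough. Below is a minimal, Tika-free sketch of that idea. All names in it (WriteLimitCheck, LimitReachedException, isLimitReached) are hypothetical stand-ins for illustration, and this is not a copy of Tika's implementation.

```java
import java.io.IOException;

final class WriteLimitCheck {

    /** Hypothetical stand-in for a library's non-public "limit reached" exception. */
    static class LimitReachedException extends Exception {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /**
     * Returns true if the given throwable, or any throwable in its cause
     * chain, is a LimitReachedException. Returns false for null.
     */
    static boolean isLimitReached(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The limit exception is wrapped, as it might be by a parser.
        Exception wrapped = new RuntimeException(
                "parse failed", new LimitReachedException("write limit reached"));
        // An unrelated failure that should still be rethrown by the caller.
        Exception unrelated = new IOException("stream closed");

        System.out.println(isLimitReached(wrapped));
        System.out.println(isLimitReached(unrelated));
    }
}
```

With a check like this, the catch block can stay generic (catch Exception) and only swallow the failure when the limit marker is found somewhere in the chain.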
> Make WriteLimitReachedException public and not subclass of SAXException
> -----------------------------------------------------------------------
>
> Key: TIKA-2787
> URL: https://issues.apache.org/jira/browse/TIKA-2787
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.19.1
> Reporter: Dmitry Goldenberg
> Priority: Major
>
> The idea behind being able to set a limit on text extraction is to be able to
> get up to N characters extracted back. We just got tripped up by the fact
> that Tika throws an exception once the limit has been reached.
> This, in and of itself, is not a major hindrance especially since the error
> message itself clearly states that the extracted text is, "however,
> available".
> OK, but why is WriteLimitReachedException private? Why not make it public so
> it can be explicitly caught when the parse() method is called? And why not add
> it to the signature of the parse method? I don't think it should extend
> SAXException, either; just cleanly throw it as is.
> Right now, our code makes this cumbersome adjustment around the condition:
> {code:java}
> ContentHandler handler = new BodyContentHandler(limit); // <-- e.g. set to 1000000
> try {
>     parser.parse(dataStream, handler, metadata, parseCtx);
> } catch (IOException | TikaException ex) {
>     throw ex;
> } catch (SAXException ex) {
>     String message = (ex.getMessage() == null) ? "" : ex.getMessage();
>     if (!message.contains("Your document contained more than")) {
>         throw new TikaException("Tika error has occurred.", ex);
>     } else {
>         log.warn("TE limit reached on file {}.", filePath);
>     }
> }
> // Keep the extracted text regardless of WriteLimitReachedException
> String text = handler.toString();
> {code}
>