[ 
https://issues.apache.org/jira/browse/TIKA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729706#comment-16729706
 ] 

Caleb Ott commented on TIKA-2787:
---------------------------------

[~dgoldenberg123] I agree with what you are saying here. I have run into 
similar issues, which are noted in this ticket: 
https://issues.apache.org/jira/browse/TIKA-2627.

A slightly nicer workaround is to use the isWriteLimitReached method on 
WriteOutContentHandler. See the updated code below:
{code:java}
WriteOutContentHandler writer = new WriteOutContentHandler(limit); // <-- e.g. set to 1000000
ContentHandler handler = new BodyContentHandler(writer);
try {
    parser.parse(dataStream, handler, metadata, parseCtx);
} catch (Exception ex) {
    // Write limit exception could be wrapped in a TikaException
    if (!writer.isWriteLimitReached(ex)) {
        throw ex;
    } else {
        log.warn("TE limit reached on file {}.", filePath);
    }
}

// Keep the extracted text regardless of WriteLimitReachedException
String text = handler.toString();

{code}
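
The check in the catch block matters because the limit exception may arrive 
wrapped in another exception, so testing only the top-level exception type is 
not enough. Below is a minimal, self-contained sketch of that idea. The names 
(WriteLimitSketch, LimitReachedException, isLimitReached) are hypothetical 
stand-ins for illustration only; they are not Tika classes, and this does not 
claim to mirror how Tika implements isWriteLimitReached.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration; these are NOT Tika classes.
class WriteLimitSketch {

    /** Marker exception playing the role of a private write-limit exception. */
    static class LimitReachedException extends Exception {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /**
     * Returns true if the given throwable, or any exception in its cause
     * chain, is the marker exception.
     */
    static boolean isLimitReached(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 100; depth++) {
            if (t instanceof LimitReachedException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        // The marker is hidden one level down, as it would be when wrapped.
        Exception wrapped = new IOException("parse failed",
                new LimitReachedException("limit hit"));
        Exception unrelated = new IOException("disk error");

        System.out.println(isLimitReached(wrapped));   // true
        System.out.println(isLimitReached(unrelated)); // false
    }
}
```

With a helper of this shape, the caller can swallow only the limit condition 
and rethrow everything else, which is the pattern the workaround above relies 
on.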

> Make WriteLimitReachedException public and not subclass of SAXException
> -----------------------------------------------------------------------
>
>                 Key: TIKA-2787
>                 URL: https://issues.apache.org/jira/browse/TIKA-2787
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.19.1
>            Reporter: Dmitry Goldenberg
>            Priority: Major
>
> The idea behind being able to set a limit on text extraction is to be able to 
> get up to N characters extracted back. We just got tripped up by the fact 
> that Tika throws an exception once the limit has been reached.
> This, in and of itself, is not a major hindrance especially since the error 
> message itself clearly states that the extracted text is, "however, 
> available".
> OK, but why is WriteLimitReachedException private? Why not public so it can 
> be explicitly caught when the parse() method is called? And why not add it to 
> the signature of the parse method? I don't think it should extend 
> SAXException, either; just cleanly throw it as is.
> Right now, our code makes this cumbersome adjustment around the condition:
> {code:java}
> ContentHandler handler = new BodyContentHandler(limit); // <-- e.g. set to 1000000
> try {
>     parser.parse(dataStream, handler, metadata, parseCtx);
> } catch (IOException | TikaException ex) {
>     throw ex;
> } catch (SAXException ex) {
>     String message = (ex.getMessage() == null) ? "" : ex.getMessage();
>     if (!message.contains("Your document contained more than")) {
>         throw new TikaException("Tika error has occurred.", ex);
>     } else {
>         log.warn("TE limit reached on file {}.", filePath);
>     }
> }
> // Keep the extracted text regardless of WriteLimitReachedException
> String text = handler.toString();
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
