[ 
https://issues.apache.org/jira/browse/TIKA-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354470#comment-16354470
 ] 

Tim Allison edited comment on TIKA-2395 at 2/6/18 9:00 PM:
-----------------------------------------------------------

What's happening here is that the detectors read the full stream (when the 
file is short enough), which triggers {{AutoCloseInputStream}} to close.  Once 
it closes, its underlying stream is replaced by a {{ClosedInputStream}}, which 
can no longer be reset.

Or, in other words, we check that the stream supports {{mark}}, and because it 
does, we don't wrap it in a {{BufferedInputStream}}.  We then read the entire 
stream, at which point the underlying stream is swapped for a 
{{ClosedInputStream}}, which doesn't support {{mark}}, so the later {{reset}} 
fails.

We used to wrap the stream in a {{BufferedInputStream}} based on its class, not 
on whether the class claimed to support {{mark}} (a claim that stops holding 
once you read the whole stream! :D).
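
A minimal sketch of the failure mode described above, using only {{java.io}}. 
{{SelfClosingStream}} is a hypothetical stand-in for {{AutoCloseInputStream}}, 
and {{wrapIfNeeded}}, {{alwaysWrap}}, {{peek}} and {{canPeek}} are illustrative 
names, not Tika's actual code:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class MarkResetSketch {

    /** Hypothetical stand-in for a stream that closes itself once EOF is reached. */
    static class SelfClosingStream extends FilterInputStream {
        SelfClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b == -1) {
                close();
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n == -1) {
                close();
            }
            return n;
        }

        @Override
        public void close() throws IOException {
            in.close();
            // Swap in a dead stream: markSupported() is now false and reset() throws.
            in = new InputStream() {
                @Override
                public int read() {
                    return -1;
                }
            };
        }
    }

    /** Capability-based check: trusts markSupported() at wrap time. */
    static InputStream wrapIfNeeded(InputStream stream) {
        return stream.markSupported() ? stream : new BufferedInputStream(stream);
    }

    /** Unconditional wrapping: the buffer, not the delegate, remembers the mark. */
    static InputStream alwaysWrap(InputStream stream) {
        return new BufferedInputStream(stream);
    }

    /** Reads up to n bytes (as a detector might), then rewinds to the mark. */
    static int peek(InputStream stream, int n) throws IOException {
        stream.mark(n);
        byte[] buf = new byte[n];
        int total = 0;
        int read;
        while (total < n && (read = stream.read(buf, total, n - total)) != -1) {
            total += read;
        }
        stream.reset();
        return total;
    }

    /** Returns true if mark/read/reset completed without an IOException. */
    static boolean canPeek(InputStream stream, int n) {
        try {
            peek(stream, n);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5};
        // Short stream, capability check: EOF is hit before reset().
        InputStream unwrapped = wrapIfNeeded(new SelfClosingStream(new ByteArrayInputStream(data)));
        System.out.println("wrapIfNeeded: " + canPeek(unwrapped, 16));
        // Same stream, always buffered.
        InputStream buffered = alwaysWrap(new SelfClosingStream(new ByteArrayInputStream(data)));
        System.out.println("alwaysWrap:   " + canPeek(buffered, 16));
    }
}
```

In this sketch the capability check passes at wrap time, so the short stream 
is left unwrapped and {{reset}} fails after EOF; with unconditional buffering 
the rewind happens inside the {{BufferedInputStream}} and never reaches the 
closed delegate.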

Recommendations?  

 



> The parser does not support InputStream without built in mark/reset support 
> anymore
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-2395
>                 URL: https://issues.apache.org/jira/browse/TIKA-2395
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.15
>            Reporter: Thomas Mortagne
>            Priority: Blocker
>
> After upgrading to 1.15 (from 1.14), it seems that the detector no longer 
> properly supports all kinds of InputStream the way it used to.
> I get tons of:
> {noformat}
> org.apache.tika.io.TaggedIOException: mark/reset not supported
>       at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>       at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:170)
>       at org.apache.tika.io.TikaInputStream.reset(TikaInputStream.java:673)
>       at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:474)
>       at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:115)
>       at org.apache.tika.Tika.parseToString(Tika.java:527)
>       at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getContentAsText(AbstractSolrMetadataExtractor.java:509)
>       at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setLocaleAndContentFields(AttachmentSolrMetadataExtractor.java:111)
>       at org.xwiki.search.solr.internal.metadata.AttachmentSolrMetadataExtractor.setFieldsInternal(AttachmentSolrMetadataExtractor.java:93)
>       at org.xwiki.search.solr.internal.metadata.AbstractSolrMetadataExtractor.getSolrDocument(AbstractSolrMetadataExtractor.java:133)
>       at org.xwiki.search.solr.internal.DefaultSolrIndexer.getSolrDocument(DefaultSolrIndexer.java:504)
>       at org.xwiki.search.solr.internal.DefaultSolrIndexer.processBatch(DefaultSolrIndexer.java:411)
>       at org.xwiki.search.solr.internal.DefaultSolrIndexer.run(DefaultSolrIndexer.java:377)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: mark/reset not supported
>       at java.io.InputStream.reset(InputStream.java:348)
>       at org.apache.commons.io.input.ProxyInputStream.reset(ProxyInputStream.java:169)
>       at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
>       ... 13 common frames omitted
> {noformat}
> This regression makes Tika unusable for us.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
