[ 
https://issues.apache.org/jira/browse/TIKA-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811245#comment-13811245
 ] 

Jukka Zitting commented on TIKA-1190:
-------------------------------------

bq. We need to buffer it if it's a stream, otherwise the data won't be there 
for the parser!

The parser can still buffer the stream to a temporary file even if the detector 
doesn't do that. The only limitation for the parsing use case would be that the 
AutoDetectParser might not be able to directly dispatch the document to the 
correct parser, but it should be possible to work around that by doing the more 
detailed type detection in PackageParser and re-dispatching the parsing of the 
document if a more specific container format is detected.

bq. Isn't the right fix for people to just skip that Detector if they don't 
want the whole file used?

Doing so would also drop the advanced header-based type detection done by 
commons-compress. That detection code doesn't need the whole file, but it is 
too complex to express in the MIME magic database.

bq. I worry that people will get very confused if some kinds of TikaInputStream 
do correct detection, and others don't

We already have the case that some kinds of InputStreams do correct detection 
and others don't, and that seems to work just fine. Instead of telling people 
that just passing a TikaInputStream will give them advanced detection, it's IMHO 
better to explain that advanced type detection is possible when the document is 
available as a random-access file wrapped in a TikaInputStream. And it would 
still be possible for people to force the spooling (and thus enable the detailed 
zip detection) by calling TikaInputStream.getFile() before passing the stream 
to a detector.
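
A minimal sketch of the behaviour described above, under stated assumptions: 
SpoolableStream and ContainerDetectorSketch are hypothetical stand-ins written 
for this illustration, not Tika classes. They only model the hasFile() / 
getFile() contract, and the "ooxml-like" label and the [Content_Types].xml 
check are placeholders for the real, more detailed container detection.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Hypothetical stand-in for a stream that may or may not have a backing file. */
final class SpoolableStream {
    private final BufferedInputStream in;
    private Path file; // null until the stream has been spooled

    SpoolableStream(InputStream in) { this.in = new BufferedInputStream(in); }

    boolean hasFile() { return file != null; }

    /** Spools the whole stream to a temporary file on first call (the costly step). */
    Path getFile() {
        if (file == null) {
            try {
                Path tmp = Files.createTempFile("spool-", ".tmp");
                tmp.toFile().deleteOnExit();
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                file = tmp;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return file;
    }

    /** Returns up to n leading bytes without consuming the stream. */
    byte[] peek(int n) {
        try {
            if (file != null) {
                try (InputStream f = Files.newInputStream(file)) { return f.readNBytes(n); }
            }
            in.mark(n);
            byte[] head = in.readNBytes(n);
            in.reset();
            return head;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Hypothetical detector: detailed detection only if a file is already there. */
final class ContainerDetectorSketch {
    static String detect(SpoolableStream s) {
        byte[] h = s.peek(4);
        boolean zip = h.length == 4 && h[0] == 'P' && h[1] == 'K' && h[2] == 3 && h[3] == 4;
        if (!zip) return "application/octet-stream";
        // Guard with hasFile(): never spool as a side effect of detection.
        if (!s.hasFile()) return "application/zip";
        return detectFromFile(s.getFile());
    }

    /** Placeholder for "full file" detection; the label is made up for this sketch. */
    static String detectFromFile(Path p) {
        try (ZipFile zf = new ZipFile(p.toFile())) {
            return zf.getEntry("[Content_Types].xml") != null ? "ooxml-like" : "application/zip";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory ZIP with one marker entry, for trying the sketch out. */
    static byte[] sampleZip() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zos.write("<Types/>".getBytes());
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With this shape, a plain stream only gets the cheap magic-byte answer, and a 
caller who wants the detailed answer opts in by calling getFile() first, which 
is the trade-off argued for above.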

> ZipContainerDetector.detect() can spool the entire stream to a temporary file
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-1190
>                 URL: https://issues.apache.org/jira/browse/TIKA-1190
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>
> As noted in a TODO comment, currently the {{ZipContainerDetector}} calls 
> {{getFile()}} on a given {{TikaInputStream}} instance (that looks like a ZIP 
> archive) without using the {{hasFile()}} method to check whether a backing 
> file is actually available.
> This is troublesome as it can lead to unexpected performance loss due to the 
> entire stream getting spooled to a temporary file that might not be needed at 
> all after the detection.
> A better approach would be to only do the more detailed "full file" format 
> detection if the backing file is already available, i.e. if {{hasFile()}} 
> returns true.



