[ 
https://issues.apache.org/jira/browse/TIKA-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811110#comment-13811110
 ] 

Nick Burch commented on TIKA-1190:
----------------------------------

Without the whole file, we can't do the zip container detection accurately. 
If the input is a stream, we need to buffer it; otherwise the data won't be 
there for the parser!

Isn't the right fix for people to just skip that Detector if they don't want 
the whole file used? (The container-aware detectors all need the whole file; 
that's largely the point of them.)

I worry that people will get very confused if some kinds of TikaInputStream do 
correct detection and others don't.

> ZipContainerDetector.detect() can spool the entire stream to a temporary file
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-1190
>                 URL: https://issues.apache.org/jira/browse/TIKA-1190
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>
> As noted in a TODO comment, currently the {{ZipContainerDetector}} calls 
> {{getFile()}} on a given {{TikaInputStream}} instance (that looks like a ZIP 
> archive) without using the {{hasFile()}} method to check whether a backing 
> file is actually available.
> This is troublesome as it can lead to unexpected performance loss due to the 
> entire stream getting spooled to a temporary file that might not be needed at 
> all after the detection.
> A better approach would be to only do the more detailed "full file" format 
> detection if the backing file is already available, i.e. if {{hasFile()}} 
> returns true.
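
The guard proposed in the description could look roughly like the sketch below. It does not use Tika itself: SpoolingStream is a hypothetical stand-in that models only the hasFile()/getFile() behaviour described above, and inspectContainer() and its return value are placeholders for the real container-aware detection. It also shows the trade-off raised in the comment: a plain stream only gets the generic application/zip answer.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class HasFileGuardSketch {

    /** Hypothetical stand-in for TikaInputStream; models only hasFile()/getFile(). */
    static class SpoolingStream {
        private final InputStream in;
        private final boolean zipMagic;
        private Path file; // null until the stream has been spooled to disk

        SpoolingStream(InputStream raw) {
            this.in = new BufferedInputStream(raw);
            boolean magic;
            try {
                // Peek at the first four bytes without consuming them.
                in.mark(4);
                byte[] head = in.readNBytes(4);
                in.reset();
                // ZIP local file header signature: 'P' 'K' 0x03 0x04
                magic = head.length == 4 && head[0] == 'P' && head[1] == 'K'
                        && head[2] == 3 && head[3] == 4;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            this.zipMagic = magic;
        }

        boolean looksLikeZip() { return zipMagic; }

        boolean hasFile() { return file != null; }

        /** Spools the whole stream to a temporary file on first use (the costly step). */
        Path getFile() {
            if (file == null) {
                try {
                    file = Files.createTempFile("spool-", ".tmp");
                    file.toFile().deleteOnExit();
                    Files.copy(in, file, StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return file;
        }
    }

    /** Only does "full file" detection when a backing file already exists. */
    static String detect(SpoolingStream stream) {
        if (!stream.looksLikeZip()) {
            return "application/octet-stream";
        }
        if (stream.hasFile()) {
            return inspectContainer(stream.getFile());
        }
        // No backing file: avoid spooling and report the generic type.
        return "application/zip";
    }

    /** Placeholder for real container inspection; the returned type is made up. */
    static String inspectContainer(Path file) {
        return "application/x-inspected-container";
    }
}
```

With this guard, detect() never triggers the temporary-file spool on its own; the more specific answer is only produced once something else has already called getFile().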



--
This message was sent by Atlassian JIRA
(v6.1#6144)