[
https://issues.apache.org/jira/browse/TIKA-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811245#comment-13811245
]
Jukka Zitting commented on TIKA-1190:
-------------------------------------
bq. We need to buffer it if it's a stream, otherwise the data won't be there
for the parser!
The parser can still buffer the stream to a temporary file even if the detector
doesn't do that. The only limitation for the parsing use case would be that the
AutoDetectParser might not be able to directly dispatch the document to the
correct parser, but it should be possible to work around that by doing the more
detailed type detection in PackageParser and re-dispatching the parsing of the
document if a more specific container format is detected.
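As a rough illustration of that workaround, here is a minimal, self-contained sketch of the re-dispatch idea. It does not use Tika's classes: RedispatchSketch, parse(), parsePackage() and refine() are hypothetical stand-ins for AutoDetectParser and PackageParser, an in-memory byte array stands in for the temporary file the parser would buffer to, and the "contains word/document.xml" check is only a simplified placeholder for the real container detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical model only; none of these names are Tika API.
class RedispatchSketch {
    static final String ZIP = "application/zip";
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // Stand-in for AutoDetectParser: dispatches on the coarse type only.
    static String parse(String type, byte[] document) throws IOException {
        if (ZIP.equals(type)) {
            return parsePackage(document);
        }
        return "parsed as " + type;
    }

    // Stand-in for PackageParser: refines the type from the buffered
    // document and re-dispatches when something more specific is found.
    static String parsePackage(byte[] document) throws IOException {
        String refined = refine(document);
        if (!ZIP.equals(refined)) {
            return parse(refined, document);
        }
        return "parsed as generic " + ZIP;
    }

    // Simplified placeholder for detailed container detection.
    static String refine(byte[] document) throws IOException {
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(document))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    return DOCX;
                }
            }
        }
        return ZIP;
    }

    // Test helper: builds a one-entry ZIP archive in memory.
    static byte[] zipWithEntry(String name) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(1);
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The detector only reported the coarse type; the package handler
        // still ends up handing the document to the more specific handler.
        System.out.println(parse(ZIP, zipWithEntry("word/document.xml")));
        System.out.println(parse(ZIP, zipWithEntry("readme.txt")));
    }
}
```

The point of the sketch is only that the detector can stay cheap while the parser, which has to read the whole document anyway, pays for the detailed look.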
bq. Isn't the right fix for people to just skip that Detector if they don't
want the whole file used?
Doing so would also drop the advanced type header detection by
commons-compress. That detection code doesn't need the whole file, but is also
too complex to express in the MIME magic database.
bq. I worry that people will get very confused if some kinds of TikaInputStream
do correct detection, and others don't
We already have the case that some kinds of InputStreams do correct detection
and others don't, and that seems to work just fine. Instead of saying to people
that just passing a TikaInputStream will give you advanced detection, it's IMHO
better to explain that advanced type detection is possible when the document is
available as a random-access file wrapped in a TikaInputStream. And it would
still be possible for people to force the spooling (and thus enable the detailed
zip detection) by calling TikaInputStream.getFile() before passing the stream
to a detector.
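To make the proposed behaviour concrete, here is a minimal, self-contained sketch of it. It deliberately avoids Tika itself: SpoolableStream is a hypothetical stand-in that mimics only the hasFile()/getFile() contract of TikaInputStream, detect() is a hypothetical stand-in for ZipContainerDetector.detect(), and looking for a word/document.xml entry is a simplified placeholder for the real "full file" detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical model only; none of these names are Tika API.
class SpoolGuardSketch {
    static final String ZIP = "application/zip";
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // Stand-in for TikaInputStream: spools to a temporary file on demand.
    static final class SpoolableStream {
        private final InputStream in;
        private Path file; // null until a backing file exists

        SpoolableStream(InputStream in) { this.in = in; }

        boolean hasFile() { return file != null; }

        Path getFile() throws IOException {
            if (file == null) {
                Path tmp = Files.createTempFile("spool-", ".tmp");
                tmp.toFile().deleteOnExit();
                // createTempFile already created the file, so overwrite it.
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                file = tmp;
            }
            return file;
        }
    }

    // Stand-in for the detector. Assumes the caller has already matched the
    // ZIP magic bytes; only looks inside if a backing file already exists.
    static String detect(SpoolableStream stream) throws IOException {
        if (!stream.hasFile()) {
            return ZIP; // no spooling just for the sake of detection
        }
        try (ZipFile zip = new ZipFile(stream.getFile().toFile())) {
            return zip.getEntry("word/document.xml") != null ? DOCX : ZIP;
        }
    }

    // Test helper: builds a one-entry ZIP archive in memory.
    static byte[] zipWithEntry(String name) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(1);
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = zipWithEntry("word/document.xml");

        // Plain stream: coarse result, nothing is spooled.
        SpoolableStream plain = new SpoolableStream(new ByteArrayInputStream(docx));
        System.out.println(detect(plain) + " hasFile=" + plain.hasFile());

        // Caller opts in to detailed detection by forcing the spool first.
        SpoolableStream forced = new SpoolableStream(new ByteArrayInputStream(docx));
        forced.getFile();
        System.out.println(detect(forced) + " hasFile=" + forced.hasFile());
    }
}
```

With this shape the cost of spooling is always a decision made by the caller, never a side effect of detection.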
> ZipContainerDetector.detect() can spool the entire stream to a temporary file
> -----------------------------------------------------------------------------
>
> Key: TIKA-1190
> URL: https://issues.apache.org/jira/browse/TIKA-1190
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
>
> As noted in a TODO comment, currently the {{ZipContainerDetector}} calls
> {{getFile()}} on a given {{TikaInputStream}} instance (that looks like a ZIP
> archive) without using the {{hasFile()}} method to check whether a backing
> file is actually available.
> This is troublesome as it can lead to unexpected performance loss due to the
> entire stream getting spooled to a temporary file that might not be needed at
> all after the detection.
> A better approach would be to only do the more detailed "full file" format
> detection if the backing file is already available, i.e. if {{hasFile()}}
> returns true.