[ https://issues.apache.org/jira/browse/TIKA-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13811304#comment-13811304 ]
Nick Burch commented on TIKA-1190:
----------------------------------
bq. Doing so would also drop the advanced type header detection by
commons-compress. That detection code doesn't need the whole file, but is also
too complex to express in the MIME magic database.
Isn't the right fix, then, to pull that part of the detector out into a new one?
That would allow people to exclude the "needs full files" detectors like POIFS,
Zip, Vorbis etc., while still keeping the "needs the first bit of the file"
compress detection.
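
To make the split concrete, here is a rough sketch of what I mean. It is not Tika code: {{SketchDetector}}, {{PrefixDetector}}, {{FullFileDetector}} and {{DetectorChain}} are hypothetical names, and the "look for a manifest entry name" check is only a toy stand-in for real container inspection. The point is just that the header check and the whole-file check live in separate detectors, so a caller can leave the expensive one out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical stand-in for a detector; not Tika's actual Detector interface. */
interface SketchDetector {
    /** Returns a media type string, or null if this detector cannot decide. */
    String detect(InputStream in) throws IOException;
}

/** Needs only the first few bytes: peeks at the header, then rewinds. */
final class PrefixDetector implements SketchDetector {
    @Override
    public String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            return null;
        }
        in.mark(4);
        byte[] head = in.readNBytes(4);
        in.reset();
        // ZIP local file header signature: 'P' 'K' 0x03 0x04
        boolean zip = head.length == 4
                && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4;
        return zip ? "application/zip" : null;
    }
}

/** Needs the whole content: reads everything before answering (toy check). */
final class FullFileDetector implements SketchDetector {
    @Override
    public String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            return null;
        }
        in.mark(Integer.MAX_VALUE);
        byte[] all = in.readAllBytes(); // this is the expensive part
        in.reset();
        String raw = new String(all, StandardCharsets.ISO_8859_1);
        // Toy assumption: an entry name in the raw bytes identifies the container.
        return raw.contains("META-INF/MANIFEST.MF") ? "application/java-archive" : null;
    }
}

/** Asks each configured detector in order; the first non-null answer wins. */
final class DetectorChain {
    static String detect(List<SketchDetector> detectors, InputStream in) throws IOException {
        for (SketchDetector d : detectors) {
            String type = d.detect(in);
            if (type != null) {
                return type;
            }
        }
        return "application/octet-stream";
    }
}
```

With both detectors in the list you get the most specific answer available; someone who only wants the quick answer passes just {{new PrefixDetector()}} and never pays for the full read.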
bq. IMHO better to explain that advanced type detection is possible when the
document is available as a random-access file wrapped to a TikaInputStream
That doesn't feel right to me, especially as some detectors may be able to work
with just a stream. I'd much rather we say Tika will do its best unless you
explicitly tell it otherwise. Remember back a few years to all the queries
on-list and in JIRA about incorrect detection for these container formats. My
belief is that most people asking for detection want the best answer available.
Those with special requirements (e.g. the quickest close-enough answer, as in
your case) should, I believe, be asking for that explicitly, based on their
specific requirements, rather than us changing the default for most people
(including those new to Tika, who'll be confused).
> ZipContainerDetector.detect() can spool the entire stream to a temporary file
> -----------------------------------------------------------------------------
>
> Key: TIKA-1190
> URL: https://issues.apache.org/jira/browse/TIKA-1190
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.4
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
>
> As noted in a TODO comment, currently the {{ZipContainerDetector}} calls
> {{getFile()}} on a given {{TikaInputStream}} instance (that looks like a ZIP
> archive) without using the {{hasFile()}} method to check whether a backing
> file is actually available.
> This is troublesome as it can lead to unexpected performance loss due to the
> entire stream getting spooled to a temporary file that might not be needed at
> all after the detection.
> A better approach would be to only do the more detailed "full file" format
> detection if the backing file is already available, i.e. if {{hasFile()}}
> returns true.
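
For reference, the guard described in the issue text above could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual {{ZipContainerDetector}} code: {{SpoolableStream}} is a hypothetical stand-in that only mimics the {{hasFile()}} / {{getFile()}} pair, {{GuardedDetection}} is an invented name, the manifest lookup is a toy check, and the caller is assumed to have already matched the ZIP magic bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical stand-in mimicking the hasFile()/getFile() pair; not TikaInputStream. */
final class SpoolableStream {
    private final InputStream in;
    private Path file; // null until a backing file exists

    SpoolableStream(InputStream in, Path file) {
        this.in = in;
        this.file = file;
    }

    boolean hasFile() {
        return file != null;
    }

    /** Spools the entire stream to a temporary file if there is no backing file yet. */
    Path getFile() throws IOException {
        if (file == null) {
            file = Files.createTempFile("spool-", ".tmp");
            Files.copy(in, file, StandardCopyOption.REPLACE_EXISTING);
        }
        return file;
    }
}

final class GuardedDetection {
    /** Does the detailed check only when a backing file is already available. */
    static String detect(SpoolableStream stream) throws IOException {
        if (stream.hasFile()) {
            return detectFromFile(stream.getFile()); // no spooling happens here
        }
        // Assumption: the caller already saw the ZIP magic, so this is the coarse answer.
        return "application/zip";
    }

    /** Toy full-file check: looks for a manifest entry name in the raw bytes. */
    private static String detectFromFile(Path file) throws IOException {
        String raw = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        return raw.contains("META-INF/MANIFEST.MF") ? "application/java-archive" : "application/zip";
    }
}
```

The trade-off is the one under discussion: a plain stream gets the coarse answer without any temporary file being created, while a file-backed stream still gets the detailed one.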