Tim Allison created TIKA-4618:
---------------------------------

             Summary: Improve spooling strategies in 4.x
                 Key: TIKA-4618
                 URL: https://issues.apache.org/jira/browse/TIKA-4618
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


On TIKA-4474, there's a request to spool zip-based document formats. With 
solid-state drives, spooling to disk no longer carries the performance hit it 
once did. We'd probably be better off, in general, spooling "random-access" 
file formats (zip-based, OLE, PDF, and ?).


I'm not sure whether we should add a simple "pre-detection" step to augment 
"maybeSpool" in the AutoDetectParser, or whether we should just beef up the 
detectors and allow configuration there so that the zip detector runs the 
strategy.


The idea would be to use the underlying file if it exists. If it doesn't, check 
whether the stream is smaller than a threshold (default = 100kb?): if so, don't 
spool; otherwise, spool.
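
The strategy described above could be sketched roughly as follows. This is a 
minimal, hypothetical illustration in plain Java, not Tika's actual API: the 
class SpoolStrategySketch, the Action enum, and the decide method are made-up 
names, the Path argument stands in for "the underlying file if it exists", and 
the 100 KB default is only the figure floated above. It assumes the stream 
supports mark/reset so that peeking at its length does not consume any bytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch only; none of these names exist in Tika.
final class SpoolStrategySketch {

    // Assumed default, taken from the "100kb?" suggestion in the issue.
    static final int DEFAULT_THRESHOLD_BYTES = 100 * 1024;

    enum Action { USE_EXISTING_FILE, KEEP_IN_MEMORY, SPOOL_TO_DISK }

    /**
     * Decides what to do with an input before parsing.
     *
     * @param underlyingFile the backing file, or null if there is none
     * @param stream         a stream that supports mark/reset
     * @param thresholdBytes streams shorter than this are not spooled
     */
    static Action decide(Path underlyingFile, InputStream stream, int thresholdBytes)
            throws IOException {
        // 1. A backing file already gives random access; nothing to spool.
        if (underlyingFile != null) {
            return Action.USE_EXISTING_FILE;
        }
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        // 2. Peek at up to thresholdBytes bytes, then rewind the stream.
        stream.mark(thresholdBytes);
        byte[] buffer = new byte[8192];
        int total = 0;
        while (total < thresholdBytes) {
            int n = stream.read(buffer, 0, Math.min(buffer.length, thresholdBytes - total));
            if (n < 0) {
                break; // hit end of stream before reaching the threshold
            }
            total += n;
        }
        stream.reset();
        // 3. Shorter than the threshold: don't spool. Otherwise: spool.
        return total < thresholdBytes ? Action.KEEP_IN_MEMORY : Action.SPOOL_TO_DISK;
    }

    public static void main(String[] args) throws IOException {
        InputStream small = new BufferedInputStream(new ByteArrayInputStream(new byte[1024]));
        InputStream large = new BufferedInputStream(new ByteArrayInputStream(new byte[200 * 1024]));
        System.out.println(decide(null, small, DEFAULT_THRESHOLD_BYTES));
        System.out.println(decide(null, large, DEFAULT_THRESHOLD_BYTES));
        System.out.println(decide(Paths.get("example.docx"), small, DEFAULT_THRESHOLD_BYTES));
    }
}
```

Where this decision would actually live (a pre-detection step next to 
"maybeSpool", or inside a configurable detector) is exactly the open design 
question; the sketch only shows the file/threshold logic itself.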

If anyone has any thoughts on the cleanest design, please offer input.

cc [~manish003] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
