[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824568#comment-16824568
 ] 

Tim Allison commented on TIKA-2849:
-----------------------------------

[~boris-petrov] not for your use case, I agree.  

My initial reluctance to respond usefully to this ticket came from treating 
detect+parse as _the_ paradigmatic use case. It took me a while to fully 
comprehend how awful our current behavior is for your use case: detection 
alone on a slow network drive.

So, if you are going to parse the file, too, then it is better to use 
TikaInputStream because that will spool the stream to a local file (if no 
underlying file exists) and/or reuse the underlying file for both detection 
and parsing.  Also, some _parsers_ memory-map the underlying file, so they 
are much kinder on RAM when an actual file is available.  For _detection_ 
alone, though, we can do better.
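
To make the "detection alone" point concrete, here is a minimal sketch of 
the general idea: read only a small, bounded prefix of the stream, then 
rewind, so nothing gets copied to local disk. This is plain JDK code and is 
_not_ Tika's implementation; the class name PrefixDetect, the MAX_PREFIX cap 
and the tiny magic table are all hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika.
public class PrefixDetect {
    // Assumed cap on how many bytes detection is allowed to read.
    static final int MAX_PREFIX = 8;

    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};

    /** Reads at most MAX_PREFIX bytes, then rewinds so a later consumer sees the full stream. */
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAX_PREFIX);
        byte[] buf = new byte[MAX_PREFIX];
        int n = 0;
        try {
            // A single read() may return fewer bytes than requested, so loop.
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) {
                    break; // end of stream
                }
                n += r;
            }
        } finally {
            in.reset(); // rewind to the marked position
        }
        if (startsWith(buf, n, PDF)) return "application/pdf";
        if (startsWith(buf, n, ZIP)) return "application/zip";
        if (startsWith(buf, n, PNG)) return "image/png";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, int len, byte[] magic) {
        if (len < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a huge remote file; only the first MAX_PREFIX bytes are touched.
        byte[] data = "%PDF-1.7 ...rest of a very large file...".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(detect(in));
        System.out.println("rewound to start: " + (in.read() == '%'));
    }
}
```

The trade-off is the one described above: a bounded read is cheap over a 
slow network, but it leaves no local file behind for a later parse to reuse 
or memory-map.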

> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
>                 Key: TIKA-2849
>                 URL: https://issues.apache.org/jira/browse/TIKA-2849
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", 
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, 
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could 
> be, as in our case, an input stream from a network file which is tens or 
> hundreds of gigabytes large. Copying it locally is a huge waste of resources 
> to say the least. Why does it do that and can I make it not do it? Or is this 
> something that has to be fixed in Tika?



