[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

Tim Allison (Jira) Fri, 10 Apr 2020 09:08:45 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080610#comment-17080610
 ]


Tim Allison commented on TIKA-2849:
-----------------------------------

Sorry for the delay. I've been thinking about this but don't have a great answer yet.

[~boris-petrov], for detection, I can understand reading the first {{n}} bytes 
and wanting to limit that.  For parsing, though, wouldn't you expect that you'd 
need to read the entire stream?

[~lfcnassif], I'm now somewhat regretting the max spool size (I regret adding 
complexity to TikaInputStream) and would prefer to see whether there are 
other options... Let's get the 1.24.1 release out...

Would it work to wrap your stream in a BoundedInputStream before sending to 
TikaInputStream?
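
To make that suggestion concrete, here is a rough, JDK-only sketch of what a 
bounded wrapper does. {{CappedInputStream}} is a hypothetical stand-in for 
Commons IO's BoundedInputStream, written out only so the example runs without 
extra dependencies, and the 64 KB cap is an arbitrary number chosen for 
illustration. Whether a cap like this is acceptable for parsing, as opposed to 
detection, is still the open question above.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedStreamSketch {

    /** Hypothetical stand-in for Commons IO's BoundedInputStream (assumption, not Tika code). */
    static final class CappedInputStream extends FilterInputStream {
        private long remaining;

        CappedInputStream(InputStream in, long maxBytes) {
            super(in);
            this.remaining = maxBytes;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1; // report end of stream once the cap is reached
            }
            int b = in.read();
            if (b >= 0) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            // Never ask the underlying stream for more than the cap allows.
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = in.skip(Math.min(n, remaining));
            remaining -= skipped;
            return skipped;
        }

        @Override
        public int available() throws IOException {
            return (int) Math.min(in.available(), remaining);
        }

        // Mark/reset is disabled so the byte budget cannot be rewound by accident.
        @Override
        public boolean markSupported() {
            return false;
        }

        @Override
        public void mark(int readLimit) {
            // intentionally a no-op
        }

        @Override
        public void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }

    /** Drains the stream and returns how many bytes a downstream consumer could read. */
    static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] large = new byte[1_000_000]; // stands in for a very large network file
        try (InputStream capped =
                 new CappedInputStream(new ByteArrayInputStream(large), 64 * 1024)) {
            // With Tika on the classpath, the capped stream would be handed over here,
            // e.g. TikaInputStream.get(capped), before calling detect().
            System.out.println(countBytes(capped)); // prints 65536
        }
    }
}
```

Anything downstream of the wrapper, including a copy to a temp file, can read 
at most the capped number of bytes, so the limit lives in the caller's code 
rather than in TikaInputStream itself.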


> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
>                 Key: TIKA-2849
>                 URL: https://issues.apache.org/jira/browse/TIKA-2849
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", 
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, 
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could 
> be, as in our case, an input stream from a network file which is tens or 
> hundreds of gigabytes large. Copying it locally is a huge waste of resources 
> to say the least. Why does it do that and can I make it not do it? Or is this 
> something that has to be fixed in Tika?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
