[ 
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080195#comment-17080195
 ] 

LuĂ­s Filipe Nassif edited comment on TIKA-2849 at 4/10/20, 1:50 PM:
--------------------------------------------------------------------

Hi [~boris-petrov],

There are a number of Tika parsers that need a java.io.File because it is 
required by Tika's dependencies. Looking at the current sources, I found File is 
needed by parsers for rar, 7z, pst, mp4, jpg, tif, webp, sqlite, maybe 
others... Currently there is no way to know whether a parser will spool the 
stream or not without looking at the sources.

But my organization has a project with a hard requirement to run a search 
tool on computers/cellphones with very limited resources in the field, and we 
would prefer to receive an IOException("File size larger than max spool limit") 
from parsers instead of waiting too long in dangerous places or exhausting the 
machine's resources and crashing the app...

[~tallison], what do you think of a new TikaInputStream constructor that takes 
the spool limit, or a setMaxSpoolSize() method to set it? If the limit is 
reached in getPath(), TikaInputStream should throw the IOException above. The 
problem is that many parsers (including third-party parsers) use getPath(), not 
a new getPath(maxBytes); and if the latter is used, maxBytes would have to be 
passed to parsers in some way (ParseContext?). But I don't think parsers should 
have to take care of this, so I prefer the first approach...

If approved, I can code that; it's simple.
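To make the proposal concrete, here is a minimal, self-contained sketch of the spooling loop with a limit check. It does not use the real Tika API; the helper name spoolWithLimit and the 8 KB buffer size are illustrative assumptions, and the actual change would live inside TikaInputStream's temp-file spooling in getPath():

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only: spool a stream to a temp file, but fail fast
// with an IOException once maxBytes have been written, instead of filling
// the disk. The method name is hypothetical, not Tika's API.
public class SpoolLimitSketch {

    static Path spoolWithLimit(InputStream in, long maxBytes) throws IOException {
        Path tmp = Files.createTempFile("spool", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (total > maxBytes) {
                    // This is the behavior proposed above for getPath().
                    throw new IOException("File size larger than max spool limit");
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            // Don't leave a partial temp file behind on failure.
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1000];

        // Under the limit: the stream is spooled normally.
        Path p = spoolWithLimit(new ByteArrayInputStream(data), 2048);
        System.out.println("spooled=" + Files.size(p));
        Files.delete(p);

        // Over the limit: the spool aborts with the proposed IOException.
        try {
            spoolWithLimit(new ByteArrayInputStream(data), 512);
        } catch (IOException e) {
            System.out.println("limit hit: " + e.getMessage());
        }
    }
}
```

With setMaxSpoolSize(), the limit would be stored on the TikaInputStream instance so existing getPath() callers need no signature change.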

> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
>                 Key: TIKA-2849
>                 URL: https://issues.apache.org/jira/browse/TIKA-2849
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", 
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in, 
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could 
> be, as in our case, an input stream from a network file which is tens or 
> hundreds of gigabytes large. Copying it locally is a huge waste of resources 
> to say the least. Why does it do that and can I make it not do it? Or is this 
> something that has to be fixed in Tika?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)