[
https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080195#comment-17080195
]
Luís Filipe Nassif edited comment on TIKA-2849 at 4/10/20, 3:52 AM:
--------------------------------------------------------------------
Hi [~boris-petrov],
There are a number of Tika parsers that need a java.io.File because one is
required by Tika's dependencies. Looking at the current sources, I found that a
File is needed by the parsers for rar, 7z, pst, mp4, jpg, tif, webp, sqlite,
and maybe others. Currently there is no way to know whether a parser will
spool the stream or not.
However, my organization has a project with a hard requirement to run a search
tool on computers/cellphones with very limited resources in the field, and we
would prefer to receive an IOException("File size larger than max spool limit")
from parsers instead of waiting too long in dangerous places or exhausting the
computer's resources and crashing the app.
[~tallison], what do you think of a new TikaInputStream constructor that takes
the spool limit, or a setMaxSpoolSize() method to set it? If the limit is
reached in getPath(), TikaInputStream should throw the IOException above. The
problem is that many parsers (including third-party parsers) use getPath(), not
a new getPath(maxBytes); and if the latter were used, maxBytes would have to be
passed to parsers in some way (ParseContext?). I prefer the first approach.
If approved, I can code this; it is simple.
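A minimal standalone sketch of the proposed fail-fast spooling behavior. This is plain java.nio, not actual Tika code; the method name, class name, and exception message are illustrative assumptions about how a spool limit might work:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SpoolLimitSketch {
    // Copies the stream to a temp file, but fails fast once maxBytes is
    // exceeded instead of spooling an arbitrarily large input to disk.
    static Path spoolWithLimit(InputStream in, long maxBytes) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        long total = 0;
        byte[] buf = new byte[8192];
        try (OutputStream out = Files.newOutputStream(tmp)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (total > maxBytes) {
                    throw new IOException("File size larger than max spool limit");
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            // Clean up the partial temp file before propagating the error.
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }
}
```

A TikaInputStream.getPath() guarded this way would return the temp file for small streams and throw before wasting time and disk on oversized ones.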
> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
> Key: TIKA-2849
> URL: https://issues.apache.org/jira/browse/TIKA-2849
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.20
> Reporter: Boris Petrov
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream",
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in,
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could
> be, as in our case, an input stream from a network file which is tens or
> hundreds of gigabytes large. Copying it locally is a huge waste of resources
> to say the least. Why does it do that and can I make it not do it? Or is this
> something that has to be fixed in Tika?
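For context on why copying the whole stream just to detect is wasteful: content-type detection generally only needs a small prefix of the input. A standalone sketch of prefix peeking via mark/reset, using plain java.io (this is not Tika's actual detector code, just an illustration of the principle):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class PrefixPeek {
    // Reads up to n bytes from the head of a mark-supporting stream,
    // then resets so the caller can still consume the full stream.
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        byte[] buf = new byte[n];
        int read = 0, r;
        while (read < n && (r = in.read(buf, read, n - read)) != -1) {
            read += r;
        }
        in.reset();
        return Arrays.copyOf(buf, read);
    }
}
```

Detection over such a bounded prefix needs no local copy at all; the temp-file spooling only becomes unavoidable when a parser's underlying library insists on a java.io.File.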
--
This message was sent by Atlassian Jira
(v8.3.4#803005)