[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080610#comment-17080610 ]

Tim Allison commented on TIKA-2849:
-----------------------------------

Sorry for my delay. I've been thinking about this without a great answer.

[~boris-petrov], for detection, I can understand reading the first {{n}} bytes and wanting to limit that. For parsing, though, wouldn't you expect that you'd need to read the entire stream?

[~lfcnassif], I'm somewhat regretting the max spool size (I regret adding complexity to TikaInputStream) now and would prefer to see if there might be other options... Let's get the 1.24.1 release out...

Would it work to wrap your stream in a BoundedInputStream before sending to TikaInputStream?

> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
>                 Key: TIKA-2849
>                 URL: https://issues.apache.org/jira/browse/TIKA-2849
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream",
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in,
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could
> be, as in our case, an input stream from a network file which is tens or
> hundreds of gigabytes large. Copying it locally is a huge waste of resources
> to say the least. Why does it do that and can I make it not do it? Or is this
> something that has to be fixed in Tika?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
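
The suggestion in the comment above (bound the stream before handing it to TikaInputStream) could look roughly like the sketch below. It is only an illustration under stated assumptions: CappedInputStream and drain are hypothetical names written against the plain JDK so the example runs on its own. In real code the wrapper would presumably be Commons IO's BoundedInputStream, with the result passed to TikaInputStream.get(...). The 64 KiB cap is an arbitrary value, not a Tika default.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: a JDK-only stand-in for a bounded stream wrapper.
final class BoundedStreamSketch {

    /** Hypothetical wrapper that reports end-of-stream after maxBytes bytes. */
    static final class CappedInputStream extends FilterInputStream {
        private long remaining;

        CappedInputStream(InputStream in, long maxBytes) {
            super(in);
            this.remaining = maxBytes;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1; // cap reached: pretend the stream has ended
            }
            int b = super.read();
            if (b >= 0) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            // Never ask the underlying stream for more than the cap allows.
            int n = super.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = super.skip(Math.min(n, remaining));
            remaining -= skipped;
            return skipped;
        }

        @Override
        public int available() throws IOException {
            return (int) Math.min(super.available(), remaining);
        }

        @Override
        public boolean markSupported() {
            // mark/reset is not tracked here, so callers must buffer themselves.
            return false;
        }
    }

    /** Reads the stream to its (possibly capped) end and returns the byte count. */
    static long drain(InputStream in) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a very large remote file.
        byte[] data = new byte[1_000_000];
        try (InputStream capped =
                 new CappedInputStream(new ByteArrayInputStream(data), 64 * 1024)) {
            // Anything that consumes (or spools) this stream sees at most 64 KiB.
            System.out.println(drain(capped)); // prints 65536
        }
    }
}
```

With a wrapper like this, a spool to a temporary file would copy at most the capped number of bytes. That suits detection, which needs only a prefix of the stream; as the comment notes, parsing would normally need the whole stream.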