[ https://issues.apache.org/jira/browse/TIKA-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365948#comment-17365948 ]
Tim Allison commented on TIKA-3450:
-----------------------------------
Does it repro with tika-app: java -Xmx1g -jar tika-app.jar file.csv? Or with
the -J option?
Can you truncate or compress the example file so that we can get a sense of
what it looks like?
When you say it doesn't repro with the regular /tika endpoint, what's the
lowest -Xmx you can set without hitting an OOM?
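For reference, a minimal sketch of the invocations being compared (jar names, the sample file name, and the default port are assumptions; adjust to your setup):

```shell
# Standalone repro with tika-app, heap capped at 1 GB:
java -Xmx1g -jar tika-app.jar file.csv        # plain-text extraction
java -Xmx1g -jar tika-app.jar -J file.csv     # JSON (recursive metadata) output

# Server-side comparison (port 9998 assumed):
java -Xmx1g -jar tika-server.jar &
curl -T file.csv http://localhost:9998/tika -H "Accept: text/plain"   # regular endpoint
curl -T file.csv http://localhost:9998/rmeta                          # JSON (rmeta) endpoint
```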
> CSV parsing consumes an exorbitant amount of memory/heap space when using
> server JSON endpoint
> ----------------------------------------------------------------------------------------------
>
> Key: TIKA-3450
> URL: https://issues.apache.org/jira/browse/TIKA-3450
> Project: Tika
> Issue Type: Bug
> Components: parser, server
> Affects Versions: 1.27
> Reporter: Carey Halton
> Priority: Major
>
> We've observed an issue where parsing large CSV files takes an exorbitant,
> seemingly unbounded amount of heap space (we haven't found a threshold under
> which the requests succeed, and we've gone up to 4 GB of heap). This repros
> when we use Tika server's new JSON extraction endpoint, both with the default
> TextAndCSVParser and when we configure the older TXTParser instead. For some
> reason it doesn't repro when using a non-JSON extraction endpoint (though the
> request still takes a few minutes in that case), so I wonder if there is some
> recursion issue going on (we didn't try with rmeta). In both cases it looks
> like the large file is being held as multiple character arrays in memory at
> once, along with an extremely large object array that contains each character
> as a string, and then some.
> I have a test file that reproduces the issue, but it looks like Jira won't
> let me upload it (it is just under the 60 MB limit, but I get an "An internal
> error has occurred." message). I also have sample heap dumps from the repro
> (one for each parser setup), but at roughly 4 GB each they are definitely too
> large to upload to Jira. Let me know if there is a way I can easily share
> these to help showcase the issue.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)