[
https://issues.apache.org/jira/browse/TIKA-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372022#comment-17372022
]
Tim Allison commented on TIKA-3450:
-----------------------------------
I'm also not able to replicate that abysmal performance in IntelliJ. The only
thing I can think of is that your JVM's max heap is near what's required for
the file, so you're going into heavy garbage collection.
I did just (re-)discover this issue with IntelliJ not respecting surefire args:
https://youtrack.jetbrains.com/issue/IDEA-272802 So you might think you're
changing -Xmx, but IntelliJ might not actually be picking that up.
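One quick way to rule out the surefire/IDE issue: print the heap ceiling the test JVM actually received and compare it against the -Xmx you think you set. A minimal sketch (the class name is just illustrative):

```java
// HeapCheck: prints the maximum heap the current JVM will use.
// Run (or call) this from inside the failing test run; if the number
// doesn't match your -Xmx, the IDE isn't passing your surefire args.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap (MB): " + maxBytes / (1024L * 1024L));
    }
}
```

Runtime.getRuntime().maxMemory() reports the effective -Xmx (approximately, since some regions are excluded), so a single println inside a test is enough to see what the forked JVM really got.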
> CSV parsing consumes an exorbitant amount of memory/heap space when using
> server JSON endpoint
> ----------------------------------------------------------------------------------------------
>
> Key: TIKA-3450
> URL: https://issues.apache.org/jira/browse/TIKA-3450
> Project: Tika
> Issue Type: Bug
> Components: parser, server
> Affects Versions: 1.27
> Reporter: Carey Halton
> Priority: Major
> Attachments: test large csv.zip
>
>
> We've observed an issue where parsing large CSV files consumes a seemingly
> unbounded amount of heap space and memory (we haven't found a heap size at
> which they succeed, even at 4 GB). This reproduces when we use Tika server's
> new JSON extraction endpoint, both with the default TextAndCSVParser and
> when we configure the older TXTParser instead. For some reason it doesn't
> reproduce with a non-JSON extraction endpoint (though the request still
> takes a few minutes in that case), so I wonder if there is some recursion
> issue going on (didn't try with rmeta). In both cases it looks like the
> large file is being held as multiple character arrays in memory at once,
> and there is also an extremely large object array that contains each
> character as a string, and then some.
> I have a test file that reproduces the issue, but it looks like Jira won't
> let me upload it (it is just under the 60 MB limit, but I get the "An
> internal error has occurred." message). I also have sample repro heap dumps
> that I can share (one for each parser setup), but they are definitely too
> large to upload to Jira (each is approximately 4 GB). Let me know if there
> is a way I can easily share these to help showcase the issue.
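For anyone trying to reproduce this, a sketch of the contrast described above, assuming a Tika server running on the default port 9998 and a hypothetical large.csv test file:

```shell
# Start the server (jar name matches the affected 1.27 release)
java -jar tika-server-1.27.jar &

# Non-JSON extraction: slow on large CSVs, but completes
curl -T large.csv http://localhost:9998/tika -H "Accept: text/plain"

# JSON extraction endpoint: the case reported to exhaust the heap
curl -T large.csv http://localhost:9998/tika -H "Accept: application/json"
```

Watching the server process's heap (e.g. with jconsole or a -Xmx cap) while the second request runs should show the unbounded growth described in the report.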
--
This message was sent by Atlassian Jira
(v8.3.4#803005)