Carey Halton created TIKA-3450:
----------------------------------

             Summary: CSV parsing consumes an exorbitant amount of memory/heap space when using server JSON endpoint
                 Key: TIKA-3450
                 URL: https://issues.apache.org/jira/browse/TIKA-3450
             Project: Tika
          Issue Type: Bug
          Components: parser, server
    Affects Versions: 1.27
            Reporter: Carey Halton


We've observed an issue where parsing large CSV files consumes an extreme, 
seemingly unbounded amount of heap space (we haven't found a heap size at 
which they succeed, even going up to 4 GB). This reproduces when we use Tika 
server's new JSON extraction endpoint, both with the default TextAndCSVParser 
and when we configure the older TXTParser instead. It does not reproduce 
against a non-JSON extraction endpoint (though the request still takes a few 
minutes in that case), so we wonder whether there is a recursion issue (we 
didn't try rmeta). In both cases the heap dumps suggest the large file is held 
as multiple character arrays in memory at once, alongside an extremely large 
object array that contains each character as a string, and then some.

I have a test file that reproduces the issue, but it looks like Jira won't let 
me upload it (it is just under the 60 MB limit, but I get the "An internal 
error has occurred." message). I also have sample repro heap dumps that I can 
share (one for each parser setup), but at approximately 4 GB each they are 
definitely too large to upload to Jira at all. Let me know if there is a way I 
can easily share these to help showcase the issue.
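In the meantime, a synthetic CSV of similar size may be enough to attempt a 
repro. Below is a minimal sketch for generating one; the column count, cell 
contents, and target size are assumptions, not the characteristics of my 
actual test file, so tune them as needed:

```python
import csv
import os


def write_large_csv(path, target_bytes=55 * 1024 * 1024, columns=20):
    """Write a synthetic CSV until the file reaches roughly target_bytes.

    NOTE: the shape of the real repro file is unknown; the header names,
    cell values, and defaults here are placeholders for illustration.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"col_{i}" for i in range(columns)])
        row_num = 0
        # f.tell() gives the current byte position of the underlying file,
        # so we stop once the file is approximately target_bytes long.
        while f.tell() < target_bytes:
            writer.writerow([f"value_{row_num}_{i}" for i in range(columns)])
            row_num += 1
    return os.path.getsize(path)


if __name__ == "__main__":
    size = write_large_csv("repro.csv")
    print(f"wrote repro.csv ({size} bytes)")
```

The resulting file can then be sent to the server's JSON extraction endpoint 
(e.g. via curl with an `Accept: application/json` header) versus the plain 
text endpoint to compare heap behavior.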



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
