[ https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955638#comment-17955638 ]
Tim Barrett commented on TIKA-4427:
-----------------------------------

Setting the pool size to 0 just leads to a load of IllegalArgumentExceptions:

java.lang.ExceptionInInitializerError: Exception java.lang.IllegalArgumentException [in thread "main"]
        at java.util.concurrent.ArrayBlockingQueue.<init>(ArrayBlockingQueue.java:272) ~[?:?]

I don't think that setting it to 1 would achieve anything either, as the single instance would be shared by all threads and would continue to hold and accumulate data.

My vote would be to dispense with the pooling. I'm building a data set now with 1m documents and 1b words. There have been no memory problems whatsoever since I removed the pooling in XMLReaderUtils, and it is building at a rate of 27m sentences per hour, which is the rate I would expect - so there is no noticeable performance degradation from instantiating SAX parsers on demand.

(A minimal sketch of on-demand SAX parser creation follows the quoted issue below.)

> Memory Leak when parsing a large (110K+) number of documents
> -------------------------------------------------------------
>
>                 Key: TIKA-4427
>                 URL: https://issues.apache.org/jira/browse/TIKA-4427
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.2.0
>            Reporter: Tim Barrett
>            Priority: Major
>         Attachments: Screenshot 2025-05-30 at 17.22.38.png, Screenshot 2025-05-30 at 18.31.01.png, Screenshot 2025-05-30 at 18.31.47.png
>
> When parsing a very large number of documents, which include a lot of eml files, we see that the static field XMLReaderUtils.SAX_PARSERS is holding a massive amount of memory: 3.28 GB. This is a static pool of cached SAXParser instances, each of which is holding onto substantial amounts of memory, apparently in the fDocumentHandler field.
> This is a big data test we run regularly; the memory issues did not occur in Tika version 2.x.
> I have attached JVM monitor screenshots.
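For reference, here is a minimal sketch of the two points in the comment, assuming nothing about Tika's actual XMLReaderUtils internals: java.util.concurrent.ArrayBlockingQueue rejects a capacity below 1, which is why a pool size of 0 fails with an IllegalArgumentException during class initialization, and SAX parsers can instead be created on demand through the standard JAXP SAXParserFactory. The class and method names below are hypothetical, and the hardening features shown are illustrative rather than Tika's exact configuration.

{code:java}
import java.util.concurrent.ArrayBlockingQueue;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;

public class OnDemandSaxParserSketch {

    public static void main(String[] args) throws Exception {
        // ArrayBlockingQueue requires capacity >= 1, so a pool size of 0 fails
        // as soon as the static pool is built, matching the
        // ExceptionInInitializerError in the comment above.
        try {
            new ArrayBlockingQueue<SAXParser>(0);
        } catch (IllegalArgumentException expected) {
            System.out.println("capacity 0 rejected: " + expected);
        }

        // Instead of pooling, build a fresh parser per parse and let it be
        // garbage-collected afterwards.
        SAXParser parser = newSaxParser();
        System.out.println("created " + parser.getClass().getName());
    }

    // Hypothetical on-demand factory method (not Tika's XMLReaderUtils code).
    static SAXParser newSaxParser() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Illustrative XXE hardening; Tika's real configuration may differ.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        return factory.newSAXParser();
    }
}
{code}

The trade-off the comment describes is that a fresh parser per document leaves nothing accumulating in long-lived fDocumentHandler references, at the cost of repeating factory lookup and parser construction per call; the reported 27m-sentences-per-hour throughput suggests that cost is negligible here.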