[ https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955638#comment-17955638 ]
Tim Barrett commented on TIKA-4427:
-----------------------------------

Setting the pool size to 0 just leads to a load of IllegalArgumentExceptions:

java.lang.ExceptionInInitializerError: Exception java.lang.IllegalArgumentException [in thread "main"]
        at java.util.concurrent.ArrayBlockingQueue.<init>(ArrayBlockingQueue.java:272) ~[?:?]

I don't think that setting it to 1 would achieve anything either, as the single instance would be shared by all threads and would continue to hold and accumulate data.

My vote would be to dispense with the pooling. I'm building a data set now with 1m documents and 1b words. There have been no memory problems whatsoever since I removed the pooling in XMLReaderUtils, and it is building at a rate of 27m sentences per hour, which is the rate I would expect - so there is no noticeable performance degradation from instantiating SAX parsers on demand.

(A minimal sketch of on-demand SAX parser creation follows the quoted issue below.)

> Memory Leak when parsing a large (110K+) number of documents
> -------------------------------------------------------------
>
>                 Key: TIKA-4427
>                 URL: https://issues.apache.org/jira/browse/TIKA-4427
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.2.0
>            Reporter: Tim Barrett
>            Priority: Major
>         Attachments: Screenshot 2025-05-30 at 17.22.38.png, Screenshot 2025-05-30 at 18.31.01.png, Screenshot 2025-05-30 at 18.31.47.png
>
> When parsing a very large number of documents, which include a lot of eml files, we see that the static field XMLReaderUtils.SAX_PARSERS is holding a massive amount of memory: 3.28 GB. This is a static pool of cached SAXParser instances, each of which is holding onto substantial amounts of memory, apparently in the fDocumentHandler field.
> This is a big data test we run regularly; the memory issues did not occur in Tika version 2.x.
> I have attached JVM monitor screenshots.
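For reference, here is a minimal sketch of the two points in the comment, assuming nothing about Tika's actual XMLReaderUtils internals: java.util.concurrent.ArrayBlockingQueue rejects a capacity below 1, which is why a pool size of 0 fails with an IllegalArgumentException during class initialization, and SAX parsers can instead be created on demand through the standard JAXP SAXParserFactory. The class and method names below are hypothetical, and the hardening features shown are illustrative rather than Tika's exact configuration.

{code:java}
import java.util.concurrent.ArrayBlockingQueue;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;

public class OnDemandSaxParserSketch {

    public static void main(String[] args) throws Exception {
        // ArrayBlockingQueue requires capacity >= 1, so a pool size of 0 fails
        // as soon as the static pool is built, matching the
        // ExceptionInInitializerError in the comment above.
        try {
            new ArrayBlockingQueue<SAXParser>(0);
        } catch (IllegalArgumentException expected) {
            System.out.println("capacity 0 rejected: " + expected);
        }

        // Instead of pooling, build a fresh parser per parse and let it be
        // garbage-collected afterwards.
        SAXParser parser = newSaxParser();
        System.out.println("created " + parser.getClass().getName());
    }

    // Hypothetical on-demand factory method (not Tika's XMLReaderUtils code).
    static SAXParser newSaxParser() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Illustrative XXE hardening; Tika's real configuration may differ.
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        return factory.newSAXParser();
    }
}
{code}

The trade-off the comment describes is that a fresh parser per document leaves nothing accumulating in long-lived fDocumentHandler references, at the cost of repeating factory lookup and parser construction per call; the reported 27m-sentences-per-hour throughput suggests that cost is negligible here.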