[ https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955441#comment-17955441 ]
Tim Allison commented on TIKA-4427:
-----------------------------------

Interesting. Thank you. We did this for TIKA-2645 (now linked). At the time, we saw a nearly 2x speedup in XML root detection (read to the first entity and then stop). My guess is that the difference may be a wash for a full parse of an XML file, but we should test that on small and large XML files -- with both full parsing and the original use case of XML root detection. We did not add this complexity for kicks. :D I'd much prefer to get rid of it if it no longer serves a purpose. If you set the pool size to 0, do you get the same outcome? I'll try this early this coming week and see what I find.

> Memory Leak when parsing a large (110K+) number of documents
> ------------------------------------------------------------
>
>                 Key: TIKA-4427
>                 URL: https://issues.apache.org/jira/browse/TIKA-4427
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.2.0
>            Reporter: Tim Barrett
>            Priority: Major
>         Attachments: Screenshot 2025-05-30 at 17.22.38.png, Screenshot 2025-05-30 at 18.31.01.png, Screenshot 2025-05-30 at 18.31.47.png
>
> When parsing a very large number of documents, including many eml files, we see that the static field XMLReaderUtils.SAX_PARSERS is holding a massive amount of memory: 3.28 GB. This is a static pool of cached SAXParser instances, each of which is holding onto substantial amounts of memory, apparently via the fDocumentHandler field.
>
> This is a big-data test we run regularly; the memory issues did not occur in Tika 2.x.
>
> I have attached JVM monitor screenshots.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
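The leak mechanism under discussion can be sketched with plain JAXP. This is a hypothetical bounded pool, not Tika's actual XMLReaderUtils implementation: if a parser is returned to a static pool without being reset, it can keep a reference to the last ContentHandler it was given (the fDocumentHandler field observed in the heap dump), pinning whatever that handler references. The sketch assumes calling SAXParser.reset() before pooling is sufficient to drop that reference, which is what the reset() contract ("restore original configuration") suggests.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParserPoolSketch {
    // Hypothetical static pool standing in for XMLReaderUtils.SAX_PARSERS.
    static final ArrayBlockingQueue<SAXParser> POOL = new ArrayBlockingQueue<>(4);

    static SAXParser acquire() throws Exception {
        SAXParser p = POOL.poll();
        // Pool miss: create a fresh parser instead of blocking.
        return (p != null) ? p : SAXParserFactory.newInstance().newSAXParser();
    }

    static void release(SAXParser p) {
        // Without this reset, the pooled parser can retain the last
        // ContentHandler internally, which is the leak pattern reported here.
        p.reset();
        POOL.offer(p); // if the pool is full, let the parser be GC'd
    }

    public static void main(String[] args) throws Exception {
        SAXParser p = acquire();
        p.parse(new ByteArrayInputStream(
                        "<root/>".getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler());
        release(p);
        System.out.println("pooled=" + POOL.size());
    }
}
```

Setting the pool size to 0, as suggested in the comment, effectively makes every `acquire()` a pool miss, so no parser (and no retained handler) outlives a single parse.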
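For context on the original use case, "read to the first entity and then stop" can be illustrated with a SAX handler that aborts the parse as soon as the root element is seen. The class and exception names here are illustrative, not Tika's actual root-detection code:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RootDetector {
    /** Control-flow exception used to abort once the root element is known. */
    static class StopParse extends SAXException {
        final String rootName;
        StopParse(String rootName) { super("root found"); this.rootName = rootName; }
    }

    static String detectRoot(byte[] xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String local,
                                                 String qName, Attributes atts)
                                throws SAXException {
                            // First startElement is the root; stop reading.
                            throw new StopParse(qName);
                        }
                    });
        } catch (StopParse e) {
            return e.rootName;
        }
        return null; // no element found in the input
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detectRoot(
                "<feed><entry/></feed>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Because this path touches only the first few bytes of each document, per-call parser construction dominates its cost, which is plausibly why pooling showed a nearly 2x speedup for root detection but may wash out on full parses.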