[ https://issues.apache.org/jira/browse/TIKA-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955441#comment-17955441 ]

Tim Allison commented on TIKA-4427:
-----------------------------------

Interesting. Thank you.

We did this for TIKA-2645 (now linked). At the time, we saw nearly a 2x 
improvement in speed for XML root detection (read to the first entity and then 
stop). My guess is that the diffs may be a wash for a full parse of an XML 
file, but we should test that on small and large XML files -- with both full 
parsing and the original use case of XML root detection.
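For context, the root-detection use case (read to the first entity, then stop) can be sketched with plain JAXP -- the class and helper names below are illustrative, not Tika's actual implementation:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RootDetector {

    /** Parses only until the first start element, then aborts the parse. */
    static String detectRoot(String xml) throws Exception {
        final String[] root = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) throws SAXException {
                root[0] = qName;
                // Deliberately abort: we only need the root element name.
                throw new SAXException("root-found");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException expected) {
            // Thrown above to stop early; real code would distinguish
            // this sentinel from genuine parse errors.
        }
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detectRoot("<feed xmlns='http://www.w3.org/2005/Atom'><entry/></feed>"));
    }
}
```

Creating a fresh parser per call, as above, is exactly the cost the pool was meant to amortize.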

We did not add this complexity for kicks. :D I'd much prefer to get rid of it 
if it no longer serves.

If you set the pool size to 0, do you get the same outcome? I'll try this early 
this coming week and see what I find.
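As background on why a pooled parser can pin memory: a JAXP SAXParser's underlying XMLReader keeps a reference to the last ContentHandler it was handed, so a statically cached parser keeps the previous document's handler (and whatever that handler buffers) reachable. A minimal sketch with the JDK's built-in parser -- the static field here is a hypothetical stand-in for Tika's pool, not its actual code:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerRetention {

    // Hypothetical stand-in for a static parser pool: one cached parser
    // reused across documents, reachable for the life of the JVM.
    private static final SAXParser CACHED = newParser();

    private static SAXParser newParser() {
        try {
            return SAXParserFactory.newInstance().newSAXParser();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static boolean retainsHandler() throws Exception {
        // Imagine this handler holds large per-document buffers.
        DefaultHandler handler = new DefaultHandler();
        CACHED.parse(new InputSource(new StringReader("<root/>")), handler);

        // After the parse, the cached parser's XMLReader still points at the
        // handler, so everything it references stays reachable from the
        // static field until the next document replaces it.
        return CACHED.getXMLReader().getContentHandler() == handler;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(retainsHandler());
    }
}
```

With one cached parser the retained handler is bounded by the last document; with a large pool, each slot can retain its own last handler, which is consistent with the per-instance fDocumentHandler retention reported below.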



> Memory Leak when parsing a large (110K+) number of documents 
> --------------------------------------------------------------
>
>                 Key: TIKA-4427
>                 URL: https://issues.apache.org/jira/browse/TIKA-4427
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.2.0
>            Reporter: Tim Barrett
>            Priority: Major
>         Attachments: Screenshot 2025-05-30 at 17.22.38.png, Screenshot 
> 2025-05-30 at 18.31.01.png, Screenshot 2025-05-30 at 18.31.47.png
>
>
> When parsing a very large number of documents, many of which are eml files, 
> we see that the static field XMLReaderUtils.SAX_PARSERS is holding a massive 
> amount of memory: 3.28 GB. This is a static pool of cached SAXParser 
> instances, each of which is holding onto a substantial amount of memory, 
> apparently in the fDocumentHandler field.
> This is a big-data test we run regularly; the memory issues did not occur in 
> Tika version 2.x.
>  
> I have attached JVM monitor screenshots.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
