[
https://issues.apache.org/jira/browse/DAFFODIL-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18014734#comment-18014734
]
Steve Lawrence commented on DAFFODIL-3030:
------------------------------------------
I believe I've found a major source of our memory leaks. The core issue is that
calling DataProcessor.withFoo() can leak memory in some cases, and calling it
many times can leak enough memory to cause out of memory errors.
Daffodil uses ThreadLocals in a couple of places. One is the "regexMatchState",
which can grow quite large for files that require a lot of large regex
matching. Another is within the Schematron and Xerces validators.
A potential gotcha with ThreadLocals is that the internal implementation is a
per-Thread map whose keys (the ThreadLocal objects) are weakly referenced but
whose values are not. This means a key can be garbage collected, leaving its
entry with a null key, but that entry's value is still strongly referenced and
cannot be GCed. Instead, the map periodically scans for entries whose keys have
become null and removes their values, finally allowing them to be garbage
collected. But that cleanup only runs when some method is actually called on a
ThreadLocal from that thread (e.g. get(), set(), remove()). If we never call
any of these, the stale values persist in memory. Making things worse, those
values remain reachable from the Thread itself, so even once the application
can no longer reference a ThreadLocal, its value lives on for as long as the
Thread does unless the cleanup happens to run. Ultimately, this means that if
we ever lose access to a ThreadLocal, any values that were added to it but not
explicitly removed can become a memory leak.
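As a minimal, standalone sketch of that JVM behavior (illustrative only, not
Daffodil code):

object ThreadLocalLeakSketch {
  def main(args: Array[String]): Unit = {
    // Store a large (~100MB) value in a ThreadLocal on the current thread.
    var tl = new ThreadLocal[Array[Byte]]()
    tl.set(new Array[Byte](100 * 1024 * 1024))

    // Drop our only reference to the ThreadLocal without calling remove().
    tl = null
    System.gc()

    // The 100MB array is still strongly referenced from this Thread's
    // internal ThreadLocal map. It can only be reclaimed after a stale-entry
    // cleanup runs, which requires some ThreadLocal operation
    // (get/set/remove) on this thread to trigger it.
  }
}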
And I think this is exactly what is happening.
For every test, our TDMLRunner does something like this:
val dp = originalDP.withValidation(validation)
val pr = dp.parse(...)
It then discards dp, no longer needing it and expecting it to be GCed.
But the problem is that withValidation("xerces") causes a XercesValidator
instance to be stored in a ThreadLocal. Daffodil doesn't know that dp will
never be used again, so it keeps the XercesValidator around in a ThreadLocal
for future parses. But since the TDML runner never parses with that dp again,
the XercesValidator just becomes a memory leak, lasting until the Thread exits.
Another problem, and potentially a bigger memory leak, is the "regexMatchState"
in the data processor. This is a CharBuffer/LongBuffer tuple stored in a
ThreadLocal that can grow quite large, especially if a schema has
unbounded-length regex patterns. As before, every time we call withValidation()
we create a new DataProcessor, which requires new regexMatchState buffers to be
allocated, and the previous ones become a memory leak. Fortunately
regexMatchState is lazy, so it at least shouldn't affect schemas that don't use
regexes.
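To make that pattern concrete, here is a hypothetical sketch (ProcessorLike and
withFoo are stand-ins, not the actual Daffodil implementation) of how each
derived processor's own ThreadLocal strands the previous instance's buffers:

import java.nio.{CharBuffer, LongBuffer}

object ProcessorLeakSketch {
  // Stand-in for DataProcessor: each instance carries its own ThreadLocal of
  // regex match buffers, populated lazily on first use.
  class ProcessorLike {
    private val regexMatchState = new ThreadLocal[(CharBuffer, LongBuffer)] {
      override def initialValue(): (CharBuffer, LongBuffer) =
        (CharBuffer.allocate(1024 * 1024), LongBuffer.allocate(1024))
    }

    def parse(): Unit = {
      // First use allocates the buffers and stores them in this Thread's map.
      val _ = regexMatchState.get()
    }

    // Stand-in for withValidation()/withFoo(): returns a new instance with a
    // brand new ThreadLocal; the old instance's buffers stay in the Thread's
    // map until a stale-entry cleanup happens to remove them.
    def withFoo(): ProcessorLike = new ProcessorLike
  }

  def main(args: Array[String]): Unit = {
    var dp = new ProcessorLike
    for (_ <- 1 to 100) {
      dp = dp.withFoo() // each call strands the previous instance's buffers
      dp.parse()
    }
  }
}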
Note that not all of our ThreadLocal uses have this issue. For example, Parsers
that use a ThreadLocal are safe, because calling DataProcessor.withFoo reuses
the same Parsers with the same ThreadLocals, so there is no memory leak.
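For contrast, a small illustrative sketch of why key reuse matters: setting a
value through the same ThreadLocal replaces the previous value, while creating
a fresh ThreadLocal each time adds a new entry and strands the old value:

object SharedVsFreshThreadLocals {
  def main(args: Array[String]): Unit = {
    // Reusing the same ThreadLocal key: set() replaces the previous value,
    // so the old array immediately becomes unreachable and GC-able.
    val shared = new ThreadLocal[Array[Byte]]()
    shared.set(new Array[Byte](10 * 1024 * 1024))
    shared.set(new Array[Byte](10 * 1024 * 1024))

    // Creating a fresh ThreadLocal each time: every key adds another entry to
    // this Thread's map, so each 10MB array stays strongly referenced until a
    // stale-entry cleanup happens to reclaim it.
    for (_ <- 1 to 10) {
      val fresh = new ThreadLocal[Array[Byte]]()
      fresh.set(new Array[Byte](10 * 1024 * 1024))
    }
  }
}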
Also, this really only occurs if you call DataProcessor.withFoo() a lot. That
is generally rare, but buggy code, like in the TDML runner, can do it. There
may also be other scenarios where it seems like a reasonable thing to do
(DataProcessor.withFoo is supposed to be efficient).
So the short-term fix is likely to change the TDML runner so it doesn't call
DataProcessor.withFoo() so often. A long-term fix is to repair our ThreadLocal
usage, or stop using ThreadLocals altogether, so we do not leak memory.
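As a hedged sketch of what the short-term fix might look like (the names and
the caching approach here are illustrative only, not the actual TDML runner
change), the runner could derive a DataProcessor once per validation setting
and reuse it across tests:

object TdmlRunnerFixSketch {
  // Sketch only: ProcessorLike stands in for Daffodil's DataProcessor, and
  // withValidation mirrors the call the TDML runner makes above.
  trait ProcessorLike {
    def withValidation(mode: String): ProcessorLike
  }

  // Derive and cache one processor per validation mode so withValidation()
  // runs once per mode rather than once per test, avoiding a fresh set of
  // ThreadLocal values for every test.
  private val dpCache = scala.collection.mutable.Map.empty[String, ProcessorLike]

  def processorFor(original: ProcessorLike, mode: String): ProcessorLike =
    dpCache.getOrElseUpdate(mode, original.withValidation(mode))
}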
> Investigate increased memory usage
> ----------------------------------
>
> Key: DAFFODIL-3030
> URL: https://issues.apache.org/jira/browse/DAFFODIL-3030
> Project: Daffodil
> Issue Type: Bug
> Reporter: Josh Adams
> Priority: Major
> Fix For: 4.0.0
>
>
> I'm seeing a significant increase in memory required for a particular DFDL
> schema project (P8). Prior to 4.0.0 the test suite (running "sbt test")
> required around 10GB of heap space to complete. After 4.0.0 it needs over
> 20GB.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)