Spyros Kapnissis created LUCENE-10171:
-----------------------------------------
Summary: Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory
Key: LUCENE-10171
URL: https://issues.apache.org/jira/browse/LUCENE-10171
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Affects Versions: 8.10, 7.7.3, main (9.0)
Reporter: Spyros Kapnissis
When a lemmas.txt dictionary file is provided, OpenNLPLemmatizerFilterFactory
internally caches only the string form of the dictionary, not the
DictionaryLemmatizer object. As a result, the dictionary is parsed again and a
new DictionaryLemmatizer object is created every time
OpenNLPLemmatizerFilterFactory.create() is called.
In our case, with a large lemmas.txt file (5MB) and the OpenNLPLemmatizerFilter
used in many fields across our setup and in multiple collections (we use Solr),
we had several random OOM issues and generally high server load due to GC
activity. Heap dump analysis showed a few thousand DictionaryLemmatizer
instances of around 80MB each.
By caching the DictionaryLemmatizer object instead of the String, we were able
to resolve these issues. I will attach a PR for review; please let me know if
you have any comments.
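
To illustrate the idea, here is a minimal, self-contained sketch of caching the
parsed object rather than the raw text. It is not the code from the PR and does
not use the Lucene or OpenNLP classes: ParsedLemmaDictionary is a hypothetical
stand-in for DictionaryLemmatizer, LemmaDictionaryCache and its get() method
are made-up names, and the tab-separated "word, POS tag, lemma" line format is
assumed for the example only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for an expensive parsed dictionary object. */
final class ParsedLemmaDictionary {
    /** Counts how many times the raw text has been parsed (for demonstration). */
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    private final Map<String, String> lemmas = new HashMap<>();

    /** Parses lines of the assumed form: word TAB posTag TAB lemma. */
    ParsedLemmaDictionary(String rawText) {
        PARSE_COUNT.incrementAndGet();
        for (String line : rawText.split("\n")) {
            String[] cols = line.split("\t");
            if (cols.length == 3) {
                lemmas.put(cols[0] + "\t" + cols[1], cols[2]);
            }
        }
    }

    /** Returns the lemma for (word, posTag), or the word itself if unknown. */
    String lemmatize(String word, String posTag) {
        return lemmas.getOrDefault(word + "\t" + posTag, word);
    }
}

/** Caches the parsed dictionary per resource name instead of the raw String. */
final class LemmaDictionaryCache {
    private static final Map<String, ParsedLemmaDictionary> CACHE =
        new ConcurrentHashMap<>();

    private LemmaDictionaryCache() {}

    /**
     * Returns the shared parsed dictionary for resourceName. The loader is
     * invoked, and the text parsed, only on the first request for that name.
     */
    static ParsedLemmaDictionary get(String resourceName, Supplier<String> loader) {
        return CACHE.computeIfAbsent(
            resourceName, name -> new ParsedLemmaDictionary(loader.get()));
    }
}
```

With this shape, every caller that asks for the same resource name receives the
same instance, so the parsing cost and the memory for the parsed structure are
paid once per dictionary rather than once per create() call. Sharing is only
safe if the cached object is not mutated after construction, which holds for
the stand-in class above.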
Thanks!