Martin Wiesner created OPENNLP-1505:
---------------------------------------
Summary: Reduce object creation in NGramCharModel and StringUtil
Key: OPENNLP-1505
URL: https://issues.apache.org/jira/browse/OPENNLP-1505
Project: OpenNLP
Issue Type: Improvement
Components: Language Detector
Affects Versions: 2.2.0
Reporter: Martin Wiesner
Assignee: Martin Wiesner
Fix For: 2.2.1
During a profiling session, I noticed that many tests in
opennlp.tools.langdetect take quite some time to execute. Digging deeper
into those tests, it quickly became obvious that StringUtil#toLowerCase()
creates a new String on every call (see NGramCharModel#add(...),
lines 99 to 108).
Because NGramCharModel calls this method very frequently, millions of String
objects are created while building ngrams for a given input.
Aims:
* Reduce object creation, in particular the creation of millions of String
objects
* Improve runtime of the langdetect tests (and potentially others)
Idea:
* Use (Heap)CharBuffer instead of String so that the underlying char arrays
can be re-used, instead of copying the chars over to a new String for each
"toLowerCase" call.
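To illustrate the idea, here is a minimal sketch, not the actual patch. The
class LowerCaseBufferSketch and the helper toLowerCaseInPlace are hypothetical
names and not part of OpenNLP. The sketch assumes per-char lower-casing via
Character.toLowerCase(char) is sufficient, so it ignores supplementary code
points and locale-specific rules. The input is lower-cased once inside its
backing char array, and each ngram window is then a view on that same array.

```java
import java.nio.CharBuffer;

public final class LowerCaseBufferSketch {

  // Hypothetical helper (not OpenNLP API): lower-cases the chars between
  // position and limit in place, so no new String is allocated.
  public static CharBuffer toLowerCaseInPlace(CharBuffer buffer) {
    for (int i = buffer.position(); i < buffer.limit(); i++) {
      // Absolute get/put leave the buffer's position untouched.
      buffer.put(i, Character.toLowerCase(buffer.get(i)));
    }
    return buffer;
  }

  public static void main(String[] args) {
    char[] text = "Hello World".toCharArray();

    // CharBuffer.wrap(char[]) yields a heap buffer backed by 'text'.
    toLowerCaseInPlace(CharBuffer.wrap(text));

    // Each trigram is a window onto the same array; the chars are not copied.
    int n = 3;
    for (int i = 0; i + n <= text.length; i++) {
      CharBuffer gram = CharBuffer.wrap(text, i, n);
      // Printing converts to String; real code would keep using the buffer.
      System.out.println(gram);
    }
  }
}
```

Note that CharBuffer#equals and #hashCode depend on the remaining content, so
a buffer used as a map key must not be modified afterwards.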
Note:
* A corresponding patch / PR should be tested with/against the Evaluation
suite.
Comments welcome.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)