Martin Wiesner created OPENNLP-1505:
---------------------------------------

             Summary: Reduce object creation in NGramCharModel and StringUtil
                 Key: OPENNLP-1505
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1505
             Project: OpenNLP
          Issue Type: Improvement
          Components: Language Detector
    Affects Versions: 2.2.0
            Reporter: Martin Wiesner
            Assignee: Martin Wiesner
             Fix For: 2.2.1


During a profiling session, I noticed that many tests in 
opennlp.tools.langdetect take a long time to execute. Digging deeper 
into those tests, it quickly became obvious that StringUtil#toLowerCase() 
creates a new String on every call (see NGramCharModel#add(...), 
lines 99 to 108).

Because NGramCharModel calls this method very frequently, millions of 
String objects are created while building ngrams for a given input.

Aims:
 * Reduce object creation, in particular the creation of millions of String objects
 * Improve runtime of the langdetect tests (and potentially others)

Idea:
 * Use (Heap)CharBuffer instead of String so that the underlying char arrays 
can be re-used, instead of copying the chars into a new String for each 
"toLowerCase" call.

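A minimal sketch of the idea, not the actual patch: the class and method 
names below are hypothetical and not part of OpenNLP. It assumes the ngram 
chars are held in a writable, array-backed CharBuffer (CharBuffer.wrap(char[]) 
returns one) and lower-cases them in place, so no new String is allocated per 
call.

```java
import java.nio.CharBuffer;

// Hypothetical illustration only; not an OpenNLP class.
final class LowerCaseSketch {

    // Lower-cases the remaining chars of the buffer in place and returns
    // the same buffer instance. No String or char[] is allocated here.
    static CharBuffer toLowerCaseInPlace(CharBuffer buffer) {
        // Absolute get/put do not move the buffer's position.
        for (int i = buffer.position(); i < buffer.limit(); i++) {
            buffer.put(i, Character.toLowerCase(buffer.get(i)));
        }
        return buffer;
    }

    public static void main(String[] args) {
        // wrap(char[]) yields a writable heap buffer backed by the array;
        // wrap(CharSequence) would be read-only and put(...) would fail.
        char[] backing = "Hello NGram".toCharArray();
        CharBuffer buffer = CharBuffer.wrap(backing);

        toLowerCaseInPlace(buffer);

        // The backing array itself was modified, so it can be re-used.
        System.out.println(buffer);
    }
}
```

Note that Character.toLowerCase(char) works on single UTF-16 code units, so 
this sketch does not handle supplementary characters (surrogate pairs) and 
may differ from String#toLowerCase() for some special mappings. Whether that 
matters for langdetect would need to be checked.
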
Note:
 * A corresponding patch / PR should be tested against the Evaluation 
suite.

Comments welcome.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
