[ https://issues.apache.org/jira/browse/OPENNLP-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747137#comment-17747137 ]
ASF GitHub Bot commented on OPENNLP-1505:
-----------------------------------------
mawiesne commented on PR #543:
URL: https://github.com/apache/opennlp/pull/543#issuecomment-1650455900
```
[INFO] Apache OpenNLP Reactor ............................. SUCCESS [ 26.105 s]
[INFO] Apache OpenNLP Tools ............................... SUCCESS [ 07:04 h]
[INFO] Apache OpenNLP UIMA Annotators ..................... SUCCESS [ 7.265 s]
[INFO] Apache OpenNLP Brat Annotator ...................... SUCCESS [ 2.839 s]
[INFO] Apache OpenNLP Morfologik Addon .................... SUCCESS [ 5.090 s]
[INFO] Apache OpenNLP Documentation ....................... SUCCESS [ 0.070 s]
[INFO] Apache OpenNLP Distribution ........................ SUCCESS [33:11 min]
[INFO] Apache OpenNLP DL .................................. SUCCESS [ 16.839 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:38 h
[INFO] Finished at: 2023-07-25T19:17:37Z
[INFO] ------------------------------------------------------------------------
Finished: SUCCESS
```
> Reduce object creation in NGramCharModel and StringUtil
> -------------------------------------------------------
>
> Key: OPENNLP-1505
> URL: https://issues.apache.org/jira/browse/OPENNLP-1505
> Project: OpenNLP
> Issue Type: Improvement
> Components: Language Detector
> Affects Versions: 2.2.0
> Reporter: Martin Wiesner
> Assignee: Martin Wiesner
> Priority: Major
> Fix For: 2.2.1
>
>
> During a profiling session, I noticed that many tests in
> opennlp.tools.langdetect take quite some time to execute. Digging deeper
> into those tests, it quickly became obvious that StringUtil#toLowerCase()
> creates a new String on every call (see NGramCharModel#add(...), lines 99
> to 108).
> Because NGramCharModel calls this method very frequently, millions of String
> objects are created while building ngrams for a given input.
> Aims:
> * Reduce object creation, and thus avoid creating millions of String objects
> * Improve runtime of the langdetect tests (and potentially others)
> Idea:
> * Use (Heap)CharBuffer instead of String so that underlying char arrays can
> be re-used, instead of copying the chars over to a new string for each
> "toLowerCase"...
> Note:
> * A corresponding patch / PR should be tested with/against the Evaluation
> suite.
> Comments welcome.
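
To illustrate the idea quoted above, here is a minimal sketch and not the code from the PR. It assumes the input is lowercased once into a single char array and that each ngram is then represented as a CharBuffer view over that array, so no characters are copied per ngram. The class and method names (NGramLowerCaseSketch, lowerCaseOnce, countNGrams) are hypothetical and not part of OpenNLP. The per-char Character.toLowerCase mapping is also a simplification, since it ignores locale-specific and multi-char case mappings.

```java
import java.nio.CharBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of OpenNLP.
final class NGramLowerCaseSketch {

    // Lowercases the whole input once into one char array.
    // Assumption: a simple per-char mapping is good enough for the sketch.
    static char[] lowerCaseOnce(CharSequence text) {
        char[] chars = new char[text.length()];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(text.charAt(i));
        }
        return chars;
    }

    // Counts all ngrams of length n. Each key is a CharBuffer view that
    // shares the lowercased array, so no per-ngram char copy is made.
    // CharBuffer's equals/hashCode are based on the remaining chars, which
    // makes the views usable as map keys as long as the shared array and
    // the buffer positions are not modified afterwards.
    static Map<CharBuffer, Integer> countNGrams(CharSequence text, int n) {
        char[] lower = lowerCaseOnce(text);
        Map<CharBuffer, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= lower.length; i++) {
            CharBuffer gram = CharBuffer.wrap(lower, i, n);
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }
}
```

A call such as NGramLowerCaseSketch.countNGrams("AbAb", 2) still allocates one small CharBuffer view per ngram, but it avoids creating a new lowercased String (and a new char copy) for each one. Whether this pays off in practice would have to be measured, for example with the langdetect tests and the evaluation suite mentioned in the issue.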
--
This message was sent by Atlassian Jira
(v8.20.10#820010)