[
https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636768#comment-16636768
]
ASF GitHub Bot commented on OPENNLP-1214:
-----------------------------------------
autayeu commented on issue #329: OPENNLP-1214: use hash to avoid linear search
in DefaultEndOfSentence…
URL: https://github.com/apache/opennlp/pull/329#issuecomment-426588520
Given that the goal of this improvement is to speed things up, do you think
the test below is realistic? Do you think the results would hold across other
JVMs?
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import opennlp.tools.sentdetect.lang.Factory;

class Scratch {

  private static final int ITERATIONS = 100_000_000;

  // Set-based lookup, filled from Factory.defaultEosCharacters only.
  private static Set<Character> eosCharacters;

  public static void main(String[] args) {
    eosCharacters = new HashSet<>();
    for (char eosChar : Factory.defaultEosCharacters) {
      eosCharacters.add(eosChar);
    }

    // Each run scans a 20-char buffer filled with a single character.
    char[] cbuf = new char[20];

    System.out.println("defaultEosCharacters");
    for (char eos : Factory.defaultEosCharacters) {
      Arrays.fill(cbuf, eos);
      testBuffer(cbuf);
    }

    System.out.println("ptEosCharacters");
    for (char eos : Factory.ptEosCharacters) {
      Arrays.fill(cbuf, eos);
      testBuffer(cbuf);
    }

    System.out.println("jpnEosCharacters");
    for (char eos : Factory.jpnEosCharacters) {
      Arrays.fill(cbuf, eos);
      testBuffer(cbuf);
    }
  }

  // Times ITERATIONS scans of cbuf, first with the array lookup, then with the set lookup.
  private static void testBuffer(char[] cbuf) {
    System.out.println("Testing with: " + new String(cbuf));
    {
      long start = System.currentTimeMillis();
      for (int n = 0; n < ITERATIONS; n++) {
        getPositionsArray(cbuf);
      }
      long duration = System.currentTimeMillis() - start;
      System.out.println("Duration array (ms): " + duration);
    }
    {
      long start = System.currentTimeMillis();
      for (int n = 0; n < ITERATIONS; n++) {
        getPositionsHashset(cbuf);
      }
      long duration = System.currentTimeMillis() - start;
      System.out.println("Duration set (ms): " + duration);
    }
  }

  // Linear search: compares each buffer char against every default EOS char.
  // Note: this always matches against the default set, whatever the buffer holds.
  public static List<Integer> getPositionsArray(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    char[] eosCharacters = Factory.defaultEosCharacters;
    for (int i = 0; i < cbuf.length; i++) {
      for (char eosCharacter : eosCharacters) {
        if (cbuf[i] == eosCharacter) {
          l.add(i);
          break;
        }
      }
    }
    return l;
  }

  // Hash lookup: each char is autoboxed to Character for Set.contains().
  public static List<Integer> getPositionsHashset(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (eosCharacters.contains(cbuf[i])) {
        l.add(i);
      }
    }
    return l;
  }
}
```
```text
"C:\Program Files\Java\jdk1.8.0_162\bin\java.exe" ....
defaultEosCharacters
Testing with: ....................
Duration array (ms): 16424
Duration set (ms): 25844
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 17498
Duration set (ms): 26696
Testing with: ????????????????????
Duration array (ms): 17948
Duration set (ms): 25391
ptEosCharacters
Testing with: ....................
Duration array (ms): 16975
Duration set (ms): 25442
Testing with: ????????????????????
Duration array (ms): 18012
Duration set (ms): 25529
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 17562
Duration set (ms): 25579
Testing with: ;;;;;;;;;;;;;;;;;;;;
Duration array (ms): 4040
Duration set (ms): 6223
Testing with: ::::::::::::::::::::
Duration array (ms): 3991
Duration set (ms): 6276
Testing with: ((((((((((((((((((((
Duration array (ms): 3980
Duration set (ms): 6185
Testing with: ))))))))))))))))))))
Duration array (ms): 4043
Duration set (ms): 6199
Testing with: ««««««««««««««««««««
Duration array (ms): 3971
Duration set (ms): 8503
Testing with: »»»»»»»»»»»»»»»»»»»»
Duration array (ms): 3960
Duration set (ms): 8587
Testing with: ''''''''''''''''''''
Duration array (ms): 3920
Duration set (ms): 5450
Testing with: """"""""""""""""""""
Duration array (ms): 3931
Duration set (ms): 5396
jpnEosCharacters
Testing with: 。。。。。。。。。。。。。。。。。。。。
Duration array (ms): 3974
Duration set (ms): 8616
Testing with: ！！！！！！！！！！！！！！！！！！！！
Duration array (ms): 3908
Duration set (ms): 9276
Testing with: ？？？？？？？？？？？？？？？？？？？？
Duration array (ms): 3953
Duration set (ms): 9278
Process finished with exit code 0
```
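One factor in whether a test like this is realistic is JIT warm-up: the first timed loop may include interpretation and compilation time. The following is a minimal sketch, not part of the original comment, of a timing helper that runs untimed warm-up rounds before measuring. The class name `WarmupTimingSketch`, the helper `timeNanos`, and the round counts are all hypothetical, chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper; names and round counts are assumptions for illustration.
public class WarmupTimingSketch {

  // Runs the task warmupRuns times untimed, then measuredRuns times timed.
  // Returns the elapsed nanoseconds of the timed runs only.
  public static long timeNanos(Runnable task, int warmupRuns, int measuredRuns) {
    for (int i = 0; i < warmupRuns; i++) {
      task.run();
    }
    long start = System.nanoTime();
    for (int i = 0; i < measuredRuns; i++) {
      task.run();
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    // The sink keeps the scan result observable so the work is not trivially dead code.
    AtomicLong sink = new AtomicLong();
    char[] buf = "....................".toCharArray();
    Runnable scan = () -> {
      for (char c : buf) {
        if (c == '.') {
          sink.incrementAndGet();
        }
      }
    };
    long nanos = timeNanos(scan, 10_000, 100_000);
    System.out.println("Elapsed (ms): " + nanos / 1_000_000 + ", hits: " + sink.get());
  }
}
```

A dedicated benchmark harness would control for more effects than this does; the sketch only separates warm-up from measurement.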
> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --------------------------------------------------------------
>
> Key: OPENNLP-1214
> URL: https://issues.apache.org/jira/browse/OPENNLP-1214
> Project: OpenNLP
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to
> check whether each character in the sentence is one of the eos characters. I
> think we'd better use a HashSet to hold eosCharacters instead of a char[].
> In accordance with this replacement, I'd like to deprecate
> getEndOfSentenceCharacters() because it returns char[] and nobody
> in OpenNLP calls it at present, and I'd like to add an equivalent method
> that returns a Set<Character> of eos chars. The set cannot keep the order of
> the eos chars, but I don't think that will be a problem.
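
The change described above can be sketched as follows. This is a minimal illustration, not the actual DefaultEndOfSentenceScanner source: the class `EosScanSketch` and its method names are hypothetical, and the EOS characters are assumed to be the three default ones shown in the benchmark output ('.', '!', '?').

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the real OpenNLP class.
public class EosScanSketch {

  // Assumed EOS characters, taken from the default set in the benchmark output.
  private static final char[] EOS_ARRAY = {'.', '!', '?'};
  private static final Set<Character> EOS_SET;

  static {
    Set<Character> s = new HashSet<>();
    for (char c : EOS_ARRAY) {
      s.add(c);
    }
    EOS_SET = Collections.unmodifiableSet(s);
  }

  // Existing approach: linear search over the char[] for every input character.
  public static List<Integer> positionsWithArray(char[] buf) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < buf.length; i++) {
      for (char eos : EOS_ARRAY) {
        if (buf[i] == eos) {
          positions.add(i);
          break;
        }
      }
    }
    return positions;
  }

  // Proposed approach: one hash lookup per input character (the char is autoboxed).
  public static List<Integer> positionsWithSet(char[] buf) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < buf.length; i++) {
      if (EOS_SET.contains(buf[i])) {
        positions.add(i);
      }
    }
    return positions;
  }

  // Hypothetical Set-returning counterpart of getEndOfSentenceCharacters().
  // A HashSet does not preserve the order in which the characters were added.
  public static Set<Character> getEosCharacters() {
    return EOS_SET;
  }

  public static void main(String[] args) {
    char[] text = "Hi. Ok? Yes!".toCharArray();
    System.out.println(positionsWithArray(text)); // prints [2, 6, 11]
    System.out.println(positionsWithSet(text));   // prints [2, 6, 11]
  }
}
```

Both methods return the same positions; they differ only in how membership is tested, which is what the benchmark in the comment above measures.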