[
https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637739#comment-16637739
]
ASF GitHub Bot commented on OPENNLP-1214:
-----------------------------------------
kojisekig commented on issue #329: OPENNLP-1214: use hash to avoid linear
search in DefaultEndOfSentence…
URL: https://github.com/apache/opennlp/pull/329#issuecomment-426866309
Interesting!
I'm not sure the above test is realistic, because some of its conditions look
extreme to me: ITERATIONS = 100_000_000, testing against a repetition of a
single eos char (e.g. Testing with: ....................), and using only
Factory.defaultEosCharacters, which has just three eos chars.
If I change the code as follows:
```
private static final int ITERATIONS = 10000;

public static void main(String[] args) {
  // use Factory.ptEosCharacters rather than Factory.defaultEosCharacters
  eosCharacters = new HashSet<>();
  for (char eosChar : Factory.ptEosCharacters) {
    eosCharacters.add(eosChar);
  }
  // use normal sentences rather than ....................
  char[] cbuf = ("I think you are better off sending an email to the solr-user mailing "
      + "list (http://lucene.apache.org/solr/community.html#mailing-lists-irc) and explaining "
      + "more about your use case so we can understand what leads up to the dump. Most likely you "
      + "will find ways to reconfigure your cluster or queries in a way that avoids this situation. "
      + "Or perhaps your cluster is simply under-dimensioned.").toCharArray();
  testBuffer(cbuf);
}

public static List<Integer> getPositionsArray(char[] cbuf) {
  List<Integer> l = new ArrayList<>();
  // use Factory.ptEosCharacters rather than Factory.defaultEosCharacters
  char[] eosCharacters = Factory.ptEosCharacters;
  for (int i = 0; i < cbuf.length; i++) {
    // linear search over the eos chars for every character of the input
    for (char eosCharacter : eosCharacters) {
      if (cbuf[i] == eosCharacter) {
        l.add(i);
        break;
      }
    }
  }
  return l;
}
```
I got the following result, which shows the opposite:
```
Duration array (ms): 197
Duration set (ms): 73
```
But I think your feedback is very interesting and highly appreciated. Thank
you. :)
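
For anyone who wants to repeat the comparison without the original test class, here is a minimal self-contained sketch. It rests on assumptions: the class name EosLookupSketch, the EOS_CHARS array (a stand-in for Factory.ptEosCharacters, not its real contents), and the sample sentence are all made up for illustration. The timing loop is naive, so the printed durations are only indicative; a harness such as JMH would be needed for numbers worth trusting.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EosLookupSketch {

  // Assumed stand-in for Factory.ptEosCharacters; not the real contents.
  static final char[] EOS_CHARS = {'.', '?', '!', ';', ':', '(', ')', '"', '\''};

  static final Set<Character> EOS_SET = new HashSet<>();
  static {
    for (char c : EOS_CHARS) {
      EOS_SET.add(c);
    }
  }

  // Linear search: compare each input char against every eos char.
  public static List<Integer> positionsArray(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      for (char eos : EOS_CHARS) {
        if (cbuf[i] == eos) {
          l.add(i);
          break;
        }
      }
    }
    return l;
  }

  // Hash lookup: one contains() call per input char (the char is boxed).
  public static List<Integer> positionsSet(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (EOS_SET.contains(cbuf[i])) {
        l.add(i);
      }
    }
    return l;
  }

  public static void main(String[] args) {
    char[] cbuf = "Most likely you will find a way. Or perhaps not (who knows)?".toCharArray();
    int iterations = 10_000;
    long sink = 0; // consume results so the loops are not trivially dead code

    long t0 = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      sink += positionsArray(cbuf).size();
    }
    long t1 = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      sink += positionsSet(cbuf).size();
    }
    long t2 = System.nanoTime();

    System.out.println("Duration array (ms): " + (t1 - t0) / 1_000_000);
    System.out.println("Duration set (ms): " + (t2 - t1) / 1_000_000);
    System.out.println("Same positions: " + positionsArray(cbuf).equals(positionsSet(cbuf)));
    System.out.println("(sink: " + sink + ")");
  }
}
```

Both methods must return identical positions, so the sketch prints that check as well. Which variant is faster will depend on the number of eos chars and on the input text, which is the point of the comment above.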
----------------------------------------------------------------
> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --------------------------------------------------------------
>
> Key: OPENNLP-1214
> URL: https://issues.apache.org/jira/browse/OPENNLP-1214
> Project: OpenNLP
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Koji Sekiguchi
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to
> check whether each character in the sentence is one of the eos characters. I
> think we'd better use a HashSet to keep eosCharacters instead of a char[].
> In accordance with this replacement, I'd like to deprecate
> getEndOfSentenceCharacters() because it returns char[] and nobody in OpenNLP
> calls it at present, and I'd like to add an equivalent method that returns a
> Set<Character> of eos chars. The Set cannot keep the order of the eos chars,
> but I don't think that will be a problem.
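
The change proposed in the issue description could look roughly like the sketch below. It is not the actual OpenNLP code: the class name SetBackedEosScanner and the accessor name getEOSCharacters() are assumptions chosen for illustration, and only getEndOfSentenceCharacters() is a name taken from the issue text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class; illustrates the proposal, not the real scanner.
class SetBackedEosScanner {

  private final Set<Character> eosCharacters;

  public SetBackedEosScanner(char[] eosChars) {
    Set<Character> s = new HashSet<>();
    for (char c : eosChars) {
      s.add(c);
    }
    this.eosCharacters = Collections.unmodifiableSet(s);
  }

  // Assumed name for the new accessor; the order of the chars is not preserved.
  public Set<Character> getEOSCharacters() {
    return eosCharacters;
  }

  /** @deprecated kept for compatibility; the returned order is unspecified. */
  @Deprecated
  public char[] getEndOfSentenceCharacters() {
    char[] out = new char[eosCharacters.size()];
    int i = 0;
    for (char c : eosCharacters) {
      out[i++] = c;
    }
    return out;
  }

  // One hash lookup per input char instead of a scan over all eos chars.
  public List<Integer> getPositions(char[] cbuf) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (eosCharacters.contains(cbuf[i])) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    SetBackedEosScanner scanner = new SetBackedEosScanner(new char[] {'.', '!', '?'});
    // prints [6, 11, 17]
    System.out.println(scanner.getPositions("Really? Yes. Good!".toCharArray()));
  }
}
```

Keeping the deprecated char[] accessor lets existing callers compile while they migrate, at the cost of rebuilding the array from the set on each call.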