[ 
https://issues.apache.org/jira/browse/OPENNLP-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637739#comment-16637739
 ] 

ASF GitHub Bot commented on OPENNLP-1214:
-----------------------------------------

kojisekig commented on issue #329: OPENNLP-1214: use hash to avoid linear 
search in DefaultEndOfSentence…
URL: https://github.com/apache/opennlp/pull/329#issuecomment-426866309
 
 
   Interesting!
   
   I'm not sure the above test is realistic, because some of its conditions look 
extreme to me: ITERATIONS = 100_000_000, an input that just repeats one eos 
char (e.g. Testing with: ....................), and only 
Factory.defaultEosCharacters, which has just three eos chars.
   
   If I change the code as follows:
   
   ```
     private static final int ITERATIONS = 10000;
   
     public static void main(String[] args) {
       // use Factory.ptEosCharacters rather than Factory.defaultEosCharacters
       eosCharacters = new HashSet<>();
       for (char eosChar: Factory.ptEosCharacters) {
         eosCharacters.add(eosChar);
       }
   
       // use normal sentences rather than ....................
       char[] cbuf = ("I think you are better off sending an email to the solr-user mailing " +
           "list (http://lucene.apache.org/solr/community.html#mailing-lists-irc) and explaining " +
           "more about your use case so we can understand what leads up to the dump. Most likely you " +
           "will find ways to reconfigure your cluster or queries in a way that avoids this situation. " +
           "Or perhaps your cluster is simply under-dimensioned.").toCharArray();
       testBuffer(cbuf);
     }
   
     public static List<Integer> getPositionsArray(char[] cbuf) {
       List<Integer> l = new ArrayList<>();
       // use Factory.ptEosCharacters rather than Factory.defaultEosCharacters
       char[] eosCharacters = Factory.ptEosCharacters;
       for (int i = 0; i < cbuf.length; i++) {
         for (char eosCharacter : eosCharacters) {
           if (cbuf[i] == eosCharacter) {
             l.add(i);
             break;
           }
         }
       }
       return l;
     }
   ```
   
   I got the following result, which shows the opposite (the set is faster):
   
   ```
   Duration array (ms): 197
   Duration set (ms): 73
   ```
   
   But I think your feedback is very interesting and highly appreciated. Thank 
you. :)
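
For anyone who wants to repeat the comparison without the original test class, here is a minimal self-contained sketch. It is not the benchmark from the pull request: the class name `EosLookupSketch`, the `EOS_CHARS` list (a hypothetical stand-in for a larger eos set such as Factory.ptEosCharacters), and the sample sentence are all assumptions made for illustration. It times a linear scan over a char[] against a HashSet lookup and checks that both report the same positions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; not the test code discussed in the pull request.
final class EosLookupSketch {

  // Assumed eos list for illustration only; the real OpenNLP arrays may differ.
  static final char[] EOS_CHARS = {'.', '?', '!', ';', ':', '(', ')', '\'', '"'};

  static final Set<Character> EOS_SET = new HashSet<>();

  static {
    for (char c : EOS_CHARS) {
      EOS_SET.add(c);
    }
  }

  // Linear search: compare each buffer character against every eos character.
  static List<Integer> positionsArray(char[] cbuf) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      for (char eos : EOS_CHARS) {
        if (cbuf[i] == eos) {
          positions.add(i);
          break;
        }
      }
    }
    return positions;
  }

  // Hash lookup: one contains() call per buffer character.
  static List<Integer> positionsSet(char[] cbuf) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (EOS_SET.contains(cbuf[i])) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    char[] cbuf = "Is it faster? Maybe (it depends). Measure it!".toCharArray();
    int iterations = 10000;

    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      positionsArray(cbuf);
    }
    long arrayMs = (System.nanoTime() - start) / 1_000_000;

    start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      positionsSet(cbuf);
    }
    long setMs = (System.nanoTime() - start) / 1_000_000;

    System.out.println("Duration array (ms): " + arrayMs);
    System.out.println("Duration set (ms): " + setMs);
    System.out.println("Same positions: " + positionsArray(cbuf).equals(positionsSet(cbuf)));
  }
}
```

Timings from a loop like this depend heavily on JIT warm-up, the input text, and the number of eos chars, so treat the numbers as a rough indication rather than a measurement.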



> use hash to avoid linear search in DefaultEndOfSentenceScanner
> --------------------------------------------------------------
>
>                 Key: OPENNLP-1214
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1214
>             Project: OpenNLP
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.9.1
>
>
> When DefaultEndOfSentenceScanner scans a sentence, it uses linear search to 
> check whether each character in the sentence is one of the eos characters. I 
> think we'd better use a HashSet to hold eosCharacters instead of a char[].
> In accordance with this replacement, I'd like to deprecate 
> getEndOfSentenceCharacters() because it returns char[] and nothing in OpenNLP 
> calls it at present, and I'd like to add an equivalent method that returns a 
> Set<Character> of eos chars. The set cannot preserve the order of the eos 
> chars, but I don't think that will be a problem.
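
The proposed API change might look roughly like the sketch below. This is not the actual DefaultEndOfSentenceScanner source: the class name `SetBackedEosScanner`, the accessor name `getEosCharacters()`, and the helper `isEos()` are hypothetical, chosen only to illustrate keeping a deprecated char[] getter next to a new Set<Character> accessor.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the idea in the issue; names are assumptions.
final class SetBackedEosScanner {

  private final Set<Character> eosCharacters;

  SetBackedEosScanner(char[] eosChars) {
    Set<Character> set = new HashSet<>();
    for (char c : eosChars) {
      set.add(c);
    }
    this.eosCharacters = Collections.unmodifiableSet(set);
  }

  // Kept for compatibility; the order of the returned chars is unspecified
  // because the backing HashSet does not preserve insertion order.
  @Deprecated
  char[] getEndOfSentenceCharacters() {
    char[] chars = new char[eosCharacters.size()];
    int i = 0;
    for (char c : eosCharacters) {
      chars[i++] = c;
    }
    return chars;
  }

  // Hypothetical name for the new accessor returning the set itself.
  Set<Character> getEosCharacters() {
    return eosCharacters;
  }

  // Constant-time membership test that replaces the linear search.
  boolean isEos(char c) {
    return eosCharacters.contains(c);
  }
}
```

If callers ever need a stable order, a LinkedHashSet could be swapped in for the HashSet at the cost of a little extra memory.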


