Interesting!

I'm not sure the above test is realistic, because some of its conditions look 
extreme to me: ITERATIONS = 100_000_000, input that is just one eos char 
repeated (e.g. Testing with: ....................), and only 
Factory.defaultEosCharacters, which has just three eos chars.

If I change the code as follows:

```
  private static final int ITERATIONS = 10000;

  public static void main(String[] args) {
    // use Factory.ptEosCharacters rather than Factory.defaultEosCharacters
    eosCharacters = new HashSet<>();
    for (char eosChar: Factory.ptEosCharacters) {
      eosCharacters.add(eosChar);
    }

    // use normal sentences rather than ....................
    char[] cbuf = ("I think you are better off sending an email to the solr-user mailing " +
        "list (http://lucene.apache.org/solr/community.html#mailing-lists-irc) and explaining " +
        "more about your use case so we can understand what leads up to the dump. Most likely you " +
        "will find ways to reconfigure your cluster or queries in a way that avoids this situation. " +
        "Or perhaps your cluster is simply under-dimensioned.").toCharArray();
    testBuffer(cbuf);
  }

  public static List<Integer> getPositionsArray(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    // use Factory.ptEosCharacters rather than Factory.defaultEosCharacters
    char[] eosCharacters = Factory.ptEosCharacters;
    for (int i = 0; i < cbuf.length; i++) {
      for (char eosCharacter : eosCharacters) {
        if (cbuf[i] == eosCharacter) {
          l.add(i);
          break;
        }
      }
    }
    return l;
  }
```

I got the following result, which shows the opposite:

```
Duration array (ms): 197
Duration set (ms): 73
```
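For anyone who wants to reproduce the comparison without the OpenNLP classes on the classpath, here is a minimal sketch of the two lookups side by side. The class name `EosLookupSketch` and the `EOS_CHARS` table are hypothetical stand-ins I made up for illustration (they are not `Factory.ptEosCharacters`), and the `main` method only checks that both variants find the same positions; it is not a benchmark.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EosLookupSketch {

  // Hypothetical stand-in for an eos character table; NOT the actual OpenNLP array.
  static final char[] EOS_CHARS = {'.', '?', '!', ';', ':', '(', ')'};

  // Same characters as a set, built once up front.
  static final Set<Character> EOS_SET = new HashSet<>();
  static {
    for (char c : EOS_CHARS) {
      EOS_SET.add(c);
    }
  }

  // Array variant: linear scan over the eos table for every input character.
  static List<Integer> positionsWithArray(char[] cbuf) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      for (char eos : EOS_CHARS) {
        if (cbuf[i] == eos) {
          positions.add(i);
          break;
        }
      }
    }
    return positions;
  }

  // Set variant: one hash lookup per input character (each char is boxed to Character).
  static List<Integer> positionsWithSet(char[] cbuf) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (EOS_SET.contains(cbuf[i])) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    char[] cbuf = "Is it fast? Maybe; measure it (twice).".toCharArray();
    // Sanity check only: both variants should report identical positions.
    System.out.println(positionsWithArray(cbuf).equals(positionsWithSet(cbuf)));
  }
}
```

The array variant does work proportional to the table size for every character that is not an eos char, which is most characters in normal text, while the set variant pays a roughly constant hashing and boxing cost per character. That may be why the outcome depends so much on the input and on how many eos chars are in the table.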

Still, I think your feedback is very interesting, and it is highly appreciated. 
Thank you. :)

[ Full content available at: https://github.com/apache/opennlp/pull/329 ]