Paul Austin created OPENNLP-1357:
------------------------------------

             Summary: Use CharSequence to allow for memory management
                 Key: OPENNLP-1357
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1357
             Project: OpenNLP
          Issue Type: New Feature
          Components: Sentence Detector
    Affects Versions: 1.9.4
            Reporter: Paul Austin


Most of the classes in OpenNLP require their input as a String, 
StringBuffer, or char[]. This means that all of the data has to be loaded 
into memory.

Many of these cases (String and StringBuffer args) could be replaced with a 
single method that accepts CharSequence as a parameter.

For example, in DefaultEndOfSentenceScanner:

 
{code:java}
public List<Integer> getPositions(CharSequence s) {
  List<Integer> l = new ArrayList<>();
  for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    if (eosCharacters.contains(c)) {
      l.add(i);
    }
  }
  return l;
}
{code}
This would let users manage the memory overhead for large data sets and, in 
some cases, avoid temporary conversions to char buffers.
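
As a rough illustration of what this would enable for callers, the sketch below passes a String, a StringBuilder, and a java.nio.CharBuffer to the same CharSequence-based method. The class name CharSequenceScanSketch and the EOS_CHARACTERS set are hypothetical stand-ins for this example. They are not the actual OpenNLP fields.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class CharSequenceScanSketch {

  // Hypothetical stand-in for the scanner's end-of-sentence characters.
  private static final Set<Character> EOS_CHARACTERS = Set.of('.', '!', '?');

  // Same loop as in the proposal: only length() and charAt() are needed,
  // so any CharSequence implementation can be scanned without copying.
  public static List<Integer> getPositions(CharSequence s) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < s.length(); i++) {
      if (EOS_CHARACTERS.contains(s.charAt(i))) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    String text = "Hi. Is it ok? Yes!";

    // A String, a StringBuilder, and a CharBuffer all go through one method.
    System.out.println(getPositions(text));                            // prints [2, 12, 17]
    System.out.println(getPositions(new StringBuilder(text)));         // prints [2, 12, 17]
    System.out.println(getPositions(CharBuffer.wrap(text.toCharArray()))); // prints [2, 12, 17]
  }
}
```

A CharBuffer is used here only because it is a standard CharSequence that is not a String. A caller with a large data set could supply their own implementation backed by whatever storage they manage.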

Some code, such as SDContextGenerator, already uses CharSequence. However, 
SentenceDetectorME makes an unnecessary conversion to a StringBuffer: sb is 
never modified, SDContextGenerator.getContext takes a CharSequence as an 
argument, and String is itself a CharSequence.

 
{code:java}
public Span[] sentPosDetect(String s) {
  sentProbs.clear();
  StringBuffer sb = new StringBuffer(s);
{code}
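
To make the point concrete, here is a minimal sketch showing that a read-only consumer of CharSequence returns the same result for the original String as for a StringBuffer copy of it, so the copy adds nothing. The method contextAround is a hypothetical stand-in for a context generator. It is not the real SDContextGenerator.getContext.

```java
public class NoCopySketch {

  // Hypothetical read-only consumer: returns up to three characters on
  // either side of the given position. It never mutates its input.
  public static String contextAround(CharSequence text, int position) {
    int start = Math.max(0, position - 3);
    int end = Math.min(text.length(), position + 4);
    return text.subSequence(start, end).toString();
  }

  public static void main(String[] args) {
    String s = "Hello. World";

    // Current style: copy the String into a StringBuffer first.
    String viaCopy = contextAround(new StringBuffer(s), 5);

    // Proposed style: pass the String directly, since String is a CharSequence.
    String direct = contextAround(s, 5);

    System.out.println(viaCopy.equals(direct)); // prints true
  }
}
```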
 

I can create one or more pull requests for the above if you think it would be useful.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
