[ https://issues.apache.org/jira/browse/OPENNLP-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Zemerick closed OPENNLP-1357.
----------------------------------
> Use CharSequence to allow for memory management
> -----------------------------------------------
>
> Key: OPENNLP-1357
> URL: https://issues.apache.org/jira/browse/OPENNLP-1357
> Project: OpenNLP
> Issue Type: Improvement
> Components: Sentence Detector
> Affects Versions: 1.9.4
> Reporter: Paul Austin
> Assignee: Martin Wiesner
> Priority: Minor
> Fix For: 2.1.1
>
>
> Most of the classes in OpenNLP require their input to be a String,
> StringBuffer, or char[]. This means that you have to load all the data into
> memory.
> Many of these cases (String and StringBuffer args) could be replaced with a
> single method that accepts CharSequence as a parameter.
> For example, in DefaultEndOfSentenceScanner:
>
> {code:java}
> public List<Integer> getPositions(CharSequence s) {
>   List<Integer> l = new ArrayList<>();
>   for (int i = 0; i < s.length(); i++) {
>     char c = s.charAt(i);
>     if (eosCharacters.contains(c)) {
>       l.add(i);
>     }
>   }
>   return l;
> }
> {code}
> This would allow users to manage the memory overhead for large data sets,
> and in some cases it would require fewer temporary conversions to char buffers.
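To illustrate the point above, here is a minimal sketch of the proposed loop applied to several CharSequence implementations. The class name CharSequenceScanDemo and the EOS_CHARACTERS set are assumptions made for this example only; they are not OpenNLP code.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of OpenNLP.
class CharSequenceScanDemo {

  // Assumed stand-in for the scanner's end-of-sentence characters.
  static final Set<Character> EOS_CHARACTERS = Set.of('.', '!', '?');

  // Same loop as in the proposal: only length() and charAt() are used,
  // so any CharSequence implementation can be scanned in place.
  static List<Integer> getPositions(CharSequence s) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < s.length(); i++) {
      if (EOS_CHARACTERS.contains(s.charAt(i))) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    String text = "Hello there. How are you? Fine!";
    char[] raw = text.toCharArray();

    // String, StringBuilder and CharBuffer all implement CharSequence.
    System.out.println(getPositions(text));
    System.out.println(getPositions(new StringBuilder(text)));
    // CharBuffer.wrap creates a view over the array; it does not copy it.
    System.out.println(getPositions(CharBuffer.wrap(raw)));
  }
}
```

Because the method only depends on the interface, a caller could in principle supply its own CharSequence backed by a file or a paged buffer, which is where the memory saving would come from.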
> Some code, such as the SDContextGenerator, already uses CharSequence. However,
> in SentenceDetectorME there is an unnecessary conversion to a StringBuffer.
> The sb isn't modified, SDContextGenerator.getContext takes a
> CharSequence as an argument, and String is a CharSequence.
>
> {code:java}
> public Span[] sentPosDetect(String s) {
>   sentProbs.clear();
>   StringBuffer sb = new StringBuffer(s);
> {code}
>
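As a rough sketch of why the copy is avoidable: when the text is only read, a String can be handed straight to a method that takes a CharSequence. ToySentenceDetector and both of its methods below are hypothetical stand-ins written for this example; they are not the SentenceDetectorME or SDContextGenerator implementations.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for illustration only; not OpenNLP's SentenceDetectorME.
class ToySentenceDetector {

  // Hypothetical, simplified context function. Like the getContext method
  // described in the issue, it accepts a CharSequence and only reads it.
  static String getContext(CharSequence text, int position) {
    int start = Math.max(0, position - 3);
    return text.subSequence(start, position + 1).toString();
  }

  // Reads the input in place: no intermediate StringBuffer is created.
  static List<String> contextsAtPeriods(CharSequence text) {
    List<String> contexts = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      if (text.charAt(i) == '.') {
        contexts.add(getContext(text, i));
      }
    }
    return contexts;
  }

  public static void main(String[] args) {
    // A String is passed directly because String implements CharSequence.
    System.out.println(contextsAtPeriods("One. Two."));
  }
}
```

Existing callers that pass a String would keep compiling against such a signature, since the parameter type is only widened.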
> I can create a pull request(s) for the above if you think it is useful.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)