Chris Lewis created OPENNLP-738:
-----------------------------------

             Summary: AbstractDataIndexer#sortAndMerge sets up callers for a 
NullPointerException
                 Key: OPENNLP-738
                 URL: https://issues.apache.org/jira/browse/OPENNLP-738
             Project: OpenNLP
          Issue Type: Bug
            Reporter: Chris Lewis


In its constructor, the {{OnePassDataIndexer}} calls {{sortAndMerge}} of its 
parent class, {{AbstractDataIndexer}} (source file 
{{opennlp-tools/src/main/java/opennlp/tools/ml/model/AbstractDataIndexer.java}}).
A quick read through the source of these two classes shows that the member 
variable {{contexts}} is initialized only by this method; otherwise it remains 
{{null}}. When {{sort}} is {{true}} (which it is as called) and there are fewer 
than two events, the method returns early, leaving {{contexts}} uninitialized. 
Note also that {{getContexts}} exposes this variable, and that 
{{GIS.trainModel}} delegates to the {{trainModel}} method of {{GISTrainer}}. 
Line 263 attempts to read {{contexts.length}}; because {{contexts}} is {{null}} 
when the stream holds fewer than two events, this results in a 
{{NullPointerException}}.
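
For illustration, here is a minimal sketch of the pattern described above. The 
classes {{SketchIndexer}} and {{SketchTrainer}} are hypothetical stand-ins that 
I made up to show the shape of the problem. They are not the OpenNLP source, 
and the event type ({{int[]}}) is an assumption chosen only to keep the example 
short.

```java
import java.util.List;

// Hypothetical stand-in for the indexer pattern; not the OpenNLP source.
class SketchIndexer {
    private int[][] contexts; // stays null unless sortAndMerge assigns it

    SketchIndexer(List<int[]> events) {
        sortAndMerge(events, true); // the constructor always passes sort = true
    }

    private void sortAndMerge(List<int[]> events, boolean sort) {
        if (sort && events.size() < 2) {
            return; // early return: contexts is never assigned
        }
        contexts = events.toArray(new int[0][]);
    }

    int[][] getContexts() {
        return contexts; // exposes the possibly-null field to callers
    }
}

// Hypothetical stand-in for the trainer-side dereference.
class SketchTrainer {
    static int countContexts(SketchIndexer indexer) {
        // Throws NullPointerException when getContexts() returns null.
        return indexer.getContexts().length;
    }
}
```

With two or more events {{countContexts}} returns normally; with a single event 
the caller gets a {{NullPointerException}} far from the place where the real 
constraint was violated.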

I'm not an expert in the algorithms relying on this code, but 
[some|http://comments.gmane.org/gmane.comp.apache.opennlp.user/564] 
[googling|http://blog.gmane.org/gmane.comp.apache.opennlp.user/month=20140501] 
shows a few incidents that lead back to this behavior, including at least the 
tickets OPENNLP-316 and OPENNLP-488. It may be that no use of this code can 
function correctly with fewer than 2 events, but I don't know that. As such, 
not being an expert on the natural constraints of the inputs to 
{{sortAndMerge}}, I'd like to suggest 2 possible improvements: 1) initialize 
{{contexts}} and the other private arrays that are set in the >= 2 path of this 
code to non-null defaults, or 2) throw an explicit {{IllegalArgumentException}} 
stating that >= 2 events are required for the calculation.

The latter is not as desirable as the former (for which I've attached a patch), 
but at least it provides a targeted, unambiguous reason for why an exception is 
being thrown.
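
To make the two options concrete, here is a rough sketch of how each could 
look. This is not the attached patch: {{GuardedIndexer}} and its {{strict}} 
flag are hypothetical names used only so both options fit in one example, and 
the {{int[]}} event type is again an assumption.

```java
import java.util.List;

// Hypothetical sketch of the two suggestions; this is not the attached patch.
class GuardedIndexer {
    // Option 1: a non-null default, so callers can always read contexts.length.
    private int[][] contexts = new int[0][];

    GuardedIndexer(List<int[]> events, boolean strict) {
        // Option 2: fail fast with a targeted message instead of a later NPE.
        if (strict && events.size() < 2) {
            throw new IllegalArgumentException(
                "At least 2 events are required, got " + events.size());
        }
        if (events.size() >= 2) {
            contexts = events.toArray(new int[0][]);
        }
    }

    int[][] getContexts() {
        return contexts; // never null in this sketch
    }
}
```

In the lenient case a caller sees an empty array rather than {{null}}; in the 
strict case the exception names the actual constraint at the point where it is 
violated.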

Also, I apologize for not specifying the version or component, as I'm not clear 
on how the project source is organized with respect to the published artifacts. 
This issue is present in trunk, whose parent pom claims a version of 
{{1.6.1-SNAPSHOT}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
