Chris Lewis created OPENNLP-738:
-----------------------------------
Summary: AbstractDataIndexer#sortAndMerge sets up callers for a
NullPointerException
Key: OPENNLP-738
URL: https://issues.apache.org/jira/browse/OPENNLP-738
Project: OpenNLP
Issue Type: Bug
Reporter: Chris Lewis
In its constructor, the {{OnePassDataIndexer}} calls {{sortAndMerge}} of its
parent class, {{AbstractDataIndexer}} (source file
{{opennlp-tools/src/main/java/opennlp/tools/ml/model/AbstractDataIndexer.java}}).
A quick read through the source of these two classes shows that the member
variable {{contexts}} is only initialized by this method; otherwise it remains
{{null}}. Note that when {{sort}} is {{true}} (which it is as called) and there
are fewer than two events, the method returns early, thus leaving {{contexts}}
uninitialized. Note also that {{getContexts}} exposes this variable, and that
{{GIS.trainModel}} delegates to the {{trainModel}} method of {{GISTrainer}}.
Line 263 attempts to read {{contexts.length}}, but {{contexts}} itself will be
{{null}} when the stream holds fewer than two events, which results in a
{{NullPointerException}}.
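The failure path described above can be reduced to a small sketch. This is not the OpenNLP source: {{SketchIndexer}} and {{SketchTrainer}} are hypothetical stand-ins, the event type is simplified to {{int[]}}, and the sorting and merging work is left out. It only assumes what the description states, namely that the array is assigned nowhere except past the early return.

```java
import java.util.List;

// Hypothetical stand-in for AbstractDataIndexer; NOT the real OpenNLP class.
class SketchIndexer {
    // Assigned only inside sortAndMerge, so it stays null on the early-return path.
    private int[][] contexts;

    SketchIndexer(List<int[]> events) {
        // Like the constructor described above, always calls with sort == true.
        sortAndMerge(events, true);
    }

    // Simplified: the actual sorting and merging of events is omitted.
    private void sortAndMerge(List<int[]> events, boolean sort) {
        if (sort && events.size() < 2) {
            return; // early return: contexts is never assigned
        }
        contexts = events.toArray(new int[0][]);
    }

    int[][] getContexts() {
        return contexts; // exposes the possibly-null field to callers
    }
}

// Hypothetical stand-in for a trainer that consumes the indexer.
class SketchTrainer {
    static int countContexts(SketchIndexer indexer) {
        int[][] contexts = indexer.getContexts();
        return contexts.length; // throws NullPointerException when contexts is null
    }
}
```

With two or more events {{countContexts}} returns normally; with zero or one event it throws a {{NullPointerException}} far from the place where the array was left unset.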
I'm not an expert in the algorithms relying on this code, but
[some|http://comments.gmane.org/gmane.comp.apache.opennlp.user/564]
[googling|http://blog.gmane.org/gmane.comp.apache.opennlp.user/month=20140501]
shows a few incidents that lead back to this behavior, including at least the
tickets OPENNLP-316 and OPENNLP-488. It may be the case that no use of this
code can function correctly with fewer than two events, but I don't know that.
As such, being a non-expert on the natural constraints of the inputs to
{{sortAndMerge}}, I'd like to suggest two possible improvements: 1) initialize
{{contexts}} and the other private arrays that are set in the >= 2 path of this
code to non-null defaults, or 2) throw an explicit {{IllegalArgumentException}}
stating that >= 2 events are required for the calculation.
The latter is not as desirable as the former (for which I've attached a patch),
but at least it provides a targeted, unambiguous reason for why an exception is
being thrown.
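As a rough illustration of the two suggestions (and not the attached patch), the sketch below shows both on a simplified stand-in. {{SafeSketchIndexer}} and its {{strict}} flag are hypothetical and exist only so that both options fit in one example; the event type is again reduced to {{int[]}}.

```java
import java.util.List;

// Hypothetical stand-in; NOT the real AbstractDataIndexer and NOT the attached patch.
class SafeSketchIndexer {
    // Option 1: a non-null default, so callers never see null.
    private int[][] contexts = new int[0][];

    // 'strict' is an illustrative flag that switches between the two options.
    SafeSketchIndexer(List<int[]> events, boolean strict) {
        sortAndMerge(events, true, strict);
    }

    private void sortAndMerge(List<int[]> events, boolean sort, boolean strict) {
        if (sort && events.size() < 2) {
            if (strict) {
                // Option 2: fail early with a targeted, unambiguous message.
                throw new IllegalArgumentException(
                    "At least 2 events are required, but got " + events.size());
            }
            return; // option 1: contexts keeps its empty default
        }
        contexts = events.toArray(new int[0][]);
    }

    int[][] getContexts() {
        return contexts;
    }
}
```

With {{strict}} set to {{false}}, a single-event input yields an empty, non-null array; with {{strict}} set to {{true}}, the same input fails at construction time with a message naming the actual constraint.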
Also, I apologize for not specifying the version or component, as I'm not clear
on how the project source is organized with respect to the published artifacts.
This issue is present in trunk, whose parent pom declares a version of
{{1.6.1-SNAPSHOT}}.