[ https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072854#comment-13072854 ]
William Colen commented on OPENNLP-238:
---------------------------------------
I verified that the corpus and the dictionary use the same tagset, but I found
some issues:
- the training data is small (4k sentences);
- the tagset is large: more than 200 tags;
- the corpus annotation can combine different tags, and it would be difficult
to add those combinations to the dictionary unless I create the dictionary from
the corpus, but I don't know if that is a good idea.
Examples of combinations:
- when a noun (n) is used as an adjective (adj) the annotation is "n-adj",
and I don't have that in the dictionary;
- sometimes it is not clear from the context whether something is singular (S)
or plural (P), and the person/computer who annotated the corpus added the tag
S/P. I also don't have it in the dictionary;
- the same happens for person: we have 0/1/3 when the person of a verb could
not be decided from the corpus or the word morphology.
What I'm trying to do is to define my own sequence validator that can handle
these cases.
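
A minimal sketch of what such a validator could look like, under several
assumptions: the interface below is a local stand-in shaped like a sequence
validator (it is not the OpenNLP class itself), the dictionary is a plain map
from word to allowed tags, and tags are treated as atomic strings in which "-"
and "/" separate alternatives. The real tagset may compose several fields in
one tag, in which case the splitting would have to be done per field. All class
and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Local stand-in for a sequence validator (assumption, not the library type).
interface TagSequenceValidator {
    boolean validSequence(int i, String[] tokens, String[] previousTags, String outcome);
}

class TolerantTagValidator implements TagSequenceValidator {
    // Hypothetical dictionary: word -> tags allowed for that word.
    private final Map<String, Set<String>> dictionary;

    TolerantTagValidator(Map<String, Set<String>> dictionary) {
        this.dictionary = dictionary;
    }

    // Expands a corpus tag such as "n-adj" or "S/P" into the tag itself
    // plus each of its parts, e.g. {"n-adj", "n", "adj"}.
    static Set<String> alternatives(String tag) {
        Set<String> result = new HashSet<>();
        result.add(tag);
        result.addAll(Arrays.asList(tag.split("[-/]")));
        return result;
    }

    @Override
    public boolean validSequence(int i, String[] tokens, String[] previousTags, String outcome) {
        Set<String> allowed = dictionary.get(tokens[i]);
        if (allowed == null) {
            return true; // word not in the dictionary: no constraint
        }
        // Accept the outcome if any of its alternatives is in the dictionary.
        for (String alternative : alternatives(outcome)) {
            if (allowed.contains(alternative)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("casa", new HashSet<>(Arrays.asList("n")));
        TolerantTagValidator validator = new TolerantTagValidator(dict);
        String[] tokens = {"casa"};
        System.out.println(validator.validSequence(0, tokens, new String[0], "n-adj"));
        System.out.println(validator.validSequence(0, tokens, new String[0], "v"));
    }
}
```

With this rule a combined corpus tag is accepted whenever the dictionary lists
at least one of its parts, so the dictionary itself does not need entries for
every combination.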
> BestSequence method in BeamSearch can cause NullPointerException if it can
> not find a valid sequence
> ----------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-238
> URL: https://issues.apache.org/jira/browse/OPENNLP-238
> Project: OpenNLP
> Issue Type: Bug
> Components: POS Tagger
> Affects Versions: tools-1.5.2-incubating
> Reporter: William Colen
> Assignee: William Colen
> Fix For: tools-1.5.2-incubating
>
>
> I am using the standard sequence validator of the POS Tagger with a
> TagDictionary. Sometimes no outcome matches the tags in the dictionary.
> That causes a NullPointerException in the bestSequence method.
> I think we should add an extra validation: if the heap 'next' is still empty
> after advancing all valid sequences (line 159), we should let it advance
> invalid sequences. It would make the POS Tagger more robust.
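
The fallback proposed in the issue description could be sketched roughly as
follows. This is a simplified illustration, not the BeamSearch code: it
handles a single step, uses a plain list instead of a heap, ignores
probabilities and beam size, and the names BeamStepSketch and advance are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class BeamStepSketch {
    // Returns the outcomes that may extend the current sequence.
    // Valid outcomes are preferred; if none is valid, all outcomes are
    // advanced so that the caller never ends up with an empty 'next'.
    static List<String> advance(List<String> outcomes, Predicate<String> isValid) {
        List<String> next = new ArrayList<>();
        for (String outcome : outcomes) {
            if (isValid.test(outcome)) {
                next.add(outcome);
            }
        }
        if (next.isEmpty()) {
            // Extra validation step: fall back to invalid sequences.
            next.addAll(outcomes);
        }
        return next;
    }

    public static void main(String[] args) {
        List<String> outcomes = Arrays.asList("n", "v");
        // No outcome is valid here, so the fallback returns both.
        System.out.println(advance(outcomes, tag -> tag.equals("adj")));
    }
}
```

The trade-off is that the tagger may return a tag the dictionary does not
allow for a word, in exchange for always producing a sequence.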