[
https://issues.apache.org/jira/browse/OPENNLP-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072966#comment-13072966
]
William Colen commented on OPENNLP-238:
---------------------------------------
I was using the standard sequence validator, but now I am using one with some
hacks, for example to handle "n-adj" and tags containing "/".
Here is an example I found while running the cross validator on the best corpus
I have (Bosque, a newspaper-based, human-reviewed corpus of 4k sentences).
At some point we have the following sentence:
(...) rios, lagos, cachoeiras, montanhas (acompanhas da altitude), parques
nacionais, reservas (...)
which can be translated as "rivers, lakes, waterfalls, mountains (altitude
track), national parks, reserves".
The word "acompanhas" makes no sense here, although it is not misspelled:
"acompanhas" is the verb "to follow" in the present tense, second person
singular (v-fin=PR=2S=IND). I think the right word here should be
"acompanhadas", which is the same verb in the past participle (v-pcp=M=P).
The annotated sentence from the corpus is:
(...) rios_n=M=P ,_, lagos_n=M=P ,_, cachoeiras_n=F=P ,_, montanhas_n=F=P (_(
acompanhas_v-pcp=M=P de_prp a_art=F=S altitude_n=F=S )_) ,_, parques_n=M=P
nacionais_adj=M=P ,_, reservas_n=F=P (...)
So the software that originally created the corpus, or the person who reviewed
it, assigned the POS tag according to the context rather than restricting it
to the morphology of "acompanhas".
When OpenNLP runs on this phrase it evaluates a huge list of outcomes, but
none of them is "v-fin=PR=2S=IND" (I include all outcomes below). That makes
sense, because we shouldn't have that tag in this context. Since the default
sequence validator performs a dictionary search, and the dictionary tag of
"acompanhas" is not in the outcome list, it will not validate any outcome and
the list will end up empty, causing an exception later.
--- outcomes
prop=F=S, n=F=S, v-pcp=F=S, adv, v-fin=PR=3S=IND, art=M=S, n=M=S, adj=M=S, :,
v-ger, art=F=S, adj=F=S, ,, (, num=M=P, n=M=P, ), prp, art=M=P, prop=M=S, .,
conj-s, pron-pers=M=3P=NOM, pron-pers=M=3P=ACC, v-fin=PR=3P=IND, conj-c,
v-fin=PS/MQP=3P=IND, pron-indp=M=S, v-inf, «, », v-fin=PS=3S=IND,
v-fin=FUT=3S=IND, n=F=P, adj=F=P, v-pcp=M=P, v-pcp=M=S, v-inf=3S, pron-det=M=S,
v-fin=IMPF=3S=IND, ec, adj=M=P, pron-det=F=P, pron-indp=F=P, v-fin=IMPF=3P=IND,
v-pcp=F=P, num=M=S, pron-indp=M/F=S, pron-pers=M=3S=NOM, --, pron-det=M=P,
n-adj=M=P, v-fin=COND=3P, art=F=P, num=F=P, pron-indp=F=S, v-fin=PR=1S=IND,
pron-pers=M/F=3S/P=ACC, v-fin=COND=3S, n-adj=M=S, n-adj=F=P, prop=M=P,
pron-det=F=S, v-fin=PR=3S=SUBJ, pron-pers=M=3S=ACC, v-fin=IMPF=3S=SUBJ,
num=F=S, conj-c=<co-postnom, pron-indp=M=P, v-fin=IMPF=3P=SUBJ, adj,
pron-pers=M=3S/P=ACC, v-fin=PR=3P=SUBJ, v-fin=PS=1/3S=IND, pron-pers=F=3S=ACC,
pron-pers=M=3S=NOM/PIV, pron-pers=M/F=1S=DAT, v-fin=PS=1S=IND,
pron-pers=M=3S=DAT, v-pcp, v-fin=FUT=3P=IND, v-inf=3P, pron-pers=F=3S=NOM/PIV,
;, ', prop=F=P, v-fin=PS=1P=IND, art=N=S, ?, v-fin=PR=1P=IND, !,
pron-pers=F=3S=NOM, pron-pers=M/F=3S=ACC, prp=N<ARG, v-fin=FUT=3S=SUBJ,
pron-pers=M=1P=NOM, pron-pers=M/F=1P=NOM/PIV, v-fin=MQP=3S=IND,
v-fin=PS=2S=IND, pron-pers=M=3P=NOM/PIV, P.vp, pron-pers=M=1S=DAT,
pron-pers=M=1S=ACC, pron-pers=F=1S=ACC, adj=M/F=S, pron-pers=F=3P=ACC,
v-fin=IMP=2S, intj, n=M/F=S, pron-pers=M/F=3S=NOM, v-fin=PR=1P=SUBJ,
pron-pers=F=3P=NOM/PIV, v-fin=FUT=1P=IND, pron-pers=M/F=1P=ACC, prop=M/F=S,
pron-pers=M/F=3S=NOM/PIV, v-fin=PR=1/3S=SUBJ, pron-pers=M/F=1S=NOM,
v-fin=IMPF=1S=SUBJ, v-fin=IMPF=1S=IND, pron-pers=F=3P=NOM, ...,
pron-pers=M=1S=NOM, pron-pers=F=3S=DAT, v-fin=FUT=1/3S=SUBJ, num=M/F=P,
n-adj=F=S, n=M=R, conj-c=<co-prparg, pron-pers=M/F=1P=NOM, v-inf=M=S, v-inf=1P,
v-fin=IMPF=1P=IND, -, pron-pers=M=3P=DAT, pron-pers=M/F=1S=ACC,
pron-indp=M/F=S/P, v-fin=MQP=3P=IND, pron-pers=F=1S=DAT, pron-pers=F=1S=PIV,
v-fin=PR=1S=SUBJ, /, v-fin=PR=2P=IND, pron-pers=M/F=2P=NOM, v-fin=COND=1S,
pron-pers=F=1S=NOM, v-fin=FUT=3P=SUBJ, pron-indp=M=S/P, n=M/F=P,
pron-pers=M=3S=PIV, v-fin=FUT=1S=IND, v-inf=1S, pron-pers=M/F=3S=DAT,
v-fin=FUT=1P=SUBJ, pron-pers=M=1P=DAT, v-fin=MQP=1S=IND, v-ger=F=S, n=N=M/F=S,
v-fin=IMP=3P, intj=PS=3S=IND, pron-indp=F=F, pron-pers=F=1P=NOM/PIV,
pron-pers=M/F=1P=DAT, vp=V=PCP=F=S, n=S=S, v-fin=PR=3S, pron-pers=M=1S=PIV,
pron-pers=M/F=3S/P=DAT, v-fin=PS=3P=IND, v-fin=PR=3S=IND=VFIN
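
To make the failure above concrete, here is a minimal sketch of a
dictionary-backed check. It is not the OpenNLP implementation: the class name,
the one-entry dictionary and the helper methods are all hypothetical, and it
assumes that a token missing from the dictionary accepts every outcome.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a dictionary-backed sequence validator; not OpenNLP code.
class DictionaryValidatorSketch {

    // Hypothetical one-entry tag dictionary: token -> tags allowed by morphology.
    static final Map<String, List<String>> TAG_DICTIONARY =
        Map.of("acompanhas", List.of("v-fin=PR=2S=IND"));

    // Assumption: an outcome is valid if the token is unknown to the dictionary,
    // or if the dictionary lists that outcome for the token.
    static boolean isValid(String token, String outcome) {
        List<String> allowed = TAG_DICTIONARY.get(token);
        return allowed == null || allowed.contains(outcome);
    }

    // Keeps only the outcomes the dictionary accepts for the token.
    static List<String> validOutcomes(String token, List<String> outcomes) {
        List<String> valid = new ArrayList<>();
        for (String outcome : outcomes) {
            if (isValid(token, outcome)) {
                valid.add(outcome);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        // A few of the outcomes listed above; "v-fin=PR=2S=IND" is not among them.
        List<String> outcomes = List.of("v-pcp=M=P", "v-fin=PR=3S=IND", "n=F=P");
        // Every outcome is rejected for "acompanhas", so nothing survives.
        System.out.println(validOutcomes("acompanhas", outcomes));
        // "montanhas" is not in this toy dictionary, so every outcome survives.
        System.out.println(validOutcomes("montanhas", outcomes));
    }
}
```

With only the context-based tag (v-pcp=M=P) among the outcomes and only the
morphology-based tag in the dictionary, the filtered list for "acompanhas" is
empty, which is the situation described above.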
> BestSequence method in BeamSearch can cause NullPointerException if it can
> not find a valid sequence
> ----------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-238
> URL: https://issues.apache.org/jira/browse/OPENNLP-238
> Project: OpenNLP
> Issue Type: Bug
> Components: POS Tagger
> Affects Versions: tools-1.5.2-incubating
> Reporter: William Colen
> Assignee: William Colen
> Fix For: tools-1.5.2-incubating
>
>
> I am using the standard sequence validator of POS Tagger with a
> TagDictionary. Sometimes there is no outcome that matches the tags in
> the dictionary. That causes a NullPointerException in the bestSequence method.
> I think we should add an extra validation: if the heap 'next' is still empty
> after advancing all valid sequences (line 159) we should let it advance
> invalid sequences. It would make the POS Tagger more robust.
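
The fallback proposed in the issue could look roughly like the sketch below.
It is not the actual BeamSearch code: the Candidate class, the advance method
and the use of a Predicate as validator are hypothetical, chosen only to show
the "advance invalid sequences if 'next' is empty" step in isolation.

```java
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Predicate;

// Hypothetical sketch of the proposed fallback; not the actual BeamSearch code.
class BeamFallbackSketch {

    // Illustrative candidate: an outcome and its model score.
    static final class Candidate {
        final String outcome;
        final double score;

        Candidate(String outcome, double score) {
            this.outcome = outcome;
            this.score = score;
        }
    }

    // Fills the 'next' heap with the valid candidates. If none is valid, falls
    // back to all candidates, so the heap is empty only when there are no
    // candidates at all.
    static PriorityQueue<Candidate> advance(List<Candidate> candidates,
                                            Predicate<String> validator) {
        // Highest score first.
        PriorityQueue<Candidate> next =
            new PriorityQueue<>((a, b) -> Double.compare(b.score, a.score));
        for (Candidate c : candidates) {
            if (validator.test(c.outcome)) {
                next.add(c);
            }
        }
        if (next.isEmpty()) {
            // Proposed extra step: advance the invalid candidates too.
            next.addAll(candidates);
        }
        return next;
    }
}
```

The design choice here is to prefer valid sequences whenever at least one
exists and to relax the validator only as a last resort, so the dictionary
still constrains the tagger in the normal case.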