[
https://issues.apache.org/jira/browse/UIMA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014118#comment-13014118
]
Nicolas Hernandez commented on UIMA-2106:
-----------------------------------------
As soon as I find out how to assign the task to myself, I can submit a patch.
There are two lines to change in org.apache.uima.examples.tagger.Viterbi.java:
available_pos = word_probs.get("(");
->
available_pos.put("null", Double.MIN_VALUE);
possible_pos_next = word_probs.get("(");
->
possible_pos_next.put("null", Double.MIN_VALUE);
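The idea behind the two changes can be sketched in isolation as follows. This is only an illustration, not the patch itself: the class name, the method name and the map types are assumptions and are not taken from Viterbi.java. The "null" key and Double.MIN_VALUE come from the lines above.

```java
import java.util.HashMap;
import java.util.Map;

public class UnknownTokenFallback {

    /** Placeholder tag used when the model knows nothing about a token. */
    static final String UNKNOWN_TAG = "null";

    /**
     * Returns the tag probabilities of the default token "(" when the model
     * contains them; otherwise returns a one-entry map, so that callers
     * never receive a null reference.
     * The shape of wordProbs (token -> tag -> probability) is an assumption.
     */
    static Map<String, Double> defaultTagProbs(Map<String, Map<String, Double>> wordProbs) {
        Map<String, Double> probs = wordProbs.get("(");
        if (probs == null) {
            probs = new HashMap<String, Double>();
            probs.put(UNKNOWN_TAG, Double.MIN_VALUE);
        }
        return probs;
    }

    public static void main(String[] args) {
        // A toy gender/number model: it has no entry for "(".
        Map<String, Map<String, Double>> model = new HashMap<String, Map<String, Double>>();
        Map<String, Double> tags = new HashMap<String, Double>();
        tags.put("masc-sg", 0.9);
        model.put("chat", tags);

        System.out.println(defaultTagProbs(model));
    }
}
```

With such a guard the tagger gets a usable (if meaningless) distribution for an unknown token instead of a null reference.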
> Handling tokens not present in the language model (and also with no suffix
> present in the model) causes a null pointer exception in the tagger process
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: UIMA-2106
> URL: https://issues.apache.org/jira/browse/UIMA-2106
> Project: UIMA
> Issue Type: Bug
> Components: Sandbox-Tagger
> Affects Versions: 2.3
> Environment: OS
> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu
> 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
> JVM
> java version "1.6.0_17"
> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
> Reporter: Nicolas Hernandez
> Priority: Minor
> Fix For: 2.3
>
> Original Estimate: 5m
> Remaining Estimate: 5m
>
> The HMMTagger Analysis Engine class uses the
> org.apache.uima.examples.tagger.Viterbi.java implementation to determine the
> pos tag list of a given sentence.
> In practice this implementation is partially dependent on part-of-speech
> tagging (as are the remaining HMMTagger classes, actually).
> For example, it makes strong assumptions about the kind of tokens it can take
> as input: it assumes there is no restriction on the tokens' coveredText values.
> As a result, it uses the probabilities of some specific coveredText for
> initialization or as a default value when the tagger processes an unknown token.
> As a consequence, if the coveredText used for setting the default value is not
> present in the training model, an error occurs. Roughly speaking, the process
> first looks for the probability associated with the current token's
> coveredText; if the coveredText is not present in the model, it looks in the
> model for the probability of its longest suffix; and finally, if it does not
> find a match, the process assigns to the unknown coveredText the probability
> of the arbitrary coveredText "(".
> The problem is that if the probability of this coveredText is not available in
> the model, the probability of the unknown token is not defined and a null
> pointer exception occurs later when the variable is used.
> Why would the probability of the "(" text not be available in the model? In
> a large training corpus, if we consider all the tokens, there is little chance
> of not finding at least one occurrence of "(".
> Nevertheless, if we use the HMM training AE to build a model for predicting
> noun gender and number, or verb tense and person, or "being a part of" a named
> entity, the tokens won't include the "(" coveredText, and consequently an
> error will occur when the tagging is performed.
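The lookup order described in the issue can be reproduced with a small stand-alone sketch. It is a simplification under stated assumptions: the class, the method and the two maps (one for full tokens, one for suffixes) are hypothetical and do not mirror the actual Viterbi.java code.

```java
import java.util.HashMap;
import java.util.Map;

public class TokenLookupSketch {

    /**
     * Looks up tag probabilities for a token: exact form first, then the
     * longest known suffix, then the default token "(". Returns null when
     * even "(" is missing, which is the situation reported in the issue.
     */
    static Map<String, Double> lookup(Map<String, Map<String, Double>> wordProbs,
                                      Map<String, Map<String, Double>> suffixProbs,
                                      String token) {
        Map<String, Double> probs = wordProbs.get(token);
        if (probs != null) {
            return probs;
        }
        // Try proper suffixes from the longest to the shortest.
        for (int start = 1; start < token.length(); start++) {
            probs = suffixProbs.get(token.substring(start));
            if (probs != null) {
                return probs;
            }
        }
        // Last resort: the arbitrary default token, which may be absent.
        return wordProbs.get("(");
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> words = new HashMap<String, Map<String, Double>>();
        Map<String, Map<String, Double>> suffixes = new HashMap<String, Map<String, Double>>();

        Map<String, Double> plural = new HashMap<String, Double>();
        plural.put("pl", 0.8);
        suffixes.put("s", plural);

        // "chiens" is unknown as a word but ends with the known suffix "s".
        System.out.println(lookup(words, suffixes, "chiens"));
        // "xyz" matches nothing and the model has no "(" entry either.
        System.out.println(lookup(words, suffixes, "xyz"));
    }
}
```

In this sketch the second call returns a null reference; dereferencing such a result later is what produces the null pointer exception.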
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira