[jira] [Commented] (OPENNLP-1309) NameFinderME - Unexpected result using unchanged training data

Jeffrey T. Zemerick (Jira) Thu, 24 Sep 2020 07:44:29 -0700


    [ https://issues.apache.org/jira/browse/OPENNLP-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201571#comment-17201571 ]


Jeffrey T. Zemerick commented on OPENNLP-1309:
----------------------------------------------

Yes, that line means the training-set accuracy changed by less than the 
1.0E-5 threshold over the last iterations, so OpenNLP ended training early. 
With more training data it likely would have kept going for many more 
iterations.
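
As a rough illustration of that stopping rule, here is a minimal sketch. It is 
not OpenNLP's code: the class and method names are made up, and I'm assuming 
the trainer stops once the current accuracy is within the tolerance of each of 
the three preceding iterations, which is consistent with the log in the issue 
(accuracy is flat from iteration 6 and training stops after iteration 9).

```java
import java.util.Arrays;

/** Hypothetical sketch of an accuracy-based early-stopping check; not OpenNLP's code. */
final class EarlyStopSketch {

    /**
     * Returns the 1-based iteration at which training would stop, or
     * accuracies.length if the rule never fires.
     * Assumed rule: stop once the current accuracy is within `tolerance`
     * of each of the three preceding iterations.
     */
    static int stoppingIteration(double[] accuracies, double tolerance) {
        for (int i = 3; i < accuracies.length; i++) {
            double current = accuracies[i];
            boolean flat = Math.abs(accuracies[i - 1] - current) < tolerance
                    && Math.abs(accuracies[i - 2] - current) < tolerance
                    && Math.abs(accuracies[i - 3] - current) < tolerance;
            if (flat) {
                return i + 1; // convert 0-based index to 1-based iteration
            }
        }
        return accuracies.length;
    }

    public static void main(String[] args) {
        // Accuracies for iterations 1-5 are copied from the log in the issue.
        // Everything from iteration 6 on is set to 1.0; beyond iteration 9
        // that padding is an assumption, since the real run stopped there.
        double[] accuracies = new double[70];
        Arrays.fill(accuracies, 1.0);
        double[] logged = {0.9734195402298851, 0.9935344827586207,
                0.9985632183908046, 0.9985632183908046, 0.9992816091954023};
        System.arraycopy(logged, 0, accuracies, 0, logged.length);

        System.out.println("Would stop after iteration "
                + stoppingIteration(accuracies, 1.0E-5));
    }
}
```

With the accuracies from the log this reports iteration 9, the same point at 
which the "Stopping" line appears. Note that a flat training-set accuracy of 
1.0 only says the model fits the training data, not that it generalizes.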

The best way to gauge reliability is to evaluate the model you have trained. 
The evaluation outputs scores for the model's precision and recall as well as 
the combined F-measure. When training a model, split the annotated text, for 
example 80/20, so that 80% of the text is used for training and 20% is held 
back and used for evaluation. After training and evaluating the model, you can 
adjust the training parameters or the training text itself to work toward 
improving the evaluation metrics.
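
To make the split and the metrics concrete, here is a minimal sketch in plain 
Java. The class and method names are made up for illustration and are not part 
of OpenNLP, and names are compared as simple "start-end" token-span strings, 
which is my own simplification. For real numbers, use OpenNLP's evaluation 
tooling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical helper for illustration; none of these names come from OpenNLP. */
final class HoldoutEvalSketch {

    /** Shuffles a copy of the samples and returns [training part, held-out part]. */
    static <T> List<List<T>> split(List<T> samples, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(samples);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    /** Returns {precision, recall, F1} for predicted vs. reference annotations. */
    static double[] score(Set<String> reference, Set<String> predicted) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(reference); // predictions that match a reference span
        double precision = predicted.isEmpty()
                ? 0.0 : (double) correct.size() / predicted.size();
        double recall = reference.isEmpty()
                ? 0.0 : (double) correct.size() / reference.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0 : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        // Sentence (2) from the issue: "Alisa" is tokens 0-1, "Mike" is tokens 2-3.
        Set<String> reference = new HashSet<>(Arrays.asList("0-1", "2-3"));
        Set<String> predicted = new HashSet<>(Arrays.asList("2-3")); // only Mike found
        double[] s = score(reference, predicted);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```

For sentence (2) in the issue, where only "Mike" is found, this gives a 
precision of 1.0, a recall of 0.5, and an F-measure of about 0.67. A single 
sentence is of course far too small a test set; the point of the 80/20 split 
is to compute these scores over text the model has not seen during training.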

Here's an example of doing an evaluation: 
[https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.namefind.eval]

Getting sufficient annotated text data is often the biggest challenge in NLP. 
It's tedious to create yourself and hard to find elsewhere. I don't know if I 
have any larger samples, but I will check.

 

> NameFinderME - Unexpected result using unchanged training data
> --------------------------------------------------------------
>
>                 Key: OPENNLP-1309
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1309
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: 1.9.2
>            Reporter: Michael
>            Priority: Major
>
>  
> Hello,
> Based on 
> [NameFinderMETest.java|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java]
>  / function _testNameFinder()_, I have written a simple test code and changed 
> the [test 
> sentence|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java#L79]
>  from *(1)*:
> {code:java}
> String[] sentence = {"Alisa",
>  "appreciated",
>  "the",
>  "hint",
>  "and",
>  "enjoyed",
>  "a",
>  "delicious",
>  "traditional",
>  "meal."};
> {code}
> to *(2)*:
> {code:java}
> String[] sentence = {"Alisa",
>  "and",
>  "Mike",
>  "appreciated",
>  "the",
>  "hint",
>  "and",
>  "enjoyed",
>  "a",
>  "delicious",
>  "traditional",
>  "meal."};
> {code}
> (Just added "_and Mike_") and expected to get 2 results (two names _Alisa_ 
> and _Mike_) because both names are annotated in the training data. I just get 
> 1 result (Mike) for *(2)*. I used the training data file 
> [AnnotatedSentences.txt|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/resources/opennlp/tools/namefind/AnnotatedSentences.txt]
>   (unchanged).
> Can anyone tell me what's wrong? Thanks.
> h3. +Test code:+
>  
> {code:java}
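> import java.io.File;
> import java.util.Collections;
>
> import opennlp.tools.namefind.BioCodec;
> import opennlp.tools.namefind.NameFinderME;
> import opennlp.tools.namefind.NameSample;
> import opennlp.tools.namefind.NameSampleDataStream;
> import opennlp.tools.namefind.TokenNameFinder;
> import opennlp.tools.namefind.TokenNameFinderFactory;
> import opennlp.tools.namefind.TokenNameFinderModel;
> import opennlp.tools.util.MarkableFileInputStreamFactory;
> import opennlp.tools.util.ObjectStream;
> import opennlp.tools.util.PlainTextByLineStream;
> import opennlp.tools.util.Span;
> import opennlp.tools.util.TrainingParameters;
>
> // Read the (unchanged) annotated training sentences, one sample per line.
> String trainingDatafilePath = "opennlp/tools/namefind/AnnotatedSentences.txt";
> String encoding = "ISO-8859-1";
> ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
>     new PlainTextByLineStream(
>         new MarkableFileInputStreamFactory(new File(trainingDatafilePath)), encoding));
>
> // Train with up to 70 iterations and a feature cutoff of 1.
> TrainingParameters params = new TrainingParameters();
> params.put(TrainingParameters.ITERATIONS_PARAM, 70);
> params.put(TrainingParameters.CUTOFF_PARAM, 1);
> TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream, params,
>     TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
> TokenNameFinder nameFinder = new NameFinderME(nameFinderModel);
>
> // now test if it can detect the sample sentences
> String[] sentence = {"Alisa",
>     "and",
>     "Mike",
>     "appreciated",
>     "the",
>     "hint",
>     "and",
>     "enjoyed",
>     "a",
>     "delicious",
>     "traditional",
>     "meal."};
> Span[] names = nameFinder.find(sentence);
> if (names != null && names.length != 0) {
>   System.out.println(" > Found [" + names.length + "] results");
>   int resultNumber = 1; // running index of the printed result
>   for (Span name : names) {
>     // Rebuild the name from the tokens covered by the span.
>     StringBuilder personName = new StringBuilder();
>     for (int i = name.getStart(); i < name.getEnd(); i++) {
>       personName.append(sentence[i]).append(" ");
>     }
>     System.out.println(" > Result " + resultNumber + ": Type: [" + name.getType()
>         + "] : PersonName: [" + personName + "]\t [probability=" + name.getProb() + "]");
>     resultNumber++;
>   }
> } else {
>   System.out.println(" > No results found");
> }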
> {code}
>  
>  
> h3. +Result for (1):+
> Indexing events with TwoPass using cutoff of 1
>  Computing event counts... done. 1392 events
>  Indexing... done.
>  Collecting events... Done indexing in 0.22 s.
>  Incorporating indexed data for training... 
>  done.
>  Number of Event Tokens: 1392
>  Number of Outcomes: 3
>  Number of Predicates: 9164
>  Computing model parameters...
>  Performing 70 iterations.
>  1: . (1355/1392) 0.9734195402298851
>  2: . (1383/1392) 0.9935344827586207
>  3: . (1390/1392) 0.9985632183908046
>  4: . (1390/1392) 0.9985632183908046
>  5: . (1391/1392) 0.9992816091954023
>  6: . (1392/1392) 1.0
>  7: . (1392/1392) 1.0
>  8: . (1392/1392) 1.0
>  9: . (1392/1392) 1.0
>  Stopping: change in training set accuracy less than 1.0E-5
>  Stats: (1392/1392) 1.0
>  ...done.
> *Found [1] results*
>  *Result 1: Type: [default] : PersonName: [Alisa ] 
> [probability=0.5483001511243855]*
> h3. +Result for (2):+
> Indexing events with TwoPass using cutoff of 1
>  Computing event counts... done. 1392 events
>  Indexing... done.
>  Collecting events... Done indexing in 0.22 s.
>  Incorporating indexed data for training... 
>  done.
>  Number of Event Tokens: 1392
>  Number of Outcomes: 3
>  Number of Predicates: 9164
>  Computing model parameters...
>  Performing 70 iterations.
>  1: . (1355/1392) 0.9734195402298851
>  2: . (1383/1392) 0.9935344827586207
>  3: . (1390/1392) 0.9985632183908046
>  4: . (1390/1392) 0.9985632183908046
>  5: . (1391/1392) 0.9992816091954023
>  6: . (1392/1392) 1.0
>  7: . (1392/1392) 1.0
>  8: . (1392/1392) 1.0
>  9: . (1392/1392) 1.0
>  Stopping: change in training set accuracy less than 1.0E-5
>  Stats: (1392/1392) 1.0
>  ...done.
> *Found [1] results*
>  *Result 1: Type: [default] : PersonName: [Mike ] 
> [probability=0.460685209028902]*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
