[ 
https://issues.apache.org/jira/browse/OPENNLP-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201120#comment-17201120
 ] 

Jeffrey T. Zemerick commented on OPENNLP-1309:
----------------------------------------------

 Hi [~micha2017], the inconsistency is most likely due to the small amount of 
training data in that example. That training data file suffices for OpenNLP's 
unit tests but is unfortunately too small to train a reliable name finder. Note 
in the training log that training ended early, after 9 of the 70 requested 
iterations, because the training set accuracy had stopped changing. Stopping 
that early is common with small training sets.
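
To make the early stop concrete, here is a minimal sketch that replays the accuracies from the training log quoted below against a simple stopping rule. The rule used here (stop once the accuracy differs from each of the previous three iterations by less than the tolerance) is an assumption chosen because it matches the log; it is not copied from OpenNLP's trainer. The class and method names are hypothetical.

```java
class EarlyStopSketch {

    /**
     * Returns the 1-based iteration at which the assumed stopping rule fires,
     * or -1 if it never fires for the given accuracies.
     * Assumed rule: the current accuracy is within `tolerance` of each of the
     * three preceding accuracies.
     */
    static int stoppingIteration(double[] accuracies, double tolerance) {
        for (int i = 3; i < accuracies.length; i++) {
            double current = accuracies[i];
            boolean stable = Math.abs(accuracies[i - 1] - current) < tolerance
                    && Math.abs(accuracies[i - 2] - current) < tolerance
                    && Math.abs(accuracies[i - 3] - current) < tolerance;
            if (stable) {
                return i + 1; // convert array index to 1-based iteration number
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Training set accuracies for iterations 1..9, taken from the quoted log.
        double[] accuracies = {
            0.9734195402298851, 0.9935344827586207, 0.9985632183908046,
            0.9985632183908046, 0.9992816091954023, 1.0, 1.0, 1.0, 1.0
        };
        int stop = stoppingIteration(accuracies, 1.0E-5);
        System.out.println("Stopping rule fires at iteration " + stop + " of 70");
    }
}
```

With only 1392 events the model fits the training set perfectly by iteration 6, so the rule fires at iteration 9. A training set accuracy of 1.0 says nothing about how well the model handles unseen sentences such as the modified one.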

> NameFinderME - Unexpected result using unchanged training data
> --------------------------------------------------------------
>
>                 Key: OPENNLP-1309
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1309
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: 1.9.2
>            Reporter: Michael
>            Priority: Major
>
>  
> Hello,
> Based on 
> [NameFinderMETest.java|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java]
>  / function _testNameFinder()_, I have written a simple test code and changed 
> the [test 
> sentence|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java#L79]
>  from *(1)*:
> {code:java}
> String[] sentence = {"Alisa",
>  "appreciated",
>  "the",
>  "hint",
>  "and",
>  "enjoyed",
>  "a",
>  "delicious",
>  "traditional",
>  "meal."};
> {code}
> to *(2)*:
> {code:java}
> String[] sentence = {"Alisa",
>  "and",
>  "Mike",
>  "appreciated",
>  "the",
>  "hint",
>  "and",
>  "enjoyed",
>  "a",
>  "delicious",
>  "traditional",
>  "meal."};
> {code}
> I only added "_and Mike_" and expected to get 2 results (the two names _Alisa_ 
> and _Mike_), because both names are annotated in the training data. Instead, I 
> get only 1 result (Mike) for *(2)*. I used the training data file 
> [AnnotatedSentences.txt|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/resources/opennlp/tools/namefind/AnnotatedSentences.txt]
>   (unchanged).
> Can anyone tell me what's wrong? Thanks.
> h3. +Test code:+
>  
> {code:java}
> String trainingDatafilePath = "opennlp/tools/namefind/AnnotatedSentences.txt";
> String encoding = "ISO-8859-1";
> // The path above already ends with the file name, so it is used as-is.
>  ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
>  new PlainTextByLineStream(
>  new MarkableFileInputStreamFactory(new File(trainingDatafilePath)), encoding));
>  
>  TrainingParameters params = new TrainingParameters();
>  params.put(TrainingParameters.ITERATIONS_PARAM, 70);
>  params.put(TrainingParameters.CUTOFF_PARAM, 1);
> TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, 
> sampleStream,
>  params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), 
> new BioCodec()));
> TokenNameFinder nameFinder = new NameFinderME(nameFinderModel);
> // now test if it can detect the sample sentences
>  String[] sentence = {"Alisa",
>  "and",
>  "Mike",
>  "appreciated",
>  "the",
>  "hint",
>  "and",
>  "enjoyed",
>  "a",
>  "delicious",
>  "traditional",
>  "meal."};
> Span[] names = nameFinder.find(sentence);
>  if (names != null && names.length != 0) {
>  System.out.println(" > Found ["+names.length+"] results");
>  int resultNumber = 1; // running index so each result gets its own number
>  for(Span name : names){
>  String personName="";
>  for(int i=name.getStart(); i<name.getEnd(); i++){
>  personName+=sentence[i]+" ";
>  }
>  System.out.println(" > Result "+resultNumber+": Type: ["+name.getType()+"] : "
>  +"PersonName: ["+personName+"]\t [probability="+name.getProb()+"]");
>  resultNumber++;
>  }
>  } else {
>  System.out.println(" > No results found");
>  }
> {code}
>  
>  
> h3. +Result for (1):+
> Indexing events with TwoPass using cutoff of 1
>  Computing event counts... done. 1392 events
>  Indexing... done.
>  Collecting events... Done indexing in 0.22 s.
>  Incorporating indexed data for training... 
>  done.
>  Number of Event Tokens: 1392
>  Number of Outcomes: 3
>  Number of Predicates: 9164
>  Computing model parameters...
>  Performing 70 iterations.
>  1: . (1355/1392) 0.9734195402298851
>  2: . (1383/1392) 0.9935344827586207
>  3: . (1390/1392) 0.9985632183908046
>  4: . (1390/1392) 0.9985632183908046
>  5: . (1391/1392) 0.9992816091954023
>  6: . (1392/1392) 1.0
>  7: . (1392/1392) 1.0
>  8: . (1392/1392) 1.0
>  9: . (1392/1392) 1.0
>  Stopping: change in training set accuracy less than 1.0E-5
>  Stats: (1392/1392) 1.0
>  ...done.
> *Found [1] results*
>  *Result 1: Type: [default] : PersonName: [Alisa ] 
> [probability=0.5483001511243855]*
> h3. +Result for (2):+
> Indexing events with TwoPass using cutoff of 1
>  Computing event counts... done. 1392 events
>  Indexing... done.
>  Collecting events... Done indexing in 0.22 s.
>  Incorporating indexed data for training... 
>  done.
>  Number of Event Tokens: 1392
>  Number of Outcomes: 3
>  Number of Predicates: 9164
>  Computing model parameters...
>  Performing 70 iterations.
>  1: . (1355/1392) 0.9734195402298851
>  2: . (1383/1392) 0.9935344827586207
>  3: . (1390/1392) 0.9985632183908046
>  4: . (1390/1392) 0.9985632183908046
>  5: . (1391/1392) 0.9992816091954023
>  6: . (1392/1392) 1.0
>  7: . (1392/1392) 1.0
>  8: . (1392/1392) 1.0
>  9: . (1392/1392) 1.0
>  Stopping: change in training set accuracy less than 1.0E-5
>  Stats: (1392/1392) 1.0
>  ...done.
> *Found [1] results*
>  *Result 1: Type: [default] : PersonName: [Mike ] 
> [probability=0.460685209028902]*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
