[
https://issues.apache.org/jira/browse/OPENNLP-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206195#comment-17206195
]
Rodrigo Agerri commented on OPENNLP-1309:
-----------------------------------------
Hello,
This is not a bug. Named Entity Recognition is learned contextually. Thus, if
the contexts change relative to those given for training, the predictions will
most likely change too. The model does not predict named entities by simply
memorizing the ones seen during training. As commented above, a unit test is
not the way to evaluate performance; evaluate on an unseen test set using an
objective metric such as the F1 score.
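To make the point about context concrete, here is a minimal sketch of
window-based features. It is a simplified, hypothetical feature extractor
(the class and method names are made up for illustration), not OpenNLP's
actual feature generator, but it shows why adding "and Mike" changes what the
model sees for "Alisa":

```java
import java.util.ArrayList;
import java.util.List;

class ContextFeatures {

    // Hypothetical, simplified window features: the token itself plus its
    // immediate left and right neighbours. OpenNLP's real generators are richer.
    static List<String> features(String[] tokens, int i) {
        List<String> feats = new ArrayList<>();
        feats.add("w=" + tokens[i].toLowerCase());
        feats.add("prev=" + (i > 0 ? tokens[i - 1].toLowerCase() : "<s>"));
        feats.add("next=" + (i < tokens.length - 1 ? tokens[i + 1].toLowerCase() : "</s>"));
        return feats;
    }

    public static void main(String[] args) {
        String[] s1 = {"Alisa", "appreciated", "the", "hint"};
        String[] s2 = {"Alisa", "and", "Mike", "appreciated", "the", "hint"};
        // Same token, different right-hand context, hence different features.
        System.out.println(features(s1, 0)); // [w=alisa, prev=<s>, next=appreciated]
        System.out.println(features(s2, 0)); // [w=alisa, prev=<s>, next=and]
    }
}
```

With a small training set, a single changed feature like this can be enough
to flip the predicted outcome for a token.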
If you would like to train the NameFinder for English, you can check the
OntoNotes 5.0 corpus, which is free. A number of other corpora are also free;
several are listed here:
[https://github.com/juand-r/entity-recognition-datasets]
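Once you have a held-out test set from one of these corpora, the F1
evaluation mentioned above could be sketched roughly as follows. This is a
dependency-free illustration of what the metric measures; the class name and
the "start-end-type" span encoding are assumptions made for the example.
OpenNLP ships its own evaluator (TokenNameFinderEvaluator), which is what you
would normally use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SpanF1 {

    // Spans are encoded as "start-end-type" strings (an assumption of this sketch).
    // A predicted span counts as correct only if it matches a gold span exactly.
    static double[] precisionRecallF1(Set<String> gold, Set<String> predicted) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(gold);
        double p = predicted.isEmpty() ? 0.0 : (double) correct.size() / predicted.size();
        double r = gold.isEmpty() ? 0.0 : (double) correct.size() / gold.size();
        double f1 = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f1};
    }

    public static void main(String[] args) {
        // Mirrors sentence (2): gold has Alisa (0-1) and Mike (2-3), the model found only Mike.
        Set<String> gold = new HashSet<>(Arrays.asList("0-1-person", "2-3-person"));
        Set<String> predicted = new HashSet<>(Arrays.asList("2-3-person"));
        double[] s = precisionRecallF1(gold, predicted);
        System.out.println("P=" + s[0] + " R=" + s[1] + " F1=" + s[2]);
    }
}
```

For this single sentence, precision is 1.0, recall is 0.5 and F1 is about
0.67. A real evaluation would accumulate the span counts over the whole test
set rather than score one sentence.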
Hope this helps,
R
> NameFinderME - Unexpected result using unchanged training data
> --------------------------------------------------------------
>
> Key: OPENNLP-1309
> URL: https://issues.apache.org/jira/browse/OPENNLP-1309
> Project: OpenNLP
> Issue Type: Bug
> Components: Name Finder
> Affects Versions: 1.9.2
> Reporter: Michael
> Priority: Major
>
>
> Hello,
> Based on
> [NameFinderMETest.java|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java]
> / function _testNameFinder()_, I have written some simple test code and changed
> the [test
> sentence|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java#L79]
> from *(1)*:
> {code:java}
> String[] sentence = {"Alisa",
> "appreciated",
> "the",
> "hint",
> "and",
> "enjoyed",
> "a",
> "delicious",
> "traditional",
> "meal."};
> {code}
> to *(2)*:
> {code:java}
> String[] sentence = {"Alisa",
> "and",
> "Mike",
> "appreciated",
> "the",
> "hint",
> "and",
> "enjoyed",
> "a",
> "delicious",
> "traditional",
> "meal."};
> {code}
> I only added "_and Mike_" and expected to get 2 results (the two names _Alisa_
> and _Mike_) because both names are annotated in the training data. However, I
> get only 1 result (Mike) for *(2)*. I used the training data file
> [AnnotatedSentences.txt|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/resources/opennlp/tools/namefind/AnnotatedSentences.txt]
> (unchanged).
> Can anyone tell me what's wrong? Thanks.
> h3. +Test code:+
>
> {code:java}
> // Needs: java.io.File, java.util.Collections, opennlp.tools.namefind.*, opennlp.tools.util.*
> // Training data: the unchanged AnnotatedSentences.txt.
> String trainingDataFilePath = "opennlp/tools/namefind/AnnotatedSentences.txt";
> String encoding = "ISO-8859-1";
> ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
>     new PlainTextByLineStream(
>         new MarkableFileInputStreamFactory(new File(trainingDataFilePath)), encoding));
>
> TrainingParameters params = new TrainingParameters();
> params.put(TrainingParameters.ITERATIONS_PARAM, 70);
> params.put(TrainingParameters.CUTOFF_PARAM, 1);
> TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream, params,
>     TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
> TokenNameFinder nameFinder = new NameFinderME(nameFinderModel);
>
> // Now test whether it can detect the names in the sample sentence.
> String[] sentence = {"Alisa", "and", "Mike", "appreciated", "the", "hint", "and",
>     "enjoyed", "a", "delicious", "traditional", "meal."};
> Span[] names = nameFinder.find(sentence);
> if (names != null && names.length != 0) {
>     System.out.println(" > Found [" + names.length + "] results");
>     int resultNumber = 1;
>     for (Span name : names) {
>         // Join the tokens covered by the span to recover the name text.
>         StringBuilder personName = new StringBuilder();
>         for (int i = name.getStart(); i < name.getEnd(); i++) {
>             personName.append(sentence[i]).append(" ");
>         }
>         // Number each result instead of always printing 1.
>         System.out.println(" > Result " + (resultNumber++) + ": Type: [" + name.getType()
>             + "] : PersonName: [" + personName + "]\t [probability=" + name.getProb() + "]");
>     }
> } else {
>     System.out.println(" > No results found");
> }
> {code}
>
>
> h3. +Result for (1):+
> Indexing events with TwoPass using cutoff of 1
> Computing event counts... done. 1392 events
> Indexing... done.
> Collecting events... Done indexing in 0.22 s.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 1392
> Number of Outcomes: 3
> Number of Predicates: 9164
> Computing model parameters...
> Performing 70 iterations.
> 1: . (1355/1392) 0.9734195402298851
> 2: . (1383/1392) 0.9935344827586207
> 3: . (1390/1392) 0.9985632183908046
> 4: . (1390/1392) 0.9985632183908046
> 5: . (1391/1392) 0.9992816091954023
> 6: . (1392/1392) 1.0
> 7: . (1392/1392) 1.0
> 8: . (1392/1392) 1.0
> 9: . (1392/1392) 1.0
> Stopping: change in training set accuracy less than 1.0E-5
> Stats: (1392/1392) 1.0
> ...done.
> *Found [1] results*
> *Result 1: Type: [default] : PersonName: [Alisa ] [probability=0.5483001511243855]*
> h3. +Result for (2):+
> Indexing events with TwoPass using cutoff of 1
> Computing event counts... done. 1392 events
> Indexing... done.
> Collecting events... Done indexing in 0.22 s.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 1392
> Number of Outcomes: 3
> Number of Predicates: 9164
> Computing model parameters...
> Performing 70 iterations.
> 1: . (1355/1392) 0.9734195402298851
> 2: . (1383/1392) 0.9935344827586207
> 3: . (1390/1392) 0.9985632183908046
> 4: . (1390/1392) 0.9985632183908046
> 5: . (1391/1392) 0.9992816091954023
> 6: . (1392/1392) 1.0
> 7: . (1392/1392) 1.0
> 8: . (1392/1392) 1.0
> 9: . (1392/1392) 1.0
> Stopping: change in training set accuracy less than 1.0E-5
> Stats: (1392/1392) 1.0
> ...done.
> *Found [1] results*
> *Result 1: Type: [default] : PersonName: [Mike ] [probability=0.460685209028902]*