This is an automated email from the ASF dual-hosted git repository.
mawiesne pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/opennlp.git
The following commit(s) were added to refs/heads/main by this push:
new bc72f12c OPENNLP-1373 : Fix CLI examples for CoNLL-2003 on documentation (#497)
bc72f12c is described below
commit bc72f12c68088f30dfb1f4b872127eac61566ce4
Author: Atita Arora <[email protected]>
AuthorDate: Fri Feb 17 22:52:33 2023 +0100
OPENNLP-1373 : Fix CLI examples for CoNLL-2003 on documentation (#497)
---
opennlp-docs/src/docbkx/corpora.xml | 89 +++++++++++++++++++++----------------
1 file changed, 51 insertions(+), 38 deletions(-)
diff --git a/opennlp-docs/src/docbkx/corpora.xml b/opennlp-docs/src/docbkx/corpora.xml
index b21f61a6..39977146 100644
--- a/opennlp-docs/src/docbkx/corpora.xml
+++ b/opennlp-docs/src/docbkx/corpora.xml
@@ -280,13 +280,13 @@ path: .\es_ner_person.bin]]>
To convert the information to the OpenNLP format:
<screen>
<![CDATA[
-$ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data eng.train > corpus_train.txt]]>
+$ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data corpus_train.txt > eng.train]]>
</screen>
Optionally, you can convert the training test samples as well.
<screen>
<![CDATA[
-$ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data eng.testa > corpus_testa.txt
-$ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data eng.testb > corpus_testb.txt]]>
+$ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data corpus_testa.txt > eng.testa
+$ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data corpus_testb.txt > eng.testb]]>
</screen>
</para>
</section>
@@ -296,7 +296,7 @@ $ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data eng.testb
You can train the model for the name finder this way:
<screen>
<![CDATA[
-$ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin -iterations 500 \
+$ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin \
  -lang eng -types per -data eng.train -encoding utf8]]>
</screen>
</para>
@@ -304,40 +304,55 @@ $ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin -iterations 50
If you have converted the data, then you can train the model
for the name finder this way:
<screen>
<![CDATA[
-$ opennlp TokenNameFinderTrainer -model en_ner_person.bin -iterations 500 \
- -lang en -data corpus_train.txt -encoding utf8]]>
+$ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin \
+ -lang eng -types per -data corpus_train.txt -encoding utf8]]>
</screen>
</para>
<para>
Either way you should see the following output during the
training process:
<screen>
<![CDATA[
-Indexing events using cutoff of 5
+Indexing events with TwoPass using cutoff of 0
Computing event counts... done. 203621 events
Indexing... done.
-Sorting and merging events... done. Reduced 203621 events to 179409.
-Done indexing.
-Incorporating indexed data for training...
+Collecting events... Done indexing in 6,01 s.
+Incorporating indexed data for training...
done.
- Number of Event Tokens: 179409
+ Number of Event Tokens: 203621
Number of Outcomes: 3
- Number of Predicates: 58814
-...done.
+ Number of Predicates: 442041
Computing model parameters...
-Performing 500 iterations.
- 1: .. loglikelihood=-223700.5328318588 0.9453494482396216
- 2: .. loglikelihood=-40525.939777363084 0.9467933071736215
- 3: .. loglikelihood=-24893.98837874921 0.9598518816821447
- 4: .. loglikelihood=-18420.3379471033 0.9712996203731442
-... cut lots of iterations ...
-498: .. loglikelihood=-952.8501399442295 0.9988950059178572
-499: .. loglikelihood=-952.0600155746948 0.9988950059178572
-500: .. loglikelihood=-951.2722802086295 0.9988950059178572
-Writing name finder model ... done (1.638s)
+Performing 300 iterations.
+ 1: . (201717/203621) 0.9906492945226671
+ 2: . (202770/203621) 0.9958206668270955
+ 3: . (203129/203621) 0.9975837462737144
+ 4: . (203261/203621) 0.9982320094685715
+ 5: . (203381/203621) 0.9988213396457143
+ 6: . (203429/203621) 0.9990570717165714
+ 7: . (203454/203621) 0.9991798488368095
+ 8: . (203494/203621) 0.9993762922291906
+ 9: . (203509/203621) 0.9994499585013333
+ 10: . (203533/203621) 0.999567824536762
+ 20: . (203592/203621) 0.9998575785405238
+ 30: . (203613/203621) 0.9999607113215239
+Stopping: change in training set accuracy less than 1.0E-5
+Stats: (203621/203621) 1.0
+...done.
+
+Training data summary:
+#Sentences: 14041
+#Tokens: 203621
+#person entities: 6600
+
+Writing name finder model ... Compressed 442041 parameters to 29538
+4 outcome patterns
+done (0,395s)
Wrote name finder model to
-path: .\en_ner_person.bin]]>
+path: ./en_ner_person.bin
+
+Execution time: 11,498 seconds]]>
</screen>
</para>
</section>
@@ -356,29 +371,27 @@ $ opennlp TokenNameFinderEvaluator.conll03 -model en_ner_person.bin \
model.
<screen>
<![CDATA[
-$ opennlp TokenNameFinderEvaluator -model en_ner_person.bin -lang en -data corpus_testa.txt \
- -encoding utf8]]>
+$ opennlp TokenNameFinderEvaluator.conll03 -model en_ner_person.bin \
+ -lang eng -types per -data corpus_testa.txt -encoding utf8]]>
</screen>
</para>
<para>
Either way you should see the following output:
<screen>
<![CDATA[
-Loading Token Name Finder model ... done (0.359s)
-current: 190.2 sent/s avg: 190.2 sent/s total: 199 sent
-current: 648.3 sent/s avg: 415.9 sent/s total: 850 sent
-current: 530.1 sent/s avg: 453.6 sent/s total: 1380 sent
-current: 793.8 sent/s avg: 539.0 sent/s total: 2178 sent
-current: 705.4 sent/s avg: 571.9 sent/s total: 2882 sent
+Loading Token Name Finder model ... done (0,176s)
+current: 1805,4 sent/s avg: 1805,4 sent/s total: 1961 sent
+
+Average: 2298,1 sent/s
+Total: 3454 sent
+Runtime: 1.503s
-Average: 569.4 sent/s
-Total: 3251 sent
-Runtime: 5.71s
+Evaluated 3453 samples with 1617 entities; found: 1472 entities; correct: 1370.
+ TOTAL: precision: 93,07%; recall: 84,72%; F1: 88,70%.
+ person: precision: 93,07%; recall: 84,72%; F1: 88,70%. [target: 1617; tp: 1370; fp: 102]
-Precision: 0.9366247297154147
-Recall: 0.739956568946797
-F-Measure: 0.8267557582133971]]>
+Execution time: 1,955 seconds]]>
</screen>
</para>
</section>