[
https://issues.apache.org/jira/browse/OPENNLP-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716850#comment-16716850
]
J. Fiala commented on OPENNLP-1223:
-----------------------------------
License information:
[http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/index.html]
License
1. Research and evaluation purposes
For research and evaluation purposes, the TIGERCorpus can be downloaded for
free. However, we ask you to acknowledge the TIGERCorpus license agreement for
non-commercial use. The "Accept license terms" button at the bottom of the
license will then take you to the download page.
2. Commercial purposes
If you are interested in a commercial license of the TIGERCorpus, please
contact the secretary of Prof. Hans Uszkoreit's chair at Saarland University at
sek-hu AT coli DOT uni-saarland DOT de.
Please let me know whether we should contact the chair named above to ask if a
commercial license is also required when we supply only a model derived from
the corpus (and not the corpus itself).
> Add NameFinder model based on Tiger
> -----------------------------------
>
> Key: OPENNLP-1223
> URL: https://issues.apache.org/jira/browse/OPENNLP-1223
> Project: OpenNLP
> Issue Type: New Feature
> Components: language model
> Reporter: J. Fiala
> Priority: Major
> Attachments: tiger_2.2_namefinder.bin.7z,
> tiger_2.2_namefinder.testdata.txt,
> tiger_2.2_namefinder_all.bin_20181014.bin.7z, tiger_2.2_namefinder_eval.txt
>
>
> Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart -
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
>
> 1.) add model based on tiger (/)
> >>> generated from 6,271 sentences with tagged names (always given name +
> >>> surname).
> 2.) add a few test sentences (/)
> 3.) add small evaluation file (/)
>
> h3. Input data
> * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
> www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
> * yagoLabels.tsv.7z (Max Planck Institute)
> [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]
> h3. Basic workflow
> 1.) Extract sentences from the Tiger corpus that contain possible names (two
> consecutive words tagged as NE)
> 2.) Check whether each possible name includes a given name according to the
> YAGO labels database (the first word is assumed to be the given name)
> 3.) If the first word is listed in the YAGO labels as givenName, tag the
> person name
> 4.) Train with the full data set (50,472 sentences, including those without
> names)
> 5.) Evaluate with the person data set (6,271 sentences)
> >>> JF 14.10.: see updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z
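
A minimal sketch of steps 1 to 3 of the workflow above, not the code that
produced the attached model. It assumes each sentence has already been read
from the CoNLL09 file into (word, POS tag) pairs and that the YAGO given-name
labels have been loaded into a set. The function name tag_person_names, the
given_names set and the sample sentence are hypothetical. The output uses the
<START:person> ... <END> markup shown elsewhere in this issue.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word form, POS tag); "NE" marks a proper noun


def tag_person_names(sentence: List[Token], given_names: Set[str]) -> str:
    """Wrap two consecutive NE tokens in person markup when the first
    token is a known given name; leave all other tokens unchanged."""
    out = []
    i = 0
    while i < len(sentence):
        word, pos = sentence[i]
        # Step 1: a candidate name is two NE-tagged words in sequence.
        is_pair = (
            pos == "NE"
            and i + 1 < len(sentence)
            and sentence[i + 1][1] == "NE"
        )
        # Steps 2 and 3: tag only if the first word is a known given name.
        if is_pair and word in given_names:
            surname = sentence[i + 1][0]
            out.append(f"<START:person> {word} {surname} <END>")
            i += 2
        else:
            out.append(word)
            i += 1
    return " ".join(out)


# Hypothetical sample data, not taken from the Tiger corpus.
sample = [
    ("Gestern", "ADV"), ("sprach", "VVFIN"), ("Helmut", "NE"),
    ("Kohl", "NE"), ("in", "APPR"), ("Bonn", "NE"), (".", "$."),
]
tagged = tag_person_names(sample, given_names={"Helmut"})
print(tagged)
```

Because the sketch only ever looks at pairs, a three-word name is cut off
after the second word, which is the limitation listed under further
improvements.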
> h3. Open questions
> I first extracted 6,271 sentences mentioning names and trained on that
> (filtered) data. Is it better to train on the complete data instead
> (including the sentences without names)? (/)
> >>> JF 14.10.: added steps 4 + 5
> h3. Results
> Results from step 5 above:
> Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct:
> 7644.
> TOTAL: precision: 99,77%; recall: 99,80%; F1: 99,78%.
> person: precision: 99,77%; recall: 99,80%; F1: 99,78%. [target:
> 7659; tp: 7644; fp: 18]
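
For reference, the percentages above follow from the reported counts using the
standard definitions of precision, recall and F1. The short check below is only
a sketch; the function name precision_recall_f1 is hypothetical, and the counts
are copied from the evaluation output (found = tp + fp = 7662).

```python
def precision_recall_f1(tp: int, found: int, target: int):
    """Return (precision, recall, F1) from entity counts."""
    precision = tp / found    # correct entities / entities the model found
    recall = tp / target      # correct entities / entities in the reference
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=7644, found=7662, target=7659)
print(f"precision: {p:.2%}; recall: {r:.2%}; F1: {f1:.2%}")
# prints: precision: 99.77%; recall: 99.80%; F1: 99.78%
```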
>
> h3. Further Improvements:
> 1.) Some tagged names actually refer to locations and need to be refined
> (e.g. San Juan):
> Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint
> <START:person> Salvador Lopez <END> Gonzalez , das Oberhaupt von
> <START:person> San Juan <END> <START:person> Juan Chamula <END> , einem
> pittoresken Ort hoch in den Bergen von .).
> 2.) Add support for names with more than two words (e.g. Salvador Lopez
> Gonzalez above).
> 3.) Check for context-sensitive non-name matches (e.g. "General")