[ 
https://issues.apache.org/jira/browse/OPENNLP-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J. Fiala updated OPENNLP-1223:
------------------------------
    Description: 
Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart - 
www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)

 

1.) add model based on tiger (/)

>>> generated from 6,271 sentences with tagged names (always given name + surname).

2.) add a few test sentences (/)

3.) add small evaluation file (/)

 
h3. Input data
 * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
 www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
 * yagoLabels.tsv.7z (Max Planck Institute)
 
[https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]

h3. Basic workflow

1.) Extract sentences from the Tiger corpus that contain possible names (two 
consecutive words tagged as NE)

2.) Check whether each possible name includes a given name according to the 
YAGO labels database (the first word is assumed to be the given name)

3.) If the given name is listed in the YAGO labels as givenName, tag the pair 
as a person name

4.) Train with the full data set (50,472 sentences, including sentences 
without names)

5.) Evaluate with the person data set (6,271 sentences)
>>> JF 14.10.: see updated model: tiger_2.2_namefinder_all.bin_20181014.bin.7z
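Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration, not the script used for the attached models: it assumes each sentence is already available as a list of (form, POS) pairs read from the CoNLL09 file, and that the YAGO givenName labels have been loaded into a plain set. The function name tag_person_names and the set given_names are hypothetical.

```python
def tag_person_names(tokens, given_names):
    """Annotate two-word person names in one sentence.

    tokens: list of (form, pos) pairs (assumed input shape).
    given_names: set of given names (hypothetical stand-in for the
    YAGO givenName labels).
    Returns the sentence in OpenNLP name finder training format.
    """
    out = []
    i = 0
    while i < len(tokens):
        form, pos = tokens[i]
        is_pair = (
            pos == "NE"
            and i + 1 < len(tokens)
            and tokens[i + 1][1] == "NE"
        )
        # Steps 2 + 3: the first word of the pair must be a known given name.
        if is_pair and form in given_names:
            out += ["<START:person>", form, tokens[i + 1][0], "<END>"]
            i += 2
        else:
            out.append(form)
            i += 1
    return " ".join(out)


sentence = [
    ("meint", "VVFIN"),
    ("Salvador", "NE"),
    ("Lopez", "NE"),
    ("Gonzalez", "NE"),
    (",", "$,"),
]
print(tag_person_names(sentence, {"Salvador"}))
```

Because only two consecutive words are considered, a third name part such as "Gonzalez" stays outside the tagged span, which matches the limitation listed under Further Improvements.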
h3. Open questions

I first extracted the 6,271 sentences that mention names and trained on that 
(filtered) data only. Or is it better to train on the complete data (including 
the sentences without names)? (/)

>>> JF 14.10.: added steps 4 + 5
h3. Results

Results from step 5 above:

Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct: 7644.
        TOTAL: precision:   99,77%;  recall:   99,80%; F1:   99,78%.
       person: precision:   99,77%;  recall:   99,80%; F1:   99,78%. [target: 
7659; tp: 7644; fp:  18]
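
The percentages can be reproduced from the counts in the evaluator output. The sketch below uses the standard definitions of precision, recall and F1; the helper name prf is only for illustration.

```python
def prf(true_positives, found, target):
    """Return (precision, recall, F1) from entity counts."""
    precision = true_positives / found    # correct / entities found
    recall = true_positives / target      # correct / entities in reference
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the evaluation above: target 7659, found 7662, correct 7644.
p, r, f = prf(true_positives=7644, found=7662, target=7659)
print(f"precision: {p:.2%}; recall: {r:.2%}; F1: {f:.2%}")
# prints: precision: 99.77%; recall: 99.80%; F1: 99.78%
```

The 18 false positives are the difference between entities found (7662) and correct entities (7644).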

 
h3. Further Improvements:

1.) Some tagged names may actually refer to locations and need to be refined 
(e.g. San Juan):

Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint <START:person> 
Salvador Lopez <END>Gonzalez , das Oberhaupt von <START:person> San Juan <END> 
<START:person> Juan Chamula <END> , einem pittoresken Ort hoch in den Bergen 
von .).

2.) Add support for names with more than two words (e.g. Salvador Lopez 
Gonzalez above).

3.) Check for context-sensitive non-name matches (e.g. "General")
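
One possible approach to improvement 2 is to tag the whole run of consecutive NE tokens instead of a fixed pair. The sketch below is an assumption about how this could be done, not an implemented feature: the input shape (a list of (form, POS) pairs), the function name tag_name_runs and the set given_names are all hypothetical. It does not address improvements 1 and 3.

```python
def tag_name_runs(tokens, given_names):
    """Tag maximal runs of NE tokens that start with a known given name.

    tokens: list of (form, pos) pairs (assumed input shape).
    given_names: set of given names (hypothetical lookup).
    """
    out = []
    i = 0
    while i < len(tokens):
        # Find the end of the run of consecutive NE tokens starting at i.
        j = i
        while j < len(tokens) and tokens[j][1] == "NE":
            j += 1
        run = [form for form, _ in tokens[i:j]]
        if len(run) >= 2 and run[0] in given_names:
            out += ["<START:person>", *run, "<END>"]
            i = j
        elif run:
            # NE run without a leading given name: leave it untagged.
            out += run
            i = j
        else:
            out.append(tokens[i][0])
            i += 1
    return " ".join(out)


sentence = [
    ("meint", "VVFIN"),
    ("Salvador", "NE"),
    ("Lopez", "NE"),
    ("Gonzalez", "NE"),
    (",", "$,"),
]
print(tag_name_runs(sentence, {"Salvador"}))
```

With this variant the three-word name is tagged as a single span, while an NE run whose first word is not a known given name is left untagged.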

  was:
Add NameFinder model based on the Tiger treebank 2.2 (Universität Stuttgart - 
www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)

 

1.) add model based on tiger (/)

>>> generated based on 6.271 sentences with tagged names (always given name + 
>>> surname).

2.) add a few test sentences (/)

3.) add small evaluation file (/)

 
h3. Input data
 * tigercorpus-2.2.conll09.tar.gz (Uni Stuttgart)
 www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html
 * yagoLabels.tsv.7z (Max Planck Institute)
 
[https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/]

h3. Basic workflow

1.) Extract sentences in the tiger database with possible names (two words in 
sequence tagged as NE)

2.) Check if possible names include a given name based on the YAGO labels 
database (given name is assumed as first name)

3.) If given name is included in YAGO labels as givenName, then tag the person 
name

4.) Train with full data set (50.472 sentences - including non-names)

5.) Evaluate with person data set (6.271 sentences)
h3. Open questions

I first extracted 6.271 sentences mentioning names and trained based on that 
(filtered) data. Or is it better to use the complete training data (including 
the sentences without names)? (/)

>>> JF 14.10.: added steps 4 + 5
h3. Results

Results from step 5 above:

Evaluated 6271 samples with 7659 entities; found: 7662 entities; correct: 7644.
       TOTAL: precision:   99,77%;  recall:   99,80%; F1:   99,78%.
      person: precision:   99,77%;  recall:   99,80%; F1:   99,78%. [target: 
7659; tp: 7644; fp:  18]

 
h3. Further Improvements:

1.) There may be some names which are referring to locations which have to be 
refined (e.g. San Juan):

Fünf bis sechs Stunden , damit sie zur Besinnung kommen , meint <START:person> 
Salvador Lopez <END>Gonzalez , das Oberhaupt von <START:person> San Juan <END> 
<START:person> Juan Chamula <END> , einem pittoresken Ort hoch in den Bergen 
von .).

2.) Add support for names with more than two words (e.g. Salvador Lopez 
Gonzalez above).

3.) Check for context-sensitive non-name matches (e.g. "General")


> Add NameFinder model based on Tiger
> -----------------------------------
>
>                 Key: OPENNLP-1223
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1223
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: language model
>            Reporter: J. Fiala
>            Priority: Major
>         Attachments: tiger_2.2_namefinder.bin.7z, 
> tiger_2.2_namefinder.testdata.txt, 
> tiger_2.2_namefinder_all.bin_20181014.bin.7z, tiger_2.2_namefinder_eval.txt
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
