[ https://issues.apache.org/jira/browse/OPENNLP-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Kottmann closed OPENNLP-172.
---------------------------------

    Resolution: Fixed

> Replace the regex token class feature generation with the Character 
> class/unicode based implementation
> ------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-172
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-172
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Name Finder
>    Affects Versions: tools-1.5.1-incubating
>            Reporter: Jörn Kottmann
>            Assignee: Jörn Kottmann
>            Priority: Minor
>             Fix For: tools-1.5.2-incubating
>
>
> The token class feature is currently computed with regular expressions. 
> These regular expressions do not detect all-letter sequences correctly when 
> the sequences contain letters other than A to Z. The new token class feature 
> method uses Unicode character classes to detect letters, which works better 
> and is faster.
> The old regular-expression-based token class feature computation should be 
> replaced with the new, faster token class method.
> An evaluation on our Spanish data showed that this change reduces the recall 
> of the existing Spanish person model by 2%, while precision stays identical. 
> When the model is retrained with this fix applied, recall increases by 6% 
> and precision is still identical.
> Recall and precision are identical on my test data for English, because it 
> usually does not contain "special" characters.
> The speedup of the name finder will be roughly 10%.
> In a measurement on the Leipzig corpus with 300K sentences, throughput 
> increased from 556 sent/s to 618 sent/s.
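
The difference described above can be illustrated with a minimal sketch. It
is not the OpenNLP implementation: the class name, the method names, and the
ASCII-only pattern are hypothetical stand-ins chosen for illustration. The
regex check rejects tokens such as "José", while the check based on
java.lang.Character accepts any Unicode letter and avoids the regex matcher.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative and not taken from OpenNLP.
final class TokenClassSketch {

    // Assumed ASCII-only pattern standing in for the old regex approach.
    private static final Pattern ASCII_LETTERS = Pattern.compile("[A-Za-z]+");

    // Regex-based check: only tokens made of A-Z / a-z count as all letters.
    static boolean isAllLettersRegex(String token) {
        return ASCII_LETTERS.matcher(token).matches();
    }

    // Character-class-based check: every code point must be a Unicode letter.
    static boolean isAllLettersUnicode(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int codePoint = token.codePointAt(i);
            if (!Character.isLetter(codePoint)) {
                return false;
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return true;
    }
}
```

With this sketch, isAllLettersRegex("Jos\u00e9") returns false while
isAllLettersUnicode("Jos\u00e9") returns true; both methods agree on
plain ASCII tokens such as "Madrid" and on mixed tokens such as "B2B".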

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
