[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791960#action_12791960
 ] 

Dennis Kubes commented on NUTCH-666:
------------------------------------

BTW, the reason we did this code, which we worked with an NLP firm to create, 
versus using the current Langauge identification tool in Nutch was accuracy.  
The current tool we were getting around 70% accuracy level while this new tool 
routinely came in above 99.5% accuracy.  We trained off of wikipedia and most 
of the errors we saw were english characters in other-language version of the 
training data.  

> Analysis plugins for multiple language and new Language Identifier Tool
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-666
>                 URL: https://issues.apache.org/jira/browse/NUTCH-666
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.1
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.1
>
>         Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch
>
>
> Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, 
> russian, and thai.  Also includes a new Language Identifier tool that used 
> the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to