[jira] [Commented] (OPENNLP-1270) Add new languages to the language detector

Tim Allison (Jira) Mon, 29 Mar 2021 12:36:04 -0700


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310946#comment-17310946
 ]


Tim Allison commented on OPENNLP-1270:
--------------------------------------

We recently received a request in TIKA-3340 to add detection of Burmese.  
Leipzig has only a small amount of Burmese data (~2k sentences), but the 
script is distinct, and the Leipzig data did not appear to contain much 
English or text in other languages.
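
A contamination check of this kind could be sketched as below. This is only 
an illustration, not the procedure actually used: the class name ScriptShare 
and the method scriptShare are hypothetical, and the only library call it 
relies on is the JDK's Character.UnicodeScript. It computes the fraction of 
letters in a string that belong to a given Unicode script, so a Burmese 
sentence file with little English mixed in should score close to 1.0 for 
MYANMAR.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of OpenNLP or Tika.
final class ScriptShare {

    private ScriptShare() {
    }

    // Returns the fraction of letter code points in `text` that belong to
    // `script`. Combining marks, digits, punctuation and whitespace are
    // ignored. Returns 0.0 when the text contains no letters at all.
    static double scriptShare(String text, UnicodeScript script) {
        long letters = text.codePoints()
                .filter(Character::isLetter)
                .count();
        if (letters == 0) {
            return 0.0;
        }
        long inScript = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> UnicodeScript.of(cp) == script)
                .count();
        return (double) inScript / letters;
    }
}
```

Calling ScriptShare.scriptShare(sentence, UnicodeScript.MYANMAR) on each line 
of a *-sentences.txt file and flagging lines below some threshold would be one 
way to spot stray text in another script; the threshold itself would be a 
judgment call.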

The link above no longer works, but there I had unpacked the following 
Leipzig languages that OpenNLP does not currently cover: amh asm azj ban div 
hat mhr ori tuk uig xho yid.

For Tika, I rebuilt a language detection model on the original OpenNLP 
Leipzig data plus these extra languages.

If I redo the work to unpack the above languages and Burmese, are there any 
objections to my committing them to: 
https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data ?

Thank you!

> Add new languages to the language detector
> ------------------------------------------
>
>                 Key: OPENNLP-1270
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1270
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.9.4
>
>         Attachments: report.txt, report.txt
>
>
> Leipzig has several other languages that might be useful to add to the 
> language detector.  I've selected some with > 10k sentences.  Once I build 
> the model and evaluate performance, I'll share the reports, the model and a 
> tgz of the *-sentences.txt files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
