[
https://issues.apache.org/jira/browse/OPENNLP-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310946#comment-17310946
]
Tim Allison commented on OPENNLP-1270:
--------------------------------------
We recently got a request on TIKA-3340 to add detection of Burmese. Leipzig
has only a small amount of Burmese data (~2k sentences), but the script is
distinct, and the Leipzig data didn't appear to contain much English or text in
other languages.
In the link above, which no longer works, I had unpacked the following Leipzig
languages not currently covered by OpenNLP: amh asm azj ban div hat mhr ori tuk
uig xho yid.
For Tika, I rebuilt a language detection model on the original OpenNLP Leipzig
data plus the extra languages.
If I redo the work to unpack the above languages and Burmese, are there any
objections if I commit them to:
https://svn.apache.org/repos/bigdata/opennlp/trunk/leipzig/data ?
Thank you!
> Add new languages to the language detector
> ------------------------------------------
>
> Key: OPENNLP-1270
> URL: https://issues.apache.org/jira/browse/OPENNLP-1270
> Project: OpenNLP
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.9.4
>
> Attachments: report.txt, report.txt
>
>
> Leipzig has several other languages that might be useful to add to the
> language detector. I've selected some with more than 10k sentences. Once I
> build the model and evaluate performance, I'll share the reports, the model,
> and a tgz of the *-sentences.txt files.