[ https://issues.apache.org/jira/browse/OPENNLP-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Wiesner closed OPENNLP-1615.
-----------------------------------

> Provide more languages for pre-trained UD-based OpenNLP models
> ---------------------------------------------------------------
>
>                 Key: OPENNLP-1615
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1615
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Models
>            Reporter: Martin Wiesner
>            Assignee: Martin Wiesner
>            Priority: Major
>             Fix For: 2.4.1
>
>
> As https://universaldependencies.org/ offers treebanks for many languages,
> we should add further basic, pre-trained models (sentence detection,
> tokenizer, POS tagging).
>
> A first investigation has shown promising results for the following languages:
> * “Bulgarian|bg|BTB”
> * “Czech|cs|PDT”
> * “Croatian|hr|SET”
> * “Danish|da|DDT”
> * “Estonian|et|EDT”
> * “Finnish|fi|TDT”
> * “Latvian|lv|LVTB”
> * “Norwegian|no|Bokmaal”
> * “Polish|pl|PDB”
> * “Portuguese|pt|GSD”
> * “Romanian|ro|RRT”
> * “Russian|ru|GSD”
> * “Serbian|sr|SET”
> * “Slovak|sk|SNK”
> * “Slovenian|sl|SSJ”
> * “Spanish|es|GSD”
> * “Swedish|sv|Talbanken”
> * “Ukrainian|uk|IU”
>
> Training succeeded for these languages, and the evaluation results showed
> solid to excellent performance.
>
> The previously available languages (EN, FR, DE, NL, IT) should also be
> retrained.
>
> Aims:
> * (Re-)Train the three models per language listed above with UD release 2.14
> * Package and release as JAR files via Maven Central
> * Optional (?): Release the model files via the classic channel (website)