[ https://issues.apache.org/jira/browse/OPENNLP-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Zemerick updated OPENNLP-1182:
-----------------------------------
Fix Version/s: (was: 1.8.5)
> LanguageDetectorConverterTool is a no-op, despite the docs saying otherwise
> ---------------------------------------------------------------------------
>
> Key: OPENNLP-1182
> URL: https://issues.apache.org/jira/browse/OPENNLP-1182
> Project: OpenNLP
> Issue Type: Bug
> Affects Versions: 1.8.4
> Reporter: Steve Rowe
> Priority: Major
>
> Contrary to the docs (see below), LanguageDetectorConverterTool doesn't
> actually do anything at all; the class is empty.
> {quote}
> The following sequence of commands shows how to convert the Leipzig Corpora
> collection at folder leipzig-train/ to the default Language Detector format,
> by creating groups of 5 sentences as documents and limiting to 10000
> documents per language. Then, it shuffles the result and selects the first
> 100000 lines as the training corpus and the last 20000 as the evaluation corpus:
> {noformat}
> $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt
> $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt
> $ head -100000 < leipzig_shuf.txt > leipzig.train
> $ tail -20000 < leipzig_shuf.txt > leipzig.eval
> {noformat}
> {quote}
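
The shuffle-and-split step quoted above does not depend on the converter tool, so it can be sketched separately. The following is a minimal illustration in Python, not part of OpenNLP: the function name shuffle_and_split and its parameters are hypothetical, and it assumes one sample per line, as in the quoted commands. Unlike the head/tail pair, it refuses to run when the two slices would overlap, which happens whenever the shuffled file has fewer lines than the two sizes combined.

```python
import random


def shuffle_and_split(lines, n_train, n_eval, seed=0):
    """Shuffle lines, then return (first n_train, last n_eval).

    Hypothetical helper for illustration only; it mirrors the
    perl shuffle + head + tail commands from the quoted docs.
    """
    if n_train < 0 or n_eval < 0:
        raise ValueError("split sizes must be non-negative")
    if n_train + n_eval > len(lines):
        # head/tail would silently share lines in this case.
        raise ValueError("train and eval slices would overlap")

    shuffled = list(lines)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable

    train = shuffled[:n_train]
    # Index from the end explicitly so that n_eval == 0 yields an empty list.
    evaluation = shuffled[len(shuffled) - n_eval:]
    return train, evaluation
```

With the sizes from the quoted docs, this would be called as shuffle_and_split(lines, 100000, 20000), where lines holds the contents of leipzig.txt.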