Add ability to create language profiles to tika-app
---------------------------------------------------
Key: TIKA-546
URL: https://issues.apache.org/jira/browse/TIKA-546
Project: Tika
Issue Type: New Feature
Components: cli, languageidentifier
Affects Versions: 0.7
Reporter: Jan Høydahl
Since TIKA-490, adding new language profiles to Tika is supposed to be easy.
However, the process currently involves running Nutch's NGramProfile tool and
then editing its output.
We should port Nutch's profile builder to Tika and make it part of tika-app.jar:
# See http://wiki.apache.org/nutch/LanguageIdentifier
# java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...]
  [--maxlines=<max>] <profile-name> <filename> <encoding>
Using --gramsizes and --maxlines, we could support both Tika-style profiles and
Nutch-style profiles and thus deprecate the Nutch tool. Defaults should be
--gramsizes=3 --maxlines=1000
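
To make the proposal concrete, here is a minimal sketch of what the two options might control. It is not the Nutch NGramProfile code or any existing Tika API: the class name ProfileSketch and the method countNgrams are hypothetical, and lowercasing the input is an assumption. It counts character n-grams of each requested size over at most maxLines lines of input.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Tika or Nutch.
public class ProfileSketch {

    // Counts character n-grams of every size in gramSizes,
    // reading at most maxLines lines from the input.
    static Map<String, Integer> countNgrams(List<String> lines, int[] gramSizes, int maxLines) {
        Map<String, Integer> counts = new HashMap<>();
        int limit = Math.min(maxLines, lines.size());
        for (int i = 0; i < limit; i++) {
            // Assumption: profiles are case-insensitive, so lowercase first.
            String line = lines.get(i).toLowerCase(Locale.ROOT);
            for (int n : gramSizes) {
                // Slide a window of width n across the line.
                for (int start = 0; start + n <= line.length(); start++) {
                    counts.merge(line.substring(start, start + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("banana", "bandana");
        // Mirrors the proposed defaults: --gramsizes=3 --maxlines=1000
        Map<String, Integer> counts = countNgrams(lines, new int[] {3}, 1000);
        System.out.println(counts.get("ana")); // prints 3
    }
}
```

A real profile builder would also need to read the file with the given encoding and write the counts out under the profile name; those steps are left out here.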