Add ability to create language profiles to tika-app
---------------------------------------------------

                 Key: TIKA-546
                 URL: https://issues.apache.org/jira/browse/TIKA-546
             Project: Tika
          Issue Type: New Feature
          Components: cli, languageidentifier
    Affects Versions: 0.7
            Reporter: Jan Høydahl


Since TIKA-490, adding new language profiles to Tika is supposed to be easy. 
However, the process currently involves running Nutch's NGramProfile tool and 
editing its output.

We should port Nutch's profile builder to Tika and make it part of tika-app.jar:
# See http://wiki.apache.org/nutch/LanguageIdentifier
# java -jar tika-app.jar --create-profile [--gramsizes=<n>,<n>,...] [--maxlines=<max>] <profile-name> <filename> <encoding>

With --gramsizes and --maxlines, the tool could produce both Tika-style and 
Nutch-style profiles, which would let us deprecate the Nutch tool. The defaults 
should be --gramsizes=3 --maxlines=1000.
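
To illustrate what such a builder might do, here is a minimal sketch, not the 
actual Nutch or Tika code. The class name NGramProfileSketch, the build() 
method, and the convention of padding each word with '_' are all assumptions 
made for illustration. The gramSizes and maxLines parameters mirror the 
proposed --gramsizes and --maxlines options.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of an n-gram profile builder; not Tika's or Nutch's API. */
public class NGramProfileSketch {

    /**
     * Counts n-grams of the requested sizes over at most maxLines input lines.
     * Words are lower-cased and padded with '_' on both sides (assumed convention).
     */
    public static Map<String, Integer> build(Iterable<String> lines, int[] gramSizes, int maxLines) {
        Map<String, Integer> counts = new TreeMap<>();
        int linesRead = 0;
        for (String line : lines) {
            if (linesRead++ >= maxLines) {
                break; // honour the line limit
            }
            // Split on any run of characters that are not letters.
            for (String word : line.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (word.isEmpty()) {
                    continue; // split() yields "" when the line starts with a non-letter
                }
                String padded = "_" + word + "_";
                for (int n : gramSizes) {
                    for (int i = 0; i + n <= padded.length(); i++) {
                        counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // In a real tool the lines would come from <filename>, decoded with <encoding>,
        // e.g. via Files.readAllLines(Paths.get(filename), Charset.forName(encoding)).
        List<String> sample = Arrays.asList("The cat", "the hat", "ignored line");
        Map<String, Integer> profile = build(sample, new int[] {3}, 2);
        // Print one "ngram count" pair per line.
        profile.forEach((gram, count) -> System.out.println(gram + " " + count));
    }
}
```

Passing several sizes, for example new int[] {1, 2, 3, 4}, would approximate a 
multi-size profile, while a single size of 3 matches the proposed default.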


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
