[
https://issues.apache.org/jira/browse/TIKA-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Høydahl updated TIKA-490:
-----------------------------
Attachment: TIKA-490.janhoy.082310.patch
I've been thinking a bit more and am attaching what is, in my eyes, an improved
patch.
Those who override the config need better logging; otherwise they are in the
dark as to what happened.
Since we cannot use a log framework, I've added a way to detect and fetch
errors through the API.
I've also removed the IOException from the constructor's throws clause. That
exception could only be thrown if the property file did not exist, which in my
opinion will never happen, so we should not force people to catch it. It is
also better to keep backward API compatibility for all the existing
applications out there. Users who override the config will not get the
IOException either: if their tika.language.override.properties does not exist,
Tika loads the built-in one, and if the property file exists but is faulty, no
IOException is thrown either. If there are errors in the properties file, you
can check for them through the new methods, and even ask for a list of the
languages that were successfully initialized.
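To illustrate the fallback described above, here is a minimal sketch and not the code in the patch. The class name ConfigFallbackSketch, the built-in file name tika.language.properties, and the use of a StringBuilder to collect errors are all assumptions made for the example; only tika.language.override.properties is taken from this comment.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical stand-in for the config loading discussed above; not Tika code.
final class ConfigFallbackSketch {
    static final String OVERRIDE = "tika.language.override.properties";
    // Assumed name for the built-in file; the real name may differ.
    static final String BUILT_IN = "tika.language.properties";

    // Loads the override file if it is on the classpath, else the built-in one.
    // Problems are appended to 'errors' instead of being thrown to the caller.
    static Properties load(ClassLoader loader, StringBuilder errors) {
        Properties props = new Properties();
        String resource = loader.getResource(OVERRIDE) != null ? OVERRIDE : BUILT_IN;
        try (InputStream in = loader.getResourceAsStream(resource)) {
            if (in == null) {
                errors.append("Config not found: ").append(resource).append('\n');
            } else {
                props.load(in);
            }
        } catch (IOException | IllegalArgumentException e) {
            // A faulty file is recorded, not propagated.
            errors.append("Could not read ").append(resource)
                  .append(": ").append(e.getMessage()).append('\n');
        }
        return props;
    }

    public static void main(String[] args) {
        StringBuilder errors = new StringBuilder();
        Properties props = load(ConfigFallbackSketch.class.getClassLoader(), errors);
        System.out.println("Loaded " + props.size() + " properties");
        System.out.print(errors);
    }
}
```

The point of the design is that callers never see a checked exception from initialization; they inspect the collected errors afterwards if they care.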
Changes in the attached patch:
* public static boolean hasErrors()
* public static String getErrors()
* public static Set<String> getSupportedLanguages()
* Constructor no longer throws IOException
* Added better comments to public methods, including a warning for
isReasonablyCertain() and short texts
Sample output from a test application:
Supported languages: [is, da, it, no, hu, th, de, el, fi, pt, pl, sv, fr, en,
ru, et, es, nl]
Number of languages supported: 18
Has errors? true
Language xx (Unknown) not initialized. Message: Failed trying to load
language profile for language "xx". Error: null
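The output above comes from the patched LanguageIdentifier. The sketch below only mimics the pattern with a hypothetical stand-in class (ErrorReportingSketch), so it can run without Tika. The method names hasErrors(), getErrors() and getSupportedLanguages() follow the list of changes above; the "languages" property key, the Predicate used to simulate profile lookup, and the message text are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in that mimics the error-reporting API; not Tika code.
final class ErrorReportingSketch {
    private static final Set<String> SUPPORTED = new LinkedHashSet<>();
    private static final List<String> ERRORS = new ArrayList<>();

    // Assumed layout: "languages" holds a comma-separated list of language codes.
    // 'profileExists' simulates whether a profile can be loaded for a code.
    static void init(Properties config, Predicate<String> profileExists) {
        SUPPORTED.clear();
        ERRORS.clear();
        for (String raw : config.getProperty("languages", "").split(",")) {
            String code = raw.trim();
            if (code.isEmpty()) {
                continue;
            }
            if (profileExists.test(code)) {
                SUPPORTED.add(code);
            } else {
                // Record the failure instead of throwing.
                ERRORS.add("Language " + code + " not initialized.");
            }
        }
    }

    static boolean hasErrors() {
        return !ERRORS.isEmpty();
    }

    static String getErrors() {
        return String.join("\n", ERRORS);
    }

    static Set<String> getSupportedLanguages() {
        return Collections.unmodifiableSet(SUPPORTED);
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("languages", "en,fr,xx");
        List<String> available = Arrays.asList("en", "fr");
        init(config, available::contains);

        System.out.println("Supported languages: " + getSupportedLanguages());
        System.out.println("Number of languages supported: " + getSupportedLanguages().size());
        System.out.println("Has errors? " + hasErrors());
        if (hasErrors()) {
            System.out.println(getErrors());
        }
    }
}
```

A caller that overrides the config would typically check hasErrors() once after startup and log getErrors() through whatever logging it already uses.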
> Support for adding language profiles dynamically
> ------------------------------------------------
>
> Key: TIKA-490
> URL: https://issues.apache.org/jira/browse/TIKA-490
> Project: Tika
> Issue Type: Improvement
> Components: languageidentifier
> Affects Versions: 0.7
> Reporter: Jan Høydahl
> Assignee: Chris A. Mattmann
> Fix For: 0.8
>
> Attachments: TIKA-490.janhoy.082310.patch,
> TIKA-490.Mattmann.082210.2.patch.txt, TIKA-490.Mattmann.082210.patch.txt,
> TIKA-490.patch, TIKA-490.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Currently the Tika LanguageIdentifier loads language profiles through a
> hardcoded static block in the Java code.
> It would be better to make this configurable, so you could add your own
> languages without recompiling.
> Suggested approach:
> Remove the static code block loading all languages. Instead look for a
> tika.languageidentification.properties file on the classpath.
> Now the user can simply make his/her own (additional) language profile files,
> put them on the classpath together with a properties file and off you go!
> Also, once you make it configurable, there might be an issue of having the
> profiles as static members, as you will force the same behaviour for the
> whole VM. A static Map of Maps could solve this.
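For what it's worth, the "static Map of Maps" idea from the description could look roughly like the sketch below. Everything in it is hypothetical: the class name ProfileSetsSketch, the notion of a named configuration as the outer key, and the use of a plain String as a placeholder for a language profile.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of a static map of maps; not Tika code.
final class ProfileSetsSketch {
    // Outer key: a caller-chosen configuration name.
    // Inner map: language code -> profile (a String placeholder here).
    private static final Map<String, Map<String, String>> PROFILE_SETS =
            new ConcurrentHashMap<>();

    static void addProfile(String configName, String language, String profile) {
        PROFILE_SETS
                .computeIfAbsent(configName, k -> new ConcurrentHashMap<>())
                .put(language, profile);
    }

    // Returns a read-only view of one configuration's profiles (empty if unknown).
    static Map<String, String> profilesFor(String configName) {
        Map<String, String> profiles = PROFILE_SETS.get(configName);
        return profiles == null
                ? Collections.<String, String>emptyMap()
                : Collections.unmodifiableMap(profiles);
    }

    public static void main(String[] args) {
        addProfile("default", "en", "en-profile");
        addProfile("custom", "xx", "xx-profile");
        System.out.println(profilesFor("default").keySet());
        System.out.println(profilesFor("custom").keySet());
    }
}
```

The state is still static, but two users in the same VM can keep separate sets of profiles by choosing different configuration names.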