[
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720675#comment-14720675
]
Ken Krugler commented on TIKA-1723:
-----------------------------------
Hi Tim - thanks for the fast review.
1. Re confidence scores: yes, their raw scores will have different ranges and
meanings. That's why I'd put the comment into LanguageResult about
these being normalized to conform to the range constants defined previously.
But I like your idea better - call this a "rawScore", and have a separate
enumerated confidence value (LOW, MED, HIGH). I'll go make that change.
2. Re setPriors - I haven't seen a case where it's necessary to dynamically
change the a priori probabilities when using language-detector, so I'd propose
having an alternative loadModels(Map<String, Float>). This way the detector
could load a different model depending on the probability (as an example). But
having a separate call to set the probabilities is also possible. Though in
that case, what if the set of languages doesn't match what was previously
loaded? Throw an error?
3. Re OptimaizeLangDetector - sure, makes sense to rename it.
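To make points 1 and 2 concrete, here is a minimal Java sketch, not the code
in the attached patch. Everything in it is an assumption for illustration: the
LanguageConfidence enum, the shape of LanguageResult, the SketchDetector
class, and the choice to throw IllegalArgumentException when the priors name a
language that has no model (one possible answer to the "Throw an error?"
question above).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical coarse buckets, named after the values in point 1.
enum LanguageConfidence { LOW, MED, HIGH }

// Hypothetical result type: the detector-specific raw score is kept unchanged,
// and the detector decides which bucket that score maps to.
final class LanguageResult {
    private final String language;
    private final float rawScore;
    private final LanguageConfidence confidence;

    LanguageResult(String language, float rawScore, LanguageConfidence confidence) {
        this.language = language;
        this.rawScore = rawScore;
        this.confidence = confidence;
    }

    String getLanguage() { return language; }
    float getRawScore() { return rawScore; }
    LanguageConfidence getConfidence() { return confidence; }
}

// Hypothetical detector that only tracks which languages it has models for.
final class SketchDetector {
    private final Set<String> supported;
    private Map<String, Float> priors = Collections.emptyMap();

    SketchDetector(Set<String> supported) {
        this.supported = new HashSet<>(supported);
    }

    // Point 2: a priori probabilities are supplied when the models are loaded.
    // Assumption: an unknown language is rejected instead of silently ignored.
    SketchDetector loadModels(Map<String, Float> languageProbabilities) {
        for (String lang : languageProbabilities.keySet()) {
            if (!supported.contains(lang)) {
                throw new IllegalArgumentException("No model available for language: " + lang);
            }
        }
        this.priors = Collections.unmodifiableMap(new HashMap<>(languageProbabilities));
        return this;
    }

    Map<String, Float> getPriors() { return priors; }
}

class LanguageDetectSketch {
    public static void main(String[] args) {
        Map<String, Float> priors = new HashMap<>();
        priors.put("en", 0.8f);
        priors.put("fr", 0.2f);

        SketchDetector detector =
                new SketchDetector(new HashSet<>(priors.keySet())).loadModels(priors);
        LanguageResult result = new LanguageResult("en", 0.93f, LanguageConfidence.HIGH);

        System.out.println(result.getLanguage() + " " + result.getConfidence()
                + " priors=" + detector.getPriors().size());
    }
}
```

Passing the priors to loadModels means the language set and the probabilities
are checked together in one place, so a later mismatch cannot arise. A
separate setPriors call would need the same check against whatever was already
loaded.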
> Integrate language-detector into Tika
> -------------------------------------
>
> Key: TIKA-1723
> URL: https://issues.apache.org/jira/browse/TIKA-1723
> Project: Tika
> Issue Type: Improvement
> Components: languageidentifier
> Affects Versions: 1.11
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Priority: Minor
> Attachments: TIKA-1723.patch, TIKA-1723v2.patch
>
>
> The language-detector project at
> https://github.com/optimaize/language-detector is faster, has more languages
> (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a
> number of issues this raises, especially if [~chrismattmann] moves forward
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.