[
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718494#comment-14718494
]
Tim Allison edited comment on TIKA-1723 at 8/28/15 12:56 PM:
-------------------------------------------------------------
I've only taken a brief look, but I think that moving to an abstract
LanguageDetector is great!
# On the confidence scores...I suspect that different detectors will have
different underlying score distributions. Should we have the Tika wrapper for
each detector normalize the confidence scores so that they are, say, centered
around 0.5, or should we let each detector determine what counts as "high" or
"medium"...that is, still include the raw confidence scores, but add a field to
LanguageResult for "high/medium/low"?
# Should we add a {{setPriors(Map<String,Float> langPriors)}} to
LanguageDetector? Some implementations may use it and others may ignore it
(similar to what you're doing with mixedLanguages and shortText).
# Could we rename {{LangDetector}} to something like {{OptimaizeLangDetector}}
so that if we later integrate Cybozu Labs' langdetect, there won't be any
confusion?
# Should we take this opportunity to create a new tika-langdetect module?
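To make points 1 and 2 concrete, here's a rough sketch of the API shape I have in mind. The class and method names ({{LanguageDetector}}, {{LanguageResult}}, {{setPriors}}, the {{Confidence}} enum) are just illustrative, not actual Tika code:

```java
import java.util.Map;

// Sketch only: an abstract detector where priors are an optional hint,
// mirroring how mixedLanguages/shortText are optional in the current patch.
abstract class LanguageDetector {
    // Implementations that support priors can override this; others
    // can safely ignore it and keep the default no-op behavior.
    public LanguageDetector setPriors(Map<String, Float> langPriors) {
        return this;
    }

    public abstract LanguageResult detect(CharSequence text);
}

// Sketch only: a result that keeps the detector's raw score but also
// buckets it, so callers don't need to know each detector's distribution.
class LanguageResult {
    enum Confidence { HIGH, MEDIUM, LOW }

    private final String language;       // e.g. "en"
    private final float rawScore;        // detector-specific distribution
    private final Confidence confidence; // detector's own bucketing

    LanguageResult(String language, float rawScore, Confidence confidence) {
        this.language = language;
        this.rawScore = rawScore;
        this.confidence = confidence;
    }

    String getLanguage() { return language; }
    float getRawScore() { return rawScore; }
    Confidence getConfidence() { return confidence; }
}
```

The idea being that each detector wrapper maps its own raw scores onto the shared {{Confidence}} buckets, while callers who care can still read the raw score.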
On a related note, as I was looking at ProfilingHandler and ProfilingWriter,
I'm wondering whether it's possible to include the language detection in the
handler component instead of the writer component. I _think_ this would allow
easier dual language detection and content handling. The goal would be
something like what [~chrismattmann] added to tika-server: wrap ToXMLHandler
(or a friend) in a LanguageDetectorHandler...the XMLHandler would write the
chars to the specified OutputStream, and the LanguageDetectorHandler would
compute the language detection stats.
If there's an obvious way to do this now, please let me know. If not, I can
try to implement it with our current language detector, and then your Optimaize
wrapper (and any others we choose to add) could use the same approach.
> Integrate language-detector into Tika
> -------------------------------------
>
> Key: TIKA-1723
> URL: https://issues.apache.org/jira/browse/TIKA-1723
> Project: Tika
> Issue Type: Improvement
> Components: languageidentifier
> Affects Versions: 1.11
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Priority: Minor
> Attachments: TIKA-1723.patch
>
>
> The language-detector project at
> https://github.com/optimaize/language-detector is faster, supports more
> languages (70 vs. 13), and has better accuracy than the built-in language
> detector.
> This is a stab at integrating it, with some initial findings. There are a
> number of issues this raises, especially if [~chrismattmann] moves forward
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)