[
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718494#comment-14718494
]
Tim Allison edited comment on TIKA-1723 at 8/28/15 12:56 PM:
-------------------------------------------------------------
I've only taken a brief look, but I think that moving to an abstract
LanguageDetector is great!
# On the confidence scores...I suspect that different detectors will have
different underlying score distributions. Should we have the Tika wrapper for
each detector normalize the confidence scores so that they are, say, centered
around 0.5, or should we let each detector determine what counts as "high" or
"medium"...that is, still include the raw confidence scores, but add a field to
LanguageResult for "high/medium/low"?
# Should we add a {{setPriors(Map<String,Float> langPriors)}} to
LanguageDetector? Some implementations may use it and others may ignore it
(similar to what you're doing with mixedLanguages and shortText).
# Could we rename {{LangDetector}} to something like {{OptimaizeLangDetector}}
so that if we later integrate Cybozu Labs' langdetect, there won't be any
confusion?
# Should we take this opportunity to create a new tika-langdetect module?
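To make points 1 and 2 concrete, here's a rough sketch of the API shape I have in mind. The class and method names ({{LanguageDetector}}, {{LanguageResult}}, {{setPriors}}, the {{Confidence}} enum) are just illustrative, not actual Tika code:

```java
import java.util.Map;

// Sketch only: an abstract detector where priors are an optional hint,
// mirroring how mixedLanguages/shortText are optional in the current patch.
abstract class LanguageDetector {
    // Implementations that support priors can override this; others
    // can safely ignore it and keep the default no-op behavior.
    public LanguageDetector setPriors(Map<String, Float> langPriors) {
        return this;
    }

    public abstract LanguageResult detect(CharSequence text);
}

// Sketch only: a result that keeps the detector's raw score but also
// buckets it, so callers don't need to know each detector's distribution.
class LanguageResult {
    enum Confidence { HIGH, MEDIUM, LOW }

    private final String language;       // e.g. "en"
    private final float rawScore;        // detector-specific distribution
    private final Confidence confidence; // detector's own bucketing

    LanguageResult(String language, float rawScore, Confidence confidence) {
        this.language = language;
        this.rawScore = rawScore;
        this.confidence = confidence;
    }

    String getLanguage() { return language; }
    float getRawScore() { return rawScore; }
    Confidence getConfidence() { return confidence; }
}
```

The idea being that each detector wrapper maps its own raw scores onto the shared {{Confidence}} buckets, while callers who care can still read the raw score.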
On a related note, as I was looking at ProfilingHandler and ProfilingWriter,
I'm wondering whether it's possible to include the language detection in the
handler component instead of the writer component. I _think_ this would allow
easier dual language detection and content handling. The goal would be
something like what [~chrismattmann] added to tika-server: wrap ToXMLHandler
(or a friend) in a LanguageDetectorHandler...the XMLHandler would write the
chars to the specified OutputStream, and the LanguageDetectorHandler would
compute the language detection stats.
If there's an obvious way to do this now, please let me know. If not, I can
try to implement it with our current language detector, and then your Optimaize
wrapper (and any others we choose to add) could use the same approach.
> Integrate language-detector into Tika
> -------------------------------------
>
> Key: TIKA-1723
> URL: https://issues.apache.org/jira/browse/TIKA-1723
> Project: Tika
> Issue Type: Improvement
> Components: languageidentifier
> Affects Versions: 1.11
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Priority: Minor
> Attachments: TIKA-1723.patch
>
>
> The language-detector project at
> https://github.com/optimaize/language-detector is faster, supports more
> languages (70 vs. 13), and has better accuracy than the built-in language
> detector.
> This is a stab at integrating it, with some initial findings. There are a
> number of issues this raises, especially if [~chrismattmann] moves forward
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)