[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

Ken Krugler (JIRA) Thu, 27 Aug 2015 16:47:29 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717768#comment-14717768
 ]


Ken Krugler commented on TIKA-1723:
-----------------------------------

Part of this work is making the API for language detection more generic. 
Currently it's tightly coupled to the existing internal implementation.

For example, a LanguageProfile is used both for the target language model and 
for the profile built from character statistics, but other detectors don't 
always work this way.

LanguageProfile also exposes implementation details publicly, e.g. 
DEFAULT_NGRAM_LENGTH is a public constant.

I've created an abstract LanguageDetector class plus a few new concrete 
classes, and have integrated language-detector using these.

To avoid breaking compatibility for existing users, I've left the current 
implementation in place. If the patch looks promising, I could turn the 
existing classes into facades for the new implementation and mark them as 
deprecated.
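
As a rough illustration of the facade idea (a minimal sketch, not the code in 
the attached patch): every class and method name below is hypothetical, and 
the stopword-counting detector is only a toy stand-in for a real 
implementation such as language-detector. The point is that the old-style 
class keeps its existing shape while delegating to the abstract detector.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical abstract API: callers see only text in, language code out,
// with no n-gram lengths or profile classes leaking through.
abstract class LanguageDetector {
    abstract void addText(CharSequence text);
    abstract Optional<String> detect();
    abstract void reset();
}

// Toy concrete detector (an assumption for illustration only): counts a few
// stopwords per language. A real subclass would wrap an actual library.
final class StopwordDetector extends LanguageDetector {
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", Set.of("the", "and", "of"));
        STOPWORDS.put("de", Set.of("der", "und", "die"));
    }

    private final Map<String, Integer> hits = new LinkedHashMap<>();

    @Override
    void addText(CharSequence text) {
        // Lowercase, split on anything that is not a letter, count matches.
        String[] tokens = text.toString().toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        for (String token : tokens) {
            for (Map.Entry<String, Set<String>> e : STOPWORDS.entrySet()) {
                if (e.getValue().contains(token)) {
                    hits.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
    }

    @Override
    Optional<String> detect() {
        // Language with the most stopword hits; empty if nothing matched.
        return hits.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    @Override
    void reset() {
        hits.clear();
    }
}

// Hypothetical facade: keeps an old-style "construct with content, ask for
// the language" shape, but all the work is done by the new detector.
@Deprecated
final class LegacyIdentifierFacade {
    private final LanguageDetector delegate = new StopwordDetector();

    LegacyIdentifierFacade(String content) {
        delegate.addText(content);
    }

    String getLanguage() {
        return delegate.detect().orElse("unknown");
    }
}
```

With this shape, existing callers would keep compiling against the 
deprecated class, while new code could target the abstract detector directly 
and swap implementations without touching call sites.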

> Integrate language-detector into Tika
> -------------------------------------
>
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723.patch
>
>
> The language-detector project at 
> https://github.com/optimaize/language-detector is faster, has more languages 
> (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a 
> number of issues this raises, especially if [~chrismattmann] moves forward 
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
