[ 
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15132613#comment-15132613
 ] 

Tim Allison commented on TIKA-1723:
-----------------------------------

Agreed on the ease of building the new language detection framework in 2.0.

Given Mike's comparison of Tika and langdetect 
[here|http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html],
even though it is now dated, I'd be willing to mothball our language detector 
in 2.x (i.e. leave it in 1.x; if we need to resurrect it, we can). 
That said, I didn't write that code, and I know that [~toke] on TIKA-1549 has 
since dramatically improved our speed.

This is certainly a large enough issue to invite feedback from the entire 
community.  Do we want to drop our language detection code in 2.x?  Or is there 
a good reason to keep it?



> Integrate language-detector into Tika
> -------------------------------------
>
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723-2.patch, TIKA-1723-3.patch, TIKA-1723.patch, 
> TIKA-1723v2.patch
>
>
> The language-detector project at 
> https://github.com/optimaize/language-detector is faster, has more languages 
> (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a 
> number of issues this raises, especially if [~chrismattmann] moves forward 
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.



