[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728023#comment-14728023 ]
Tim Allison edited comment on TIKA-1723 at 9/2/15 8:59 PM:
-----------------------------------------------------------
Ken,
This looks great. And, yes, I wouldn't want anyone to confuse your patch2
with my horrible mess. :)
To confirm, is this the overall goal:
# Make language detection configurable via TikaConfig
# Create a separate package tika-lang-detect (or similar) and put the various
language detection implementations/dependencies there, including Tika's legacy
detection code and Optimaize
# Make Optimaize the default language detector in tika-app and tika-server
# Add other lang detectors as desired to the new package
# Deprecate and then eventually remove ProfilingHandler and ProfilingWriter
If everyone is OK with committing the patch as is and then doing some fairly
substantial moving into the new package next week (or so), then, yes, go for it.
I'm excited to try out Optimaize. Thank you for the integration!
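To make the goal list above concrete, here is a rough sketch of what a configurable, pluggable detector could look like. Everything in it is an assumption for illustration: the LanguageDetector interface, the DetectorRegistry class, the "legacy" and "optimaize-stub" names, and the toy stopword matcher are all hypothetical, and none of it is Tika's or Optimaize's actual API or the code in the attached patches. The stub detectors only stand in for the real implementations so the example runs without dependencies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

public class PluggableLangDetectSketch {

    /** Hypothetical extension point; not Tika's actual API. */
    interface LanguageDetector {
        /** Returns a language code, or "unknown" if nothing matched. */
        String detect(String text);
    }

    /** Toy detector that counts stopword hits per language (illustration only). */
    static class StopwordDetector implements LanguageDetector {
        private final Map<String, Set<String>> stopwords;

        StopwordDetector(Map<String, Set<String>> stopwords) {
            this.stopwords = stopwords;
        }

        @Override
        public String detect(String text) {
            String[] words = text.toLowerCase(Locale.ROOT).split("\\W+");
            String best = "unknown";
            int bestHits = 0;
            // On a tie, the language registered first wins.
            for (Map.Entry<String, Set<String>> e : stopwords.entrySet()) {
                int hits = 0;
                for (String w : words) {
                    if (e.getValue().contains(w)) {
                        hits++;
                    }
                }
                if (hits > bestHits) {
                    bestHits = hits;
                    best = e.getKey();
                }
            }
            return best;
        }
    }

    /** Hypothetical registry: maps a configured name to a detector factory. */
    static class DetectorRegistry {
        private final Map<String, Supplier<LanguageDetector>> factories = new LinkedHashMap<>();
        private final String defaultName;

        DetectorRegistry(String defaultName) {
            this.defaultName = defaultName;
        }

        void register(String name, Supplier<LanguageDetector> factory) {
            factories.put(name, factory);
        }

        /** A null name means "nothing configured", so the default is used. */
        LanguageDetector create(String configuredName) {
            String name = configuredName == null ? defaultName : configuredName;
            Supplier<LanguageDetector> factory = factories.get(name);
            if (factory == null) {
                throw new IllegalArgumentException("No detector registered as: " + name);
            }
            return factory.get();
        }
    }

    private static Set<String> words(String... w) {
        return new HashSet<>(Arrays.asList(w));
    }

    /** Registers two stand-in detectors; the wider-coverage one is the default. */
    static DetectorRegistry defaultRegistry() {
        Map<String, Set<String>> few = new LinkedHashMap<>();
        few.put("en", words("the", "and", "is"));
        few.put("de", words("der", "und", "ist"));

        Map<String, Set<String>> more = new LinkedHashMap<>(few);
        more.put("fr", words("le", "la", "est", "et"));

        DetectorRegistry registry = new DetectorRegistry("optimaize-stub");
        registry.register("legacy", () -> new StopwordDetector(few));
        registry.register("optimaize-stub", () -> new StopwordDetector(more));
        return registry;
    }

    public static void main(String[] args) {
        DetectorRegistry registry = defaultRegistry();
        String text = "Le chat est sur la table";
        System.out.println("default: " + registry.create(null).detect(text));
        System.out.println("legacy:  " + registry.create("legacy").detect(text));
    }
}
```

The point of the registry is that tika-app and tika-server would only need a detector name from configuration, so swapping the default or adding another implementation would not touch the calling code.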
> Integrate language-detector into Tika
> -------------------------------------
>
> Key: TIKA-1723
> URL: https://issues.apache.org/jira/browse/TIKA-1723
> Project: Tika
> Issue Type: Improvement
> Components: languageidentifier
> Affects Versions: 1.11
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Priority: Minor
> Attachments: TIKA-1723-2.patch, TIKA-1723-3.patch, TIKA-1723.patch,
> TIKA-1723v2.patch
>
>
> The language-detector project at
> https://github.com/optimaize/language-detector is faster, has more languages
> (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a
> number of issues this raises, especially if [~chrismattmann] moves forward
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.