[ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729638#comment-14729638 ]
Tim Allison commented on TIKA-1723:
-----------------------------------

Yes, I agree... that's a potential mess/challenge/opportunity. We might want to see how Solr is handling that now.

This is overkill, but do we need a separate object for this: a LanguageSpec that includes language, extlang, script, region, variant, extension, and private-use, all nullable except {{language}}?

For {{loadModels}} and {{hasModels}}, we could require an exact match. We could add a {{getMatchingModels}} that would return the set of models that match the non-null items.

A LanguageResult would have a LanguageSpec object instead of a String, and it would be the best-effort parse of whatever the underlying language identifier said.

Again, this might just be too much...

> Integrate language-detector into Tika
> -------------------------------------
>
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723-2.patch, TIKA-1723-3.patch, TIKA-1723.patch, TIKA-1723v2.patch
>
>
> The language-detector project at https://github.com/optimaize/language-detector is faster, has more languages (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a number of issues this raises, especially if [~chrismattmann] moves forward with turning language detection into a pluggable extension point.
> I'll add comments with results below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
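
To make the LanguageSpec idea from the comment above concrete, here is a minimal Java sketch of what such an object and the two matching modes might look like. Everything in it is hypothetical: the field names follow the list in the comment, but the constructor, {{matches}}, and the static {{getMatchingModels}} helper are illustrative assumptions and not existing Tika API. It also assumes subtags have already been normalized, since the comparison is case-sensitive.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch only; no class with this shape exists in Tika.
final class LanguageSpec {
    final String language;    // required
    final String extlang;     // nullable
    final String script;      // nullable
    final String region;      // nullable
    final String variant;     // nullable
    final String extension;   // nullable
    final String privateUse;  // nullable

    LanguageSpec(String language, String extlang, String script, String region,
                 String variant, String extension, String privateUse) {
        this.language = Objects.requireNonNull(language, "language is required");
        this.extlang = extlang;
        this.script = script;
        this.region = region;
        this.variant = variant;
        this.extension = extension;
        this.privateUse = privateUse;
    }

    // Exact match (the loadModels/hasModels case): all fields equal, nulls included.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LanguageSpec)) return false;
        LanguageSpec other = (LanguageSpec) o;
        return language.equals(other.language)
                && Objects.equals(extlang, other.extlang)
                && Objects.equals(script, other.script)
                && Objects.equals(region, other.region)
                && Objects.equals(variant, other.variant)
                && Objects.equals(extension, other.extension)
                && Objects.equals(privateUse, other.privateUse);
    }

    @Override
    public int hashCode() {
        return Objects.hash(language, extlang, script, region, variant, extension, privateUse);
    }

    // Partial match: every non-null field of this spec must equal the candidate's field.
    boolean matches(LanguageSpec candidate) {
        return language.equals(candidate.language)
                && fieldMatches(extlang, candidate.extlang)
                && fieldMatches(script, candidate.script)
                && fieldMatches(region, candidate.region)
                && fieldMatches(variant, candidate.variant)
                && fieldMatches(extension, candidate.extension)
                && fieldMatches(privateUse, candidate.privateUse);
    }

    // A null "wanted" value acts as a wildcard.
    private static boolean fieldMatches(String wanted, String actual) {
        return wanted == null || wanted.equals(actual);
    }

    // Returns the available models that match the non-null items of the query,
    // preserving the iteration order of the input set.
    static Set<LanguageSpec> getMatchingModels(LanguageSpec query, Set<LanguageSpec> available) {
        Set<LanguageSpec> result = new LinkedHashSet<>();
        for (LanguageSpec model : available) {
            if (query.matches(model)) {
                result.add(model);
            }
        }
        return result;
    }
}
```

Under these assumptions, a query with only {{language}} set to "zh" would match models for both zh-Hant-TW and zh-Hans-CN, while an exact-match lookup for the same bare "zh" spec would find neither.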