[ 
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720700#comment-14720700
 ] 

Ken Krugler commented on TIKA-1723:
-----------------------------------

Hi Tim - re putting language detection into the handler. I'd been thinking 
about how best to add language attributes to the XHTML being generated by the 
parsers, as I think that's the right way to handle multi-lingual documents (I 
assume that's what you mean by "dual language detection").

The problem is that you'd want the output to be hierarchical, in that <html 
lang=xx xml:lang=xx> is where you'd want to specify the "primary" language for 
the document, and then only add the lang attributes to elements where it's 
different.

But that would require deferring the output of all XHTML until after the 
document had been processed, or processing it twice, which seems ugly. So the 
other solution would be to add lang attributes at every opportunity (any 
element that supports the lang attribute), though only when the language 
differs from the enclosing element's language. But you'd want to process each 
chunk of text individually, e.g. you wouldn't know in advance if there's a 
list with a different language for each item.

Which means this is getting pretty complicated.
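
To make the "only tag where it differs from the enclosing element" idea 
concrete, here is a rough single-pass sketch in Java. It is not Tika code: 
the Node class, the render method and the detector function are all 
hypothetical stand-ins, the element tree is built by hand rather than coming 
from a parser, and text is not XML-escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not part of Tika: emits a lang attribute only where an
// element's detected language differs from its enclosing element's language.
class LangAnnotator {

    // Minimal stand-in for an XHTML element: a name, its own text, children.
    static final class Node {
        final String name;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) {
            this.name = name;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // 'detect' is an assumed detector: it maps a chunk of text to a language
    // code, or returns null when it cannot decide. Text is not XML-escaped.
    static String render(Node node, String enclosingLang,
                         Function<String, String> detect) {
        String detected = node.text.isEmpty() ? null : detect.apply(node.text);
        // With no text or no decision, inherit the enclosing language.
        String lang = (detected != null) ? detected : enclosingLang;

        StringBuilder sb = new StringBuilder("<").append(node.name);
        if (lang != null && !lang.equals(enclosingLang)) {
            sb.append(" lang=\"").append(lang).append("\"");
        }
        sb.append(">").append(node.text);
        for (Node child : node.children) {
            sb.append(render(child, lang, detect));
        }
        return sb.append("</").append(node.name).append(">").toString();
    }

    public static void main(String[] args) {
        // Toy detector for the example only.
        Function<String, String> detect =
            t -> t.contains("Bonjour") ? "fr" : "en";

        Node root = new Node("html", "")
            .add(new Node("ul", "")
                .add(new Node("li", "Hello world"))
                .add(new Node("li", "Bonjour le monde")));

        // Prints: <html><ul><li lang="en">Hello world</li><li lang="fr">Bonjour le monde</li></ul></html>
        System.out.println(render(root, null, detect));
    }
}
```

Note that <html> ends up with no lang attribute here, because in a single 
pass nothing is known about the primary language when the root element is 
written. That is the deferral problem described above.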


> Integrate language-detector into Tika
> -------------------------------------
>
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723.patch, TIKA-1723v2.patch
>
>
> The language-detector project at 
> https://github.com/optimaize/language-detector is faster, supports more 
> languages (70 vs 13), and is more accurate than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a 
> number of issues this raises, especially if [~chrismattmann] moves forward 
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
