[ https://issues.apache.org/jira/browse/TIKA-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-501:
-----------------------------

    Attachment: TIKA-501.patch

> Encoding based language estimate wrong for UTF-8 plaintext
> ----------------------------------------------------------
>
>                 Key: TIKA-501
>                 URL: https://issues.apache.org/jira/browse/TIKA-501
>             Project: Tika
>          Issue Type: Bug
>          Components: cli
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Ken Krugler
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: TIKA-501.patch
>
>
> When the CLI tool is run on a plain-text file and outputs metadata, the
> "Content-Language:" value is based on an encoding-based language estimate.
> That estimate is not reliable: it detects nothing for UTF-8 and reports
> English for ISO-8859-1.
> Jukka wrote:
> {quote}
> We already dropped encoding-based language estimates from the HTML
> parser, and I think we should do the same also for plain text
> documents.
> {quote}
> Chris, Paul and Ingo already +1'ed this on the mailing list.
> PS: It is not obvious that "Content-Language" is not based on the
> LanguageIdentifier feature, so it would make sense to clarify this. However,
> another issue has been filed to enable true language identification from the
> CLI as well, which would fill this gap.
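
To illustrate why an encoding-based estimate behaves the way the report
describes, here is a minimal sketch. It is not Tika's actual code: the class
name, the method name, and the charset-to-language table are all assumptions
made up for illustration. The idea is only that a lookup keyed on the charset
can answer for legacy single-language encodings, has nothing to say about
UTF-8, and can only ever give one fixed answer for ISO-8859-1.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is NOT how Tika is implemented.
public class EncodingLanguageSketch {

    // Assumed table: a charset can hint at a language only when the charset
    // is tied to a single language or script.
    private static final Map<String, String> CHARSET_TO_LANGUAGE = new HashMap<>();
    static {
        CHARSET_TO_LANGUAGE.put("SHIFT_JIS", "ja");
        CHARSET_TO_LANGUAGE.put("EUC-KR", "ko");
        // ISO-8859-1 covers many Western European languages, so any single
        // answer here is a guess. "en" mirrors the behaviour in the report.
        CHARSET_TO_LANGUAGE.put("ISO-8859-1", "en");
        // UTF-8 is deliberately absent: it can encode any language, so the
        // charset alone carries no language information.
    }

    // Returns a language code, or null when the charset gives no hint.
    public static String guessLanguage(String charsetName) {
        return CHARSET_TO_LANGUAGE.get(charsetName.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(guessLanguage("ISO-8859-1")); // prints "en"
        System.out.println(guessLanguage("UTF-8"));      // prints "null"
    }
}
```

Under these assumptions a Norwegian document stored as ISO-8859-1 would be
labelled English and the same document stored as UTF-8 would get no label at
all, which is the kind of result that motivates dropping the estimate.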

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
