[ https://issues.apache.org/jira/browse/TIKA-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler updated TIKA-501:
-----------------------------
Attachment: TIKA-501.patch
> Encoding based language estimate wrong for UTF-8 plaintext
> ----------------------------------------------------------
>
> Key: TIKA-501
> URL: https://issues.apache.org/jira/browse/TIKA-501
> Project: Tika
> Issue Type: Bug
> Components: cli
> Affects Versions: 0.7
> Reporter: Jan Høydahl
> Assignee: Ken Krugler
> Priority: Minor
> Fix For: 0.8
>
> Attachments: TIKA-501.patch
>
>
> When the CLI tool is run on a plain-text file with metadata output,
> the "Content-Language:" value comes from an encoding-based language estimate.
> That estimate is unreliable: it detects nothing for UTF-8 and reports
> English for ISO-8859-1.
> Jukka wrote:
> {quote}
> We already dropped encoding-based language estimates from the HTML
> parser, and I think we should do the same also for plain text
> documents.
> {quote}
> Chris, Paul and Ingo already +1'ed this on the mailing list.
> PS: I think it is unclear that "Content-Language" is not based on the
> LanguageIdentifier feature. It would make sense to clarify this. However,
> another issue has already been filed to enable true language identification
> from the CLI, which would fill this gap.
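
The failure mode described above can be illustrated with a minimal sketch. This is not Tika's code: the lookup table, the function name estimate_language_from_charset, and the sample sentence are all hypothetical, chosen only to show why a charset is a poor proxy for a language.

```python
# Hypothetical charset-to-language table (an assumption, NOT Tika's mapping).
# Language-neutral charsets such as UTF-8 have no entry at all.
CHARSET_LANGUAGE_HINTS = {
    "ISO-8859-1": "en",
    "Shift_JIS": "ja",
    "EUC-KR": "ko",
}


def estimate_language_from_charset(charset):
    """Return a language guess for a charset name, or None if there is no hint."""
    return CHARSET_LANGUAGE_HINTS.get(charset)


# A Norwegian sentence that is representable in both UTF-8 and ISO-8859-1.
norwegian = "Blåbærsyltetøy er godt på brød."

for charset in ("UTF-8", "ISO-8859-1"):
    norwegian.encode(charset)  # encodes without error in both charsets
    print(charset, "->", estimate_language_from_charset(charset))
```

Under these assumptions the same Norwegian text yields no language for UTF-8 and "en" for ISO-8859-1, which matches the behaviour reported in the issue. The guess depends only on the charset, never on the text, so dropping the estimate loses little.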