Encoding based language estimate wrong for UTF-8 plaintext
----------------------------------------------------------

                 Key: TIKA-501
                 URL: https://issues.apache.org/jira/browse/TIKA-501
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 0.7
            Reporter: Jan Høydahl


Using the CLI tool on plain-text file and outputting metadata.
The "Content-Language:" is output based on encoding based language estimate. 
But it is not reliable as it does not detect anything for UTF-8 and detects 
english for ISO-8859-1.

Jukka wrote:
{quote}
We already dropped encoding-based language estimates from the HTML
parser, and I think we should do the same also for plain text
documents.
{quote}

Chris, Paul and Ingo already +1'ed this on the mailing list.

PS: I think it is unclear that "Content-Language" is not based on the 
LanguageIdentifier feature. Would make sense to clarify this. However, there's 
another issue filed to enable true language identification from CLI as well, 
which would fill this gap.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to