[
https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295738#comment-13295738
]
Julien Nioche commented on NUTCH-1397:
--------------------------------------
Lewis, language identification combines parsing of the HTML (done in Nutch)
with statistical guessing (from Tika). The parser component ignores compound
values and returns only the main language code; the statistical component
returns only the 2-letter code (and given how bad it is at it, I don't think
it would be wise to try to make it more specific). In a nutshell, these
compound language codes are not supported in Nutch. We could possibly store a
separate value with the secondary code when it is available from the parsing,
but not from the identifier.
Makes sense?
> language-identifier incorrectly handles double-barreled language properties
> ---------------------------------------------------------------------------
>
> Key: NUTCH-1397
> URL: https://issues.apache.org/jira/browse/NUTCH-1397
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.6, 2.1
>
>
> Currently, when language-identifier is activated, it parses and identifies
> language-type=en, but it does not identify en-GB or en-US. This issue
> should correct that.