[jira] [Commented] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

Julien Nioche (JIRA) Fri, 15 Jun 2012 09:02:43 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295738#comment-13295738
 ]


Julien Nioche commented on NUTCH-1397:
--------------------------------------

Lewis, the language identification combines parsing of the HTML 
(done in Nutch) with statistical guessing (from Tika). The parser component 
ignores compound values and returns only the main language code. The 
statistical component returns only the 2-letter code (and given how bad it 
is at it, I don't think it would be wise to try to make it more 
specific). In a nutshell, these compound language codes are not supported in 
Nutch. We could possibly store a separate value with the secondary code when 
it is available from the parsing, but not from the identifier.
Does that make sense?
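
To illustrate the idea of keeping the main code and storing the secondary 
code separately, here is a minimal Java sketch. It is not existing Nutch or 
Tika code: the class LanguageTagSketch and its methods are hypothetical names, 
and it assumes the compound value uses "-" or "_" as the separator (as in 
en-GB or en_US).

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Nutch or Tika: splits a compound language
// value such as "en-GB" into a primary code and an optional secondary code.
final class LanguageTagSketch {

    private LanguageTagSketch() {
    }

    // Splits on the first '-' or '_' only, so "en-GB" gives ["en", "GB"]
    // and a plain "en" gives ["en"].
    private static String[] split(String tag) {
        return tag.trim().split("[-_]", 2);
    }

    // Main language code, lowercased: "en" for "en-GB", "EN_us" or "en".
    static String primary(String tag) {
        return split(tag)[0].toLowerCase(Locale.ROOT);
    }

    // Secondary code, uppercased, when one is present: "GB" for "en-GB".
    // Empty for a plain "en" or a dangling separator such as "en-".
    static Optional<String> secondary(String tag) {
        String[] parts = split(tag);
        if (parts.length < 2 || parts[1].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(parts[1].toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"en", "en-GB", "en_US"}) {
            System.out.println(tag + " primary=" + primary(tag)
                    + " secondary=" + secondary(tag).orElse("(none)"));
        }
    }
}
```

With something along these lines, the value from primary() would stay as the 
language the parser reports today, and the value from secondary() could be 
stored as an additional field only when the parsed HTML provides it.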
                
> language-identifier incorrectly handles double-barreled language properties
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1397
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1397
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.1
>
>
> Currently, when language-identifier is activated, it parses and identifies 
> language-type=en but does not identify en-GB or en-US. This issue 
> should correct that. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
