[
https://issues.apache.org/jira/browse/NUTCH-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461637#comment-17461637
]
Lewis John McGibbney commented on NUTCH-2278:
---------------------------------------------
Out of curiosity [~Fengtan] are you still around? I never saw this JIRA ticket
until now :(
> Handle alpha-2 language codes consistently
> ------------------------------------------
>
> Key: NUTCH-2278
> URL: https://issues.apache.org/jira/browse/NUTCH-2278
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Affects Versions: 1.12
> Reporter: Fengtan
> Priority: Minor
> Fix For: 1.19
>
> Attachments: NUTCH-2278.patch, NUTCH-2278.patch
>
>
> The language-identifier plugin provides two extraction policies: detect and
> identify.
> However, the two policies handle
> [alpha-2|https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2] codes differently:
> * 'identify' strips out the alpha-2 code, e.g. if the identified language is
> 'en-US' then it will inject 'en' in the meta tags
> * 'detect' does not strip out the alpha-2 code, e.g. if the detected language
> is 'en-US' then it will inject 'en-US' in the meta tags
> Any chance we can make this consistent and always strip out the alpha-2 code?
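For illustration only, the requested behaviour (always keeping just the primary language subtag, so that 'en-US' becomes 'en' under both policies) could be sketched as below. This is not the attached patch or the plugin's actual code: the class name LanguageCodeNormalizer and the method stripRegion are hypothetical, and accepting '_' as a separator and lowercasing the result are assumptions.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: reduces a tag such as "en-US"
// to its primary language subtag "en".
public class LanguageCodeNormalizer {

    public static String stripRegion(String code) {
        if (code == null) {
            return null;
        }
        String trimmed = code.trim();
        // Find the first separator; '_' is accepted as an assumption,
        // since some sources emit "en_US" rather than "en-US".
        int cut = -1;
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (c == '-' || c == '_') {
                cut = i;
                break;
            }
        }
        String primary = (cut >= 0) ? trimmed.substring(0, cut) : trimmed;
        // Lowercasing is an assumption made for consistent output.
        return primary.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(stripRegion("en-US"));
        System.out.println(stripRegion("fr"));
    }
}
```

If both the 'detect' and 'identify' code paths passed their result through one such function before writing the meta tag, the two policies would agree on the injected value.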