Fengtan created NUTCH-2278:
------------------------------
Summary: Handle alpha-2 language codes consistently
Key: NUTCH-2278
URL: https://issues.apache.org/jira/browse/NUTCH-2278
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.12
Reporter: Fengtan
Priority: Minor
The language-identifier plugin provides two extraction policies: detect and
identify.
However, the two policies handle
[alpha-2|https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2] codes differently:
* 'identify' strips out the alpha-2 code, e.g. if the identified language is
'en-US' then it injects 'en' into the meta tags
* 'detect' does not strip out the alpha-2 code, e.g. if the detected language is
'en-US' then it injects 'en-US' into the meta tags
Any chance we can make this consistent and always strip out the alpha-2 code?
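
To illustrate the behaviour being requested, here is a minimal sketch of a normalization step that both policies could apply before injecting the language into the meta tags. It is only an illustration under assumptions: LanguageCodeNormalizer and stripRegion are hypothetical names, not existing Nutch classes or methods, and the sketch assumes the region part is separated from the language by '-' or '_'.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: shows one way to reduce a code
// such as 'en-US' to its language part 'en'.
final class LanguageCodeNormalizer {

    private LanguageCodeNormalizer() {}

    /**
     * Returns the language part of a code, lower-cased, dropping anything
     * after the first '-' or '_'. Returns null for null input.
     */
    static String stripRegion(String code) {
        if (code == null) {
            return null;
        }
        String trimmed = code.trim();
        int dash = trimmed.indexOf('-');
        int underscore = trimmed.indexOf('_');
        // Pick the earliest separator that is present, or -1 if there is none.
        int sep;
        if (dash < 0) {
            sep = underscore;
        } else if (underscore < 0) {
            sep = dash;
        } else {
            sep = Math.min(dash, underscore);
        }
        String language = sep < 0 ? trimmed : trimmed.substring(0, sep);
        return language.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(stripRegion("en-US")); // prints "en"
        System.out.println(stripRegion("fr"));    // prints "fr"
    }
}
```

With something along these lines applied in both code paths, 'identify' and 'detect' would both inject 'en' for a page whose language is 'en-US'.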