[ http://issues.apache.org/jira/browse/NUTCH-162?page=comments#action_12362363 ]
KuroSaka TeruHiko commented on NUTCH-162:
-----------------------------------------

I agree with Paul in principle. With the current way of designating a language by the language code alone, there is no way to distinguish Simplified Chinese from Traditional Chinese, the two written variants of the Chinese language. These have traditionally been distinguished by zh_cc, where cc is a country code such as tw (Taiwan, which uses the Traditional form) or cn (People's Republic of China, which uses the Simplified form). Nutch currently displays Simplified Chinese for "zh", which would disappoint Traditional Chinese readers.

I am not too sure about *always* using the llcc naming convention.

(1) To be compatible with web standards and practice, and to avoid inventing another naming convention, I would prefer using a hyphen (minus sign) as the delimiter, e.g. en-au.

(2) Not all languages need a country modifier. Japanese, for example, is spoken (by a community large enough to develop its own dialect) only in Japan. Major browsers send out "ja", not "ja-jp".

(3) We would still need the generic "en" (or "fr") because many browsers have a generic English (French) setting with which "en" is sent without a country code, and because we would need a fallback locale when the browser specifies an unsupported country variant of English.

(By the way, jwJA is an interesting example, but jw is not a registered ISO language code (it's jv), and Java is not a country.)

Another thing we need to be concerned about is the implication of the new locale naming scheme. The language identifier and analyzer plugins (in trunk) are consumers of the locale too. The current code assumes two-letter language names. This needs to be extended to accept both types of names, ll and ll-cc.
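To make points (1) to (3) concrete, here is a minimal sketch of the kind of fallback lookup I have in mind. It is not taken from the Nutch code base: the class name LocaleFallback, its methods, and the list of supported locales are all hypothetical, chosen only for illustration. It accepts both ll and ll-cc names (with either "-" or "_" as the delimiter), tries the exact name first, then the bare language code, and finally a default.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: resolves a requested locale name
// against the set of locales the web GUI actually supports.
final class LocaleFallback {

    /** Normalizes names such as "zh_TW" or "EN-au" to lowercase hyphenated form ("zh-tw", "en-au"). */
    static String normalize(String tag) {
        return tag.trim().replace('_', '-').toLowerCase(Locale.ROOT);
    }

    /** Returns the language part of a normalized name ("zh-tw" gives "zh", "ja" gives "ja"). */
    static String languageOf(String normalizedTag) {
        int dash = normalizedTag.indexOf('-');
        return dash < 0 ? normalizedTag : normalizedTag.substring(0, dash);
    }

    /**
     * Lookup order: exact ll-cc match, then the generic ll, then defaultTag.
     * The supported set is assumed to hold normalized names.
     */
    static String resolve(String requested, Set<String> supported, String defaultTag) {
        String tag = normalize(requested);
        if (supported.contains(tag)) {
            return tag;
        }
        String language = languageOf(tag);
        if (supported.contains(language)) {
            return language;
        }
        return defaultTag;
    }

    public static void main(String[] args) {
        // Assumed set of supported locales, for illustration only.
        Set<String> supported =
                new LinkedHashSet<>(Arrays.asList("en", "ja", "zh-cn", "zh-tw"));

        System.out.println(resolve("zh_TW", supported, "en")); // exact match: zh-tw
        System.out.println(resolve("en-AU", supported, "en")); // falls back to generic: en
        System.out.println(resolve("ja", supported, "en"));    // bare language code: ja
        System.out.println(resolve("zh", supported, "en"));    // no generic zh here: en
    }
}
```

Whether a bare "zh" should fall through to the default, as in this sketch, or map to one of the two written forms is a policy decision that would still need to be made.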
> country code "jp" is used instead of language code "ja" for Japanese
> --------------------------------------------------------------------
>
>          Key: NUTCH-162
>          URL: http://issues.apache.org/jira/browse/NUTCH-162
>      Project: Nutch
>         Type: Bug
>   Components: web gui
>     Versions: 0.7.1
>  Environment: n/a
>     Reporter: KuroSaka TeruHiko
>     Priority: Trivial
>
> In the locale switching link for Japanese, "jp" is used as the language code,
> but it is an ISO country code. The language code "ja" should be used.
> By the way, I don't think many users are familiar with the ISO language
> codes. A Canadian user may click on "ca" not knowing that ca stands for
> Catalan, not Canadian English or French. Rather than listing the language
> codes, listing the language names in their respective languages may be better.
> (I say "may be" because the browser could show some language names as
> corrupted text if the current font does not support that language. This is
> a difficult problem.)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
