Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On 06/08/2013 23:42, Stroller wrote: On 6 August 2013, at 14:04, Kerin Millar wrote: ... If undefined, the value of LC_COLLATE is inherited from LANG. I'm not sure that overriding it is particularly useful nowadays but it doesn't hurt. It's been a couple of years since I looked into this, but I'm given to believe that LANG should set all LC_ variables correctly, and that overriding them is frowned upon. As has been mentioned, there are valid reasons to want to override the collation. Here is a concrete example: https://lists.gnu.org/archive/html/bug-gnu-utils/2003-08/msg00537.html Strictly speaking, grep is correct to behave that way but it can be confounding. In an ideal world, everyone would be using named classes instead of ranges in their regular expressions but it's not an ideal world. These days, grep no longer exhibits this characteristic in Gentoo. Nevertheless, it serves as a valid example of how collations for UTF-8 locales can be a liability. Of the other distros, Arch Linux also defined LC_COLLATE=C although I understand that they have just recently stopped doing that. On a production system, I would still be inclined to use it for reasons of safety. For that matter, some people refuse to use UTF-8 at all on the grounds of security; the handling of variable-width encodings continues to be an effective bug inducer. I had to do this myself because, due to a bug, the en_GB time formatting failed to display am or pm. I believe this should be fixed now. Presumably: a) LANG was defined inappropriately b) LANG was defined appropriately but LC_TIME was defined otherwise c) LC_ALL was defined, trumping all I would definitely not advise doing any of these things. --Kerin
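To make the range-versus-named-class point concrete, here is a minimal sketch, assuming a POSIX shell and grep. The function names and the three-line sample are made up for illustration. Under the C collation the range [a-z] covers only the ASCII bytes a to z, whereas [[:lower:]] asks for lowercase letters by class and does not depend on collation order.

```shell
#!/bin/sh
# Sample input: lowercase, uppercase, lowercase, one letter per line.
sample() { printf 'a\nB\nc\n'; }

# Range expression: which characters fall "between a and z" depends on the
# collation in effect, so it is pinned to C here for a predictable result.
match_range() { sample | LC_ALL=C grep '[a-z]'; }

# Named class: evaluated in whatever locale is active; it selects lowercase
# letters by class rather than by position in the collation order.
match_class() { sample | grep '[[:lower:]]'; }

match_range
match_class
```

Both calls should print a and c. With a non-C collation, older grep builds reportedly let the same range match uppercase letters too, which is the behaviour described in the linked bug report.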
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On 7 August 2013, at 13:41, Kerin Millar wrote: On 06/08/2013 23:42, Stroller wrote: On 6 August 2013, at 14:04, Kerin Millar wrote: ... If undefined, the value of LC_COLLATE is inherited from LANG. I'm not sure that overriding it is particularly useful nowadays but it doesn't hurt. It's been a couple of years since I looked into this, but I'm given to believe that LANG should set all LC_ variables correctly, and that overriding them is frowned upon. As has been mentioned, there are valid reasons to want to override the collation. Here is a concrete example: https://lists.gnu.org/archive/html/bug-gnu-utils/2003-08/msg00537.html Strictly speaking, grep is correct to behave that way but it can be confounding. Linking also this answer, which you're aware of: https://lists.gnu.org/archive/html/bug-gnu-utils/2003-08/msg00600.html This only goes to illustrate that you shouldn't be going overriding these willy-nilly without full awareness of why you're doing so and what you're doing. I had to do this myself because, due to a bug, the en_GB time formatting failed to display am or pm. I believe this should be fixed now. Presumably: a) LANG was defined inappropriately b) LANG was defined appropriately but LC_TIME was defined otherwise c) LC_ALL was defined, trumping all I'm having trouble parsing this reply, but perhaps you might find the full bug description helpful. I wrote about 1000 words on the subject there last year. It is the top Google hit for en_gb am pm bug: http://sourceware.org/bugzilla/show_bug.cgi?id=3768 Stroller.
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On 07/08/2013 17:40, Stroller wrote: On 7 August 2013, at 13:41, Kerin Millar wrote: On 06/08/2013 23:42, Stroller wrote: On 6 August 2013, at 14:04, Kerin Millar wrote: ... If undefined, the value of LC_COLLATE is inherited from LANG. I'm not sure that overriding it is particularly useful nowadays but it doesn't hurt. It's been a couple of years since I looked into this, but I'm given to believe that LANG should set all LC_ variables correctly, and that overriding them is frowned upon. As has been mentioned, there are valid reasons to want to override the collation. Here is a concrete example: https://lists.gnu.org/archive/html/bug-gnu-utils/2003-08/msg00537.html Strictly speaking, grep is correct to behave that way but it can be confounding. Linking also this answer, which you're aware of: https://lists.gnu.org/archive/html/bug-gnu-utils/2003-08/msg00600.html Best practice will never be universally observed. This only goes to illustrate that you shouldn't be going overriding these willy-nilly without full awareness of why you're doing so and what you're doing. It also served to illustrate the overall point I was making - that sticking to the C/POSIX collation is not without value as a safety measure. Naturally, I would expect anyone else to exercise their own judgement. I had to do this myself because, due to a bug, the en_GB time formatting failed to display am or pm. I believe this should be fixed now. Presumably: a) LANG was defined inappropriately b) LANG was defined appropriately but LC_TIME was defined otherwise c) LC_ALL was defined, trumping all I'm having trouble parsing this reply, but perhaps you might find the full bug description helpful. I wrote about 1000 words on the subject there last year. It is the top Google hit for en_gb am pm bug: http://sourceware.org/bugzilla/show_bug.cgi?id=3768 OK. --Kerin
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On 05/08/2013 23:52, Chris Stankevitz wrote:

On Mon, Aug 5, 2013 at 11:53 AM, Mike Gilbert flop...@gentoo.org wrote:

The handbook documents setting a system-wide default locale. You generally do this by setting the LANG variable in /etc/conf.d/02locale. http://www.gentoo.org/doc/en/handbook/handbook-amd64.xml?part=1&chap=8#doc_chap3_sect3

Mike, Thank you for your help. I attempted to follow these instructions and ran into three problems. Can you please confirm the fixes I employed to deal with each of these issues:

1. The handbook suggests I should modify the file /etc/env.d/02locale, but that file does not exist on my system. RESOLUTION: create the file

Run eselect locale, first with the list parameter and then the set parameter as appropriate. It's easier.

2. The handbook suggests I should add this line to /etc/env.d/02locale: 'LANG=de_DE.UTF-8', but I do not speak the language DE. RESOLUTION: type instead 'LANG=en_US.UTF-8' to match /etc/locale.gen

Legitimate locales are those installed with glibc. These can be shown with either eselect locale list or locale -a.

3. The handbook suggests that I should add this line to /etc/env.d/02locale: 'LC_COLLATE=C', but I do not know if they are again talking about the language DE. RESOLUTION: I assumed LC_COLLATE=C refers to English and added the line without modification.

C refers to the POSIX locale [1]. Defining LC_COLLATE is a workaround for behaviour deemed surprising by those otherwise unaware of the impact of collations. For example, files beginning with a dot might no longer appear at the top of a directory listing, and ranges in regular expressions may be affected, depending on the extent to which a given program abides by the locale. Poorly written shell scripts that capture output from ls (assuming a given order) might also be affected.

If undefined, the value of LC_COLLATE is inherited from LANG. I'm not sure that overriding it is particularly useful nowadays but it doesn't hurt.
--Kerin [1] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html#tag_07_02
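The directory-listing effect mentioned above can be tried with a minimal sketch like the following, assuming a POSIX shell plus mktemp. The function name and the file names are hypothetical. LC_ALL is cleared for the call because, when set, it would take precedence over LC_COLLATE.

```shell
#!/bin/sh
# List a directory (dotfiles included, one name per line) under a chosen
# collation. $1 is the directory, $2 the LC_COLLATE value (default: C).
list_with_collation() {
    LC_ALL= LC_COLLATE="${2:-C}" ls -A1 "$1"
}

# Scratch directory with a dotfile, a capitalised name and a lowercase name.
demo_dir=$(mktemp -d)
touch "$demo_dir/.hidden" "$demo_dir/Apple" "$demo_dir/banana"

# Under C the names sort by byte value: '.' (46) < 'A' (65) < 'b' (98).
list_with_collation "$demo_dir" C

rm -rf "$demo_dir"
```

Passing an installed locale such as en_US.UTF-8 as the second argument may move the dotfile away from the top; the exact order depends on that locale's collation rules.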
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Tue, Aug 06, 2013 at 02:04:00PM +0100, Kerin Millar wrote:

Legitimate locales are those installed with glibc. These can be shown with either eselect locale list or locale -a.

I had never used eselect with locales (AFAIR) before today. Why does locale -a return utf8? I know UTF-8 is accepted as the standard and that utf8 is not, though it is usually recognized, but I want to understand why the locale -a output omits the standard form, which is set on my systems, and differs from the others:

o@workstation ~ $ eselect locale list
Available targets for the LANG variable:
  [1]   C
  [2]   POSIX
  [3]   en_US.utf8
  [4]   en_US.UTF-8 *
  [ ]   (free form)

mingdao@workstation ~ $ locale -a
C
POSIX
en_US.utf8

mingdao@workstation ~ $ locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=C
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

Cheers,
Bruce
--
Happy Penguin Computers ')
126 Fenco Drive ( \
Tupelo, MS 38801 ^^
supp...@happypenguincomputers.com
662-269-2706 662-205-6424
http://happypenguincomputers.com/

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

Don't top-post: http://en.wikipedia.org/wiki/Top_post#Top-posting
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On 06/08/2013 14:24, Bruce Hill wrote:

On Tue, Aug 06, 2013 at 02:04:00PM +0100, Kerin Millar wrote:

Legitimate locales are those installed with glibc. These can be shown with either eselect locale list or locale -a.

I had never used eselect with locales (AFAIR) before today. Why does locale -a return utf8? I know UTF-8 is accepted as the standard and that utf8 is not, though it is usually recognized, but I want to understand why the locale -a output omits the standard form, which is set on my systems, and differs from the others:

o@workstation ~ $ eselect locale list
Available targets for the LANG variable:
  [1]   C
  [2]   POSIX
  [3]   en_US.utf8
  [4]   en_US.UTF-8 *
  [ ]   (free form)

mingdao@workstation ~ $ locale -a
C
POSIX
en_US.utf8

mingdao@workstation ~ $ locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=C
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

Apparently, utf8 is the canonical representation in glibc (which provides the locale tool): http://lists.debian.org/debian-glibc/2004/12/msg00028.html

That eselect enumerates the locale twice when the alternate form is specified in /etc/env.d/02locale could be considered a minor bug.

--Kerin
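Because locale -a prints the utf8 spelling, a script that checks for a locale written as UTF-8 has to account for the difference. The following is a rough sketch, assuming a POSIX shell; both function names are hypothetical, and the rewriting rule (lowercase the codeset and drop hyphens) only approximates what glibc does internally. Modifiers such as @euro are not handled.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the codeset part of a locale name into the
# form that "locale -a" tends to print (lowercase, hyphens removed).
canon_locale_name() {
    case $1 in
        *.*)
            lang_part=${1%%.*}   # e.g. en_US
            codeset=${1#*.}      # e.g. UTF-8
            codeset=$(printf '%s' "$codeset" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
            printf '%s.%s\n' "$lang_part" "$codeset"
            ;;
        *)
            # Names without a codeset (C, POSIX) pass through unchanged.
            printf '%s\n' "$1"
            ;;
    esac
}

# Succeeds if the locale, given in either spelling, is listed by "locale -a".
locale_is_installed() {
    locale -a 2>/dev/null | grep -qxF "$(canon_locale_name "$1")"
}

canon_locale_name en_US.UTF-8
```

On a system like the one shown above, locale_is_installed en_US.UTF-8 and locale_is_installed en_US.utf8 would be expected to agree, since both are compared against the same en_US.utf8 entry.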
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Tue, Aug 06, 2013 at 02:40:04PM +0100, Kerin Millar wrote:

Apparently, utf8 is the canonical representation in glibc (which provides the locale tool): http://lists.debian.org/debian-glibc/2004/12/msg00028.html

That eselect enumerates the locale twice when the alternate form is specified in /etc/env.d/02locale could be considered as a minor bug.

--Kerin

RFC 3629 does not mention utf8, but I did see this notation in Wikipedia, and yes, I understand that's not official:

Other descriptions that omit the hyphen or replace it with a space, such as utf8 or UTF 8, are not accepted as correct by the governing standards.[14] Despite this, most agents such as browsers can understand them, and so standards intended to describe existing practice (such as HTML5) may effectively require their recognition.

[14] http://www.ietf.org/rfc/rfc3629.txt

I was only mildly curious seeing utf8 show up, because on numerous occasions in #gentoo on FreeNode there have been different reports of incorrect characters displayed with utf8, then fixed with UTF-8. Having read RFC 3629, I just made it a habit to always use the standard (UTF-8).

Having read the remainder of the Debian ML thread you referenced, I have a headache. Debian did that to me when I used it for ~3 months in 2003. :-)

Cheers,
Bruce
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On 06/08/2013 15:26, Bruce Hill wrote:

On Tue, Aug 06, 2013 at 02:40:04PM +0100, Kerin Millar wrote:

Apparently, utf8 is the canonical representation in glibc (which provides the locale tool): http://lists.debian.org/debian-glibc/2004/12/msg00028.html

That eselect enumerates the locale twice when the alternate form is specified in /etc/env.d/02locale could be considered as a minor bug.

--Kerin

RFC 3629 does not mention utf8, but I did see this notation in Wikipedia, and yes, I understand that's not official:

Other descriptions that omit the hyphen or replace it with a space, such as utf8 or UTF 8, are not accepted as correct by the governing standards.[14] Despite this, most agents such as browsers can understand them, and so standards intended to describe existing practice (such as HTML5) may effectively require their recognition.

[14] http://www.ietf.org/rfc/rfc3629.txt

Internally, glibc may use whatever representation it pleases.

I was only mildly curious seeing utf8 show up, because on numerous occasions in #gentoo on FreeNode there have been different reports of incorrect characters displayed with utf8, then fixed with UTF-8. Having read RFC 3629, I just made it a habit to always use the standard (UTF-8).

Probably due to buggy applications. According to a glibc maintainer, they should be using the nl_langinfo() function but some try to read the locale name itself. The response of both of these commands is the same:

# LC_ALL=en_US.UTF-8 locale -k LC_CTYPE | grep charmap
# LC_ALL=en_US.utf8 locale -k LC_CTYPE | grep charmap

Ergo, applications that use the correct interface will be informed that the character encoding is UTF-8, irrespective of the format of the locale name. Given the above, sticking to the lang_territory.UTF-8 format seems wise.

Having read the remainder of the Debian ML thread you referenced, I have a headache. Debian did that to me when I used it for ~3 months in 2003. :-)

Cheers,
Bruce
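The comparison of the two spellings can be scripted as a minimal sketch, assuming a POSIX shell and the locale utility. The function name charmap_of is hypothetical. It asks locale for the charmap keyword directly, which should report the same character map seen in the locale -k LC_CTYPE output above.

```shell
#!/bin/sh
# Print the character map reported for a given locale name.
# Warnings about locales that are not installed are discarded.
charmap_of() {
    LC_ALL=$1 locale charmap 2>/dev/null
}

# Both spellings are expected to report the same character map.
for name in en_US.UTF-8 en_US.utf8; do
    printf '%s -> %s\n' "$name" "$(charmap_of "$name")"
done
```

If en_US.UTF-8 is installed, both lines should end in UTF-8. If it is not installed, glibc falls back to the C locale for both names, so the two results still match but name a different character map.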
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Mon, Aug 5, 2013 at 6:52 PM, Chris Stankevitz chrisstankev...@gmail.com wrote:

On Mon, Aug 5, 2013 at 11:53 AM, Mike Gilbert flop...@gentoo.org wrote:

The handbook documents setting a system-wide default locale. You generally do this by setting the LANG variable in /etc/conf.d/02locale. http://www.gentoo.org/doc/en/handbook/handbook-amd64.xml?part=1&chap=8#doc_chap3_sect3

Mike, Thank you for your help. I attempted to follow these instructions and ran into three problems. Can you please confirm the fixes I employed to deal with each of these issues:

I think the other responses in the thread have this covered, but I will respond anyway.

1. The handbook suggests I should modify the file /etc/env.d/02locale, but that file does not exist on my system. RESOLUTION: create the file

Correct. This file can also be created by using eselect locale.

2. The handbook suggests I should add this line to /etc/env.d/02locale: 'LANG=de_DE.UTF-8', but I do not speak the language DE. RESOLUTION: type instead 'LANG=en_US.UTF-8' to match /etc/locale.gen

Right, the de_DE is just an example. You should select a language/country that matches your lingual ability. :-)

3. The handbook suggests that I should add this line to /etc/env.d/02locale: 'LC_COLLATE=C', but I do not know if they are again talking about the language DE. RESOLUTION: I assumed LC_COLLATE=C refers to English and added the line without modification.

LC_COLLATE specifies how to sort text strings. Setting it to C indicates that you want to sort strings based on the binary (ASCII) value of their characters. Leaving LC_COLLATE unset will cause strings to be sorted according to the normal rules associated with your locale.

For example, given the following strings:

cat
Dog

With LC_COLLATE=C, they are sorted like this, since the binary value of D (68) is less than the value of c (99):

Dog
cat

With LC_COLLATE=en_US.UTF-8, they are sorted like this, since c comes before D in the alphabet:

cat
Dog
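The cat/Dog example can be reproduced with sort, which honours LC_COLLATE. This is a minimal sketch, assuming a POSIX shell; the function names are hypothetical. LC_ALL is cleared for the sort call because, when set, it would take precedence over LC_COLLATE.

```shell
#!/bin/sh
# Sort the two example strings under the collation given as $1.
sort_with_collation() {
    printf 'cat\nDog\n' | LC_ALL= LC_COLLATE="$1" sort
}

# Print the numeric value of a single ASCII character.
byte_value() {
    printf '%d\n' "'$1"
}

byte_value D               # 68
byte_value c               # 99
sort_with_collation C      # Dog, then cat
```

Running sort_with_collation en_US.UTF-8 should print cat before Dog, provided that locale is installed; otherwise the C order is likely to be used as a fallback.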
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Tue, Aug 6, 2013 at 6:04 AM, Kerin Millar kerfra...@fastmail.co.uk wrote:

Run eselect locale, first with the list parameter and then the set parameter as appropriate. It's easier.

Kerin, all, Thanks for your help. SVN (and I'm sure other apps) are happy now.

Chris
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Tue, Aug 6, 2013 at 8:13 AM, Mike Gilbert flop...@gentoo.org wrote:

Leaving LC_COLLATE unset will cause strings to be sorted according to the normal rules associated with your locale.

Mike (or anyone else), For which applications does setting LC_COLLATE affect sorting:

a) Any C++ application that uses bool std::string::operator<(const std::string&)
b) Any C or C++ application that compares char values using the '<' operator
c) Any application that uses the system call CompareStrings(const char*, const char*)
d) [your answer here]

I'm sure the answer is not a or b. I'm sure it's not c either since I just made it up.

Thank you,
Chris
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On 6 August 2013, at 14:04, Kerin Millar wrote: ... If undefined, the value of LC_COLLATE is inherited from LANG. I'm not sure that overriding it is particularly useful nowadays but it doesn't hurt. It's been a couple of years since I looked into this, but I'm given to believe that LANG should set all LC_ variables correctly, and that overriding them is frowned upon. I had to do this myself because, due to a bug, the en_GB time formatting failed to display am or pm. I believe this should be fixed now. Stroller.
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Tue, Aug 6, 2013 at 2:23 PM, Chris Stankevitz chrisstankev...@gmail.com wrote:

On Tue, Aug 6, 2013 at 8:13 AM, Mike Gilbert flop...@gentoo.org wrote:

Leaving LC_COLLATE unset will cause strings to be sorted according to the normal rules associated with your locale.

Mike (or anyone else), For which applications does setting LC_COLLATE affect sorting:

a) Any C++ application that uses bool std::string::operator<(const std::string&)
b) Any C or C++ application that compares char values using the '<' operator
c) Any application that uses the system call CompareStrings(const char*, const char*)
d) [your answer here]

I'm sure the answer is not a or b. I'm sure it's not c either since I just made it up.

From locale(7):

LC_COLLATE
This is used to change the behavior of the functions strcoll(3) and strxfrm(3), which are used to compare strings in the local alphabet. For example, the German sharp s is sorted as ss.
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Mon, Aug 5, 2013 at 2:25 PM, Chris Stankevitz chrisstankev...@gmail.com wrote:

Hello, I am using svn to update a repository. Somebody added files to the repository with weird characters in the filename. SVN refuses to update the repository unless I first:

export LC_CTYPE=en_US.UTF-8

I don't know or really care what that mumbo jumbo means, but I would like an answer to this question: Is my Gentoo system properly set up? If not, what step did I miss that is causing svn to want me to export LC_CTYPE? I suspect either my Gentoo system is messed up or svn is messed up.

Sparing you the details as requested: In general, you want to be using a locale that ends with .UTF-8 to avoid encoding issues with software like python and subversion.

The handbook documents setting a system-wide default locale. You generally do this by setting the LANG variable in /etc/conf.d/02locale. http://www.gentoo.org/doc/en/handbook/handbook-amd64.xml?part=1&chap=8#doc_chap3_sect3
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Mon, Aug 05, 2013 at 02:53:11PM -0400, Mike Gilbert wrote:

On Mon, Aug 5, 2013 at 2:25 PM, Chris Stankevitz chrisstankev...@gmail.com wrote:

Hello, I am using svn to update a repository. Somebody added files to the repository with weird characters in the filename. SVN refuses to update the repository unless I first:

export LC_CTYPE=en_US.UTF-8

I don't know or really care what that mumbo jumbo means, but I would like an answer to this question: Is my Gentoo system properly set up? If not, what step did I miss that is causing svn to want me to export LC_CTYPE? I suspect either my Gentoo system is messed up or svn is messed up.

Sparing you the details as requested: In general, you want to be using a locale that ends with .UTF-8 to avoid encoding issues with software like python and subversion.

The handbook documents setting a system-wide default locale. You generally do this by setting the LANG variable in /etc/conf.d/02locale. http://www.gentoo.org/doc/en/handbook/handbook-amd64.xml?part=1&chap=8#doc_chap3_sect3

Without looking, shouldn't that be /etc/env.d/02locale?
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Mon, Aug 5, 2013 at 2:57 PM, Bruce Hill da...@happypenguincomputers.com wrote:

On Mon, Aug 05, 2013 at 02:53:11PM -0400, Mike Gilbert wrote:

On Mon, Aug 5, 2013 at 2:25 PM, Chris Stankevitz chrisstankev...@gmail.com wrote:

Hello, I am using svn to update a repository. Somebody added files to the repository with weird characters in the filename. SVN refuses to update the repository unless I first:

export LC_CTYPE=en_US.UTF-8

I don't know or really care what that mumbo jumbo means, but I would like an answer to this question: Is my Gentoo system properly set up? If not, what step did I miss that is causing svn to want me to export LC_CTYPE? I suspect either my Gentoo system is messed up or svn is messed up.

Sparing you the details as requested: In general, you want to be using a locale that ends with .UTF-8 to avoid encoding issues with software like python and subversion.

The handbook documents setting a system-wide default locale. You generally do this by setting the LANG variable in /etc/conf.d/02locale. http://www.gentoo.org/doc/en/handbook/handbook-amd64.xml?part=1&chap=8#doc_chap3_sect3

Without looking, shouldn't that be /etc/env.d/02locale?

Yes. Or /etc/locale.conf if you're on systemd.
Re: [gentoo-user] export LC_CTYPE=en_US.UTF-8
On Mon, Aug 5, 2013 at 11:53 AM, Mike Gilbert flop...@gentoo.org wrote:

The handbook documents setting a system-wide default locale. You generally do this by setting the LANG variable in /etc/conf.d/02locale. http://www.gentoo.org/doc/en/handbook/handbook-amd64.xml?part=1&chap=8#doc_chap3_sect3

Mike, Thank you for your help. I attempted to follow these instructions and ran into three problems. Can you please confirm the fixes I employed to deal with each of these issues:

1. The handbook suggests I should modify the file /etc/env.d/02locale, but that file does not exist on my system. RESOLUTION: create the file

2. The handbook suggests I should add this line to /etc/env.d/02locale: 'LANG=de_DE.UTF-8', but I do not speak the language DE. RESOLUTION: type instead 'LANG=en_US.UTF-8' to match /etc/locale.gen

3. The handbook suggests that I should add this line to /etc/env.d/02locale: 'LC_COLLATE=C', but I do not know if they are again talking about the language DE. RESOLUTION: I assumed LC_COLLATE=C refers to English and added the line without modification.

Thank you again for your help,
Chris
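Taken together, the three resolutions above would leave a file roughly like the following. This is a sketch of the expected contents, assuming the en_US.UTF-8 locale is listed in /etc/locale.gen and has been generated; the quoting style is a convention, not something stated in the thread.

```shell
# /etc/env.d/02locale
# System-wide default locale; every LC_* category inherits from LANG
# unless it is set explicitly.
LANG="en_US.UTF-8"
# Optional override: keep byte-value (C/POSIX) sort order.
LC_COLLATE="C"
```

On Gentoo the environment is normally regenerated afterwards with env-update && source /etc/profile, and new login sessions pick up the values from there.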