On Sun, 30 Jan 2005, Roger Leigh wrote:
> Marco d'Itri <[EMAIL PROTECTED]> writes:
> > [EMAIL PROTECTED] wrote:
> >>I think the locales package is the place to start this. For etch, I
> >>would like the UTF-8 locales to be the default for all languages (with
> > This would be stupid, pointless and would piss off a lot of people.
>
> Please could you explain why?

Do your homework about Unicode and locales. Hints for the googling: Unicode
CJK unification problems.

Also, I can assure you that 80% of the mail I see getting through the mail
servers I admin is either latin-1 encoded, or in that Windows CP1252
monstrosity (often mistagged as latin-1). Too much of it has no charset
declaration at all, since too many people use extremely crappy software.
It is even worse for web pages.

> > But since your native language is english I suppose that it may be
> > hard to you to understand the reason for this.
>
> Please could you explain why English is different?

ASCII, and the fact that most other charsets are backwards-compatible with
ASCII for the first 128 codepoints. Try living in an EBCDIC world for a
small while, even if you only use English. You will understand quite fast.

> When I made the transition myself, I had to recode a number of files
> to UTF-8 from the local encoding I was using previously (ISO-8859-1).
> How does this differ for other languages and encodings?

It doesn't, really. Not in that way. The problem is usually data exchange,
and, for CJK countries, that they often need extra language tagging which
is not available in Unicode, but which IS implied by the other charsets.
For XML documents, this is easy (if rather troublesome) to fix. For regular
text files, well...

> Why? It's an undeniable fact that there is a cost associated with the
> migration, but to avoid the migration will not be of long term benefit
> to users of those locales.

You are not in a position to know that yet, IMHO. Do some research, and
then we can continue arguing if you still believe a UTF-8 default locale
for all countries is a good idea.

> emails without a specific charset which are not plain ASCII are most
> likely broken in the first place. It's not our place to work around

They ARE broken, according to the MIME standard. But there are too many of
them.
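
To make the ASCII-compatibility and mistagging points concrete, here is a minimal Python sketch. The sample strings are made up for illustration, and code page 037 is just one EBCDIC variant picked as an example; none of this comes from the original message.

```python
# Illustrative sketch only: the sample strings are invented for this example.
ascii_text = "Hello, world"

# ASCII-compatible charsets encode plain ASCII text to identical bytes,
# which is why English-only text survives most charset mix-ups.
for charset in ("latin-1", "cp1252", "utf-8"):
    assert ascii_text.encode(charset) == ascii_text.encode("ascii")

# EBCDIC (code page 037 here, chosen as an example) is not ASCII-compatible.
ebcdic_bytes = ascii_text.encode("cp037")
assert ebcdic_bytes != ascii_text.encode("ascii")

# Non-ASCII text is where the charsets diverge.
word = "ação"
latin1_bytes = word.encode("latin-1")  # one byte per character
utf8_bytes = word.encode("utf-8")      # two bytes each for 'ç' and 'ã'

# Recoding ISO-8859-1 data to UTF-8: decode with the old charset,
# then encode with the new one.
recoded = latin1_bytes.decode("latin-1").encode("utf-8")
assert recoded == utf8_bytes

# CP1252 mistagged as latin-1: byte 0x93 is a curly quote in CP1252,
# but decodes to an invisible C1 control character under latin-1.
cp1252_bytes = "\u201cquoted\u201d".encode("cp1252")
mistagged = cp1252_bytes.decode("latin-1")
correct = cp1252_bytes.decode("cp1252")
```

The mistagged decode does not raise an error, which is part of why such mail goes unnoticed until someone looks at the rendered text.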
If I start rejecting mail with anything non-ASCII in the headers, which has
also been illegal since RFC 822, I stop about 30% of the email flow (and no,
most of it is NOT spam). That should give you an idea of the state of
things in Brazil. I imagine that in many other non-ASCII countries, things
are just as bad, if not worse.

Heck, I keep rejecting emails with embedded NULs and more than 8192
characters per line, which has been unacceptable since RFC 822, when the
email world was young and there was no spam.

> This goes against the general long-term plans for GNU/Linux i18n/l10n,
> since UTF-8 is intended to unify the locale encodings, not to
> perpetuate their mutual incompatibilities.

For many of us, that does not fit the current reality for the system
locale. Maybe in a few years.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
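
The three checks mentioned in this message (non-ASCII bytes in headers, embedded NULs, overlong lines) could be sketched roughly as follows. This is only an illustration, not how any particular mail server implements them: `find_problems` is a hypothetical helper, and the 8192 limit is simply the figure quoted in the message.

```python
MAX_LINE_LENGTH = 8192  # the per-line figure quoted in the message


def find_problems(raw_message: bytes) -> list[str]:
    """Hypothetical helper: list the problems found in a raw message."""
    problems = []

    # Embedded NUL bytes anywhere in the message.
    if b"\x00" in raw_message:
        problems.append("embedded NUL")

    # Normalize CRLF to LF, then split into lines.
    lines = raw_message.replace(b"\r\n", b"\n").split(b"\n")
    if any(len(line) > MAX_LINE_LENGTH for line in lines):
        problems.append("line too long")

    # The header section ends at the first empty line.
    header_lines = []
    for line in lines:
        if line == b"":
            break
        header_lines.append(line)

    # Raw 8-bit data in headers (bodies are deliberately not checked here).
    if any(byte > 127 for line in header_lines for byte in line):
        problems.append("non-ASCII header")

    return problems
```

For example, a message whose Subject line carries raw latin-1 bytes would be reported as `["non-ASCII header"]`, while the same bytes in the body alone would pass this particular check.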

