On Thu, 2002-05-02 at 10:31, Dan Winship wrote:
> [moving from evolution-patches to evolution-hackers]
>
> Giving the user a choice could work. We can't *just* autodetect based on
> the UTF8. In a string like "The character for the word 'one' is
> <<U+4E00>>", the last character could be Japanese, Simplified Chinese,
> or Traditional Chinese (or even Korean sometimes?).
>
> Is there any way for the composer to know whether the user is using a
> Japanese or Chinese input method? (And are there separate traditional
> and simplified chinese input methods?)

I don't think so, no.

>
> And what about cut+paste? If you paste characters from a Big5 web page,
> does the composer know that or does it only get UTF8?

Again, no. It just gets UTF-8 afaik.

Let me remind you that this is for header encoding, not necessarily
meant for encoding message bodies. We already have code that works for
message bodies in the composer. Granted, if we can come up with some
logic that will allow the user some preference over what charset has a
higher priority, etc - then maybe we can just use camel_charset_best()
for the message bodies as well?

It sounds like most of your arguments are based on the belief that this
will be used for message bodies, which is not where I intended this to
go necessarily (although it might simplify things if we could?).

NotZed was thinking that maybe we could generate the charset map table
at runtime based on some user ordering of the charsets, or at least
allow the locale charset to have priority over the current charsets used
in camel's charset table.

The one thing I see as maybe a problem with this approach is that it
seems some users are in an iso-8859-1 locale but want to be able to
write japanese or whatever. Now what? For this particular message he'd
probably want iso-2022-jp to have priority, whereas his locale is
iso-8859-1 (and maybe even most of the time he'd prefer iso-8859-1 had
priority).

Okay, maybe this particular example is a bad one, let's pretend the
locale is some asian charset and he wants to compose sometimes in
another asian charset. This is probably more complicated than the
iso-8859-1/iso-2022-jp case because iso-8859-1 is not a multibyte
charset and obviously iso-8859-1 should always have priority over
iso-2022-jp (for the sake of interoperability with a wider variety of
mail clients).

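To make the ordering question concrete, here is a rough sketch in
Python. It is not Camel code: build_order() and best_charset() are
hypothetical names, and it brute-forces a conversion per charset instead
of using a lookup table, purely to keep the example short. It assumes
iso-8859-1 always goes first, then the locale charset, then whatever the
user picked, with UTF-8 as the last resort.

```python
def build_order(locale_charset, user_charsets=()):
    """Return candidate charsets in priority order, UTF-8 always last."""
    order = []
    for name in ("iso-8859-1", locale_charset, *user_charsets):
        key = name.lower()
        # Skip duplicates; UTF-8 is appended once at the very end.
        if key != "utf-8" and key not in order:
            order.append(key)
    order.append("utf-8")
    return order


def best_charset(text, order):
    """Return the first charset in `order` that can encode all of `text`."""
    for name in order:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        return name
    return "utf-8"


# An iso-8859-1 locale where the user also listed two Asian charsets.
order = build_order("ISO-8859-1", ["iso-2022-jp", "euc-kr"])
for sample in ("hello", "日本語", "한국어"):
    print(repr(sample), "->", best_charset(sample, order))
```

With an ordering like this, the result for an ambiguous character such
as U+4E00 depends only on which of the user's charsets is listed first,
which is the case where a per-user preference would actually matter.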
My guess is that order of preference will have to be something like:

iso-8859-1 (no need to put this in the table)
iso-8859-2
iso-8859-4
koi8-r
koi8-u
iso-8859-5
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-13
iso-8859-15
windows-cp1251
user-defined
user-defined
user-defined
user-defined
...
UTF-8 (no need for this to be in the table either)

Now, what happens if a user chooses an 8bit charset? Do we somehow
re-prioritise? How can we? Maybe we should expand that table to include
all the 8bit charsets that users are likely to care about (do we already
have this? what charsets do we add if we don't?) and then make it so
that user-defined charsets can only be multibyte charsets?

Maybe I'm making this more complicated than it needs to be... I would
just prefer to use a table like this rather than having to attempt to
iconv() to a ton of different charsets like we do in the composer. It's
just a very expensive process to have to do that.

danw: question for you. You said that greek and russian could be
expressed in iso-2022. But if russian and greek have a higher priority
than iso-2022, then why would this be a problem? I'm guessing that you
mean only if greek and russian text appear together. However, if they
are expressed together and we do mistakenly detect them as iso-2022,
then wouldn't they still decode back to greek and russian glyphs? Or
would converting the greek/russian glyphs from UTF-8 to iso-2022 destroy
them and produce garbage iso-2022 glyphs? If the resultant iso-2022
encoded string can be converted back to UTF-8 while still preserving the
greek and russian chars, then does it really matter?

No matter what we do, we run the risk of encoding it to the "wrong"
charset, even if we were to always check locale first etc, because it's
possible that the user is replying to a message composed in a
different/incompatible charset and so we wouldn't be able to encode to
the user's locale.

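The round-trip question can at least be checked mechanically. A minimal
sketch in Python, using its bundled codecs rather than iconv() (so other
converters may behave differently); survives_round_trip() is a
hypothetical helper, not something in Camel. JIS X 0208, the character
set behind iso-2022-jp, has rows for unaccented Greek and Cyrillic
letters, so I would expect mixed text like the sample below to survive.
That says nothing about whether the receiving client would display it
the way the sender intended.

```python
def survives_round_trip(text, charset):
    """True if `text` encodes to `charset` and decodes back unchanged."""
    try:
        encoded = text.encode(charset)
    except UnicodeEncodeError:
        # At least one character has no representation in this charset.
        return False
    return encoded.decode(charset) == text


# Unaccented Greek and Russian letters together in one string.
mixed = "αβγ привет"
for charset in ("iso-8859-7", "koi8-r", "iso-2022-jp", "utf-8"):
    print(charset, survives_round_trip(mixed, charset))
```

Neither of the single-script 8bit charsets can hold both alphabets, so
for this kind of text the real choice is between iso-2022-jp and UTF-8.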
Anyways, the reason why this whole charset issue was brought up again is
because we currently *always* encode asian text as UTF-8 in headers for
outgoing messages. This is apparently a problem because a lot of mail
clients (including Outlook 6 - which is part of Office 2002?!) still
don't understand UTF-8.

Jeff

>
> -- Dan
>
> On Wed, 2002-05-01 at 21:28, Not Zed wrote:
> > Yes we need this code, as we needed it when it was written.
> >
> > If nothing else, we could potentially use it to offer the user a choice
> > (as emacs does), or use it to determine if the user's locale charset is a
> > valid option, or even for things like autodetecting unknown data (using
> > locale as a hint).
> >
> > The code is priority based at least. So you just order the super-meta
> > charsets last, so they won't be chosen for normal text, and maybe even
> > special case them based on locale so utf8 is usually preferred.
> >
> > On Wed, 2002-05-01 at 21:42, Dan Winship wrote:
> > > > Order of preference seems to be iso-2022-jp, Shift-JIS, and then euc-jp
> > > > but neither Shift-JIS nor euc-jp are liked very much. They seem to only
> > > > be common in the US for example.
> > > >
> > > > Korean users tend to prefer euc-kr over iso-2022-kr.
> > >
> > > Do the character sets actually contain vastly different data? Will
> > > Shift-JIS, euc-jp, or iso-2022-kr ever get chosen?
> > > For that matter, will the Chinese charsets ever get autodetected or will
> > > it always use the Japanese ones instead (at least for messages
> > > containing only reasonably common characters)?
> > >
> > > Also, does this patch address the issue that a message containing both
> > > Greek and Russian *can* be encoded in iso-2022, but *should* be encoded
> > > in UTF8?
> > >
> > > What problem exactly is this supposed to be solving?
> > > If you want to
> > > autodetect Asian charsets for people who aren't replying to an
> > > Asian-language message and don't have an Asian locale, I don't think
> > > this will work.
> > >
> > > Heuristics that might work are "if it contains Korean characters (which
> > > are all in a certain range in Unicode), try EUC-KR", "if it contains
> > > Japanese hiragana/katakana (likewise), try iso-2022-jp", and "if it
> > > contains unihan characters but not kana, it's probably Chinese". I don't
> > > think you can autoselect between traditional and simplified Chinese
> > > charsets based on a UTF8 input stream though.
> > >
> > > -- Dan
> > >
> > >
> > > _______________________________________________
> > > Evolution-patches maillist - [EMAIL PROTECTED]
> > > http://lists.ximian.com/mailman/listinfo/evolution-patches
> > >
> >
> _______________________________________________
> evolution-hackers maillist - [EMAIL PROTECTED]
> http://lists.ximian.com/mailman/listinfo/evolution-hackers
