Re: [Evolution-hackers] Re: charset foo

Jeffrey Stedfast Thu, 02 May 2002 08:03:44 -0700

On Thu, 2002-05-02 at 10:31, Dan Winship wrote:
> [moving from evolution-patches to evolution-hackers]
> 
> Giving the user a choice could work. We can't *just* autodetect based on
> the UTF8. In a string like "The character for the word 'one' is
> <<U+4E00>>", the last character could be Japanese, Simplified Chinese,
> or Traditional Chinese (or even Korean sometimes?).
> 
> Is there any way for the composer to know whether the user is using a
> Japanese or Chinese input method? (And are there separate traditional
> and simplified chinese input methods?)

I don't think so, no.

> 
> And what about cut+paste? If you paste characters from a Big5 web page,
> does the composer know that or does it only get UTF8?

Again, no. It just gets UTF-8 afaik.

Let me remind you that this is for header encoding, not necessarily
meant for encoding message bodies. We already have code that works for
message bodies in the composer. Granted, if we can come up with some
logic that will allow the user some preference over what charset has a
higher priority, etc - then maybe we can just use camel_charset_best()
for the message bodies as well?

It sounds like most of your arguments are based on the belief that this
will be used for message bodies, which is not where I intended this to
go necessarily (although it might simplify things if we could?).

NotZed was thinking that maybe we could generate the charset map table
at runtime based on some user ordering of the charsets, or at least allow
the locale charset to have priority over the current charsets used in
camel's charset table.

The one thing I see as maybe a problem with this approach is that it
seems some users are in an iso-8859-1 locale but want to be able to
write Japanese or whatever. Now what? For this particular message he'd
probably want iso-2022-jp to have priority, whereas his locale is
iso-8859-1 (and maybe even most of the time he'd prefer iso-8859-1 had
priority).

Okay, maybe this particular example is a bad one. Let's pretend the locale
is some Asian charset and he sometimes wants to compose in another Asian
charset. This is probably more complicated than the
iso-8859-1/iso-2022-jp case, because iso-8859-1 is not a multibyte
charset and obviously iso-8859-1 should always have priority over
iso-2022-jp (for the sake of interoperability with a wider variety of
mail clients).

My guess is that order of preference will have to be something like:

iso-8859-1 (no need to put this in the table)

iso-8859-2
iso-8859-4
koi8-r
koi8-u
iso-8859-5
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-13
iso-8859-15
windows-cp1251
user-defined
user-defined
user-defined
user-defined
...

UTF-8  (no need for this to be in the table either)
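
To make the table idea concrete, here is a rough sketch in Python (not
camel's actual code). It builds the map at runtime from an ordered list:
each code point gets a bitmask of the charsets that can represent it, the
masks are ANDed across the string, and the lowest surviving bit wins. The
Python codec names, the two stand-in "user-defined" entries (iso2022_jp,
euc_kr) and the function names are all assumptions made for illustration.

```python
# Priority order taken from the list above; Python codec names stand in
# for the charsets. The last two entries play the role of "user-defined"
# multibyte charsets and are purely an assumption.
PRIORITY = [
    "iso8859_2", "iso8859_4", "koi8_r", "koi8_u", "iso8859_5",
    "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_13", "iso8859_15",
    "cp1251",
    "iso2022_jp", "euc_kr",
]


def build_table(priority):
    """Map each non-ASCII BMP code point to a bitmask of usable charsets."""
    table = {}
    for bit, name in enumerate(priority):
        for cp in range(0x80, 0x10000):
            try:
                chr(cp).encode(name)
            except UnicodeError:
                continue  # this charset cannot represent the code point
            table[cp] = table.get(cp, 0) | (1 << bit)
    return table


def best_charset(text, priority, table):
    """Pick the highest-priority charset that covers every character."""
    if all(ord(c) < 0x80 for c in text):
        return "us-ascii"
    if all(ord(c) < 0x100 for c in text):
        return "iso-8859-1"  # always first, so it is not in the table
    mask = (1 << len(priority)) - 1
    for c in text:
        if ord(c) >= 0x80:
            mask &= table.get(ord(c), 0)
    if mask == 0:
        return "utf-8"  # always last, so it is not in the table either
    lowest_bit = (mask & -mask).bit_length() - 1
    return priority[lowest_bit]


TABLE = build_table(PRIORITY)  # one-off cost, paid once at startup
for sample in ("Привет", "αβγ", "αβγ Привет", "日本語"):
    print(sample, "->", best_charset(sample, PRIORITY, TABLE))
```

With this ordering, mixed Greek and Russian text ends up in the first
multibyte charset that happens to cover both, which is exactly the case
danw is worried about. Re-prioritising for a user only means rebuilding
the table with a different list.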

Now, what happens if a user chooses an 8bit charset? Do we somehow
re-prioritise? How can we? Maybe we should expand that table to include
all the 8bit charsets that users are likely to care about (do we already
have this? What charsets do we add if we don't?) and then make it so
that user-defined charsets can only be multibyte charsets?

Maybe I'm making this more complicated than it needs to be...

I would just prefer to use a table like this rather than having to
attempt to iconv() to a ton of different charsets like we do in the
composer. It's just a very expensive process to have to do that.
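
For comparison, the try-each-charset approach looks roughly like the
following sketch. It uses Python's built-in codecs in place of iconv(),
and the function name and candidate list are made up for illustration.
Every candidate that fails costs a full conversion attempt over the
string, which is where the expense comes from.

```python
def first_encodable(text, candidates):
    """Return the first candidate charset that can encode all of text."""
    for name in candidates:
        try:
            text.encode(name)  # a full conversion attempt per candidate
        except UnicodeEncodeError:
            continue
        return name
    return "utf-8"  # fallback when nothing else fits


candidates = ["iso8859_1", "iso8859_2", "koi8_r", "cp1251"]
print(first_encodable("Привет", candidates))
```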

danw: question for you. You said that Greek and Russian could be
expressed in iso-2022. But if the Russian and Greek charsets have a higher
priority than iso-2022, then why would this be a problem? I'm guessing that
you mean only if Greek and Russian text appear together. However, if they
are expressed together and we do mistakenly detect them as iso-2022,
then wouldn't they still decode back to Greek and Russian glyphs? Or
would converting the Greek/Russian glyphs from UTF-8 to iso-2022 destroy
them and produce garbage iso-2022 glyphs? If the resultant iso-2022
encoded string can be converted back to UTF-8 while still preserving the
Greek and Russian chars, then does it really matter?
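
The round-trip part of that question is easy to check. The sketch below
uses Python's iso2022_jp codec as a stand-in for iconv (an assumption; the
two implementations may differ in detail) and only unaccented letters,
since JIS X 0208 carries the basic Greek and Cyrillic alphabets only.

```python
text = "αβγ Привет"

# Encode to ISO-2022-JP and back again.
encoded = text.encode("iso2022_jp")
decoded = encoded.decode("iso2022_jp")

print(encoded)
print(decoded == text)
```

This only shows whether the characters survive the conversion. It says
nothing about whether the recipient's mail client picks sensible fonts
for Greek and Russian inside a message labelled as Japanese.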

No matter what we do, we run the risk of encoding it to the "wrong"
charset. That's true even if we were to always check the locale first,
because it's possible that the user is replying to a message composed in
a different/incompatible charset, and so we wouldn't be able to encode to
the user's locale.

Anyways, the reason why this whole charset issue was brought up again is
because we currently *always* encode Asian text as UTF-8 in headers
for outgoing messages. This is apparently a problem because quite a few
mail clients (including Outlook 6 - which is part of Office 2002?!)
still don't understand UTF-8.
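
For reference, this is what the two choices look like on the wire for a
header value. The sketch uses Python's email.header module rather than
camel, purely to illustrate the RFC 2047 encoded-word form; the sample
subject is made up.

```python
from email.header import Header, decode_header

subject = "日本語"

# The same header text as an RFC 2047 encoded-word in two charsets.
as_utf8 = Header(subject, "utf-8").encode()
as_2022 = Header(subject, "iso-2022-jp").encode()
print(as_utf8)
print(as_2022)

# A client that knows the charset named in the encoded-word can decode it.
raw, charset = decode_header(as_2022)[0]
print(raw.decode(charset) == subject)
```

A client that doesn't know the charset named in the encoded-word has no
way to display it, which is the complaint about UTF-8 headers above.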

Jeff

> 
> -- Dan
> 
> On Wed, 2002-05-01 at 21:28, Not Zed wrote:
> > Yes we need this code, as we needed it when it was written.
> > 
> > If nothing else, we could potentially use it to offer the user a choice
> > (as emacs does), or use it to determine if the user's locale charset is a
> > valid option, or even for things like autodetecting unknown data (using
> > locale as a hint).
> > 
> > The code is priority based at least.  So you just order the super-meta
> > charsets last, so they won't be chosen for normal text, and maybe even
> > special case them based on locale so utf8 is usually preferred.
> > 
> > On Wed, 2002-05-01 at 21:42, Dan Winship wrote:
> > > > Order of preference seems to be iso-2022-jp, Shift-JIS, and then euc-jp
> > > > but neither Shift-JIS nor euc-jp are liked very much. They seem to only
> > > > be common in the US for example.
> > > >
> > > > Korean users tend to prefer euc-kr over iso-2022-kr.
> > > 
> > > Do the character sets actually contain vastly different data? Will 
> > > Shift-JIS, euc-jp, or iso-2022-kr ever get chosen?
> > > For that matter, will the Chinese charsets ever get autodetected or will 
> > > it always use the Japanese ones instead (at least for messages 
> > > containing only reasonably common characters)?
> > > 
> > > Also, does this patch address the issue that a message containing both 
> > > Greek and Russian *can* be encoded in iso-2022, but *should* be encoded 
> > > in UTF8?
> > > 
> > > What problem exactly is this supposed to be solving? If you want to 
> > > autodetect Asian charsets for people who aren't replying to an 
> > > Asian-language message and don't have an Asian locale, I don't think 
> > > this will work.
> > > 
> > > Heuristics that might work are "if it contains Korean characters (which 
> > > are all in a certain range in Unicode), try EUC-KR", "if it contains 
> > > Japanese hiragana/katakana (likewise), try iso-2022-jp", and "if it 
> > > contains unihan characters but not kana, it's probably Chinese". I don't 
> > > think you can autoselect between traditional and simplified Chinese 
> > > charsets based on a UTF8 input stream though.
> > > 
> > > -- Dan
> > > 
> > > 
> > > _______________________________________________
> > > Evolution-patches maillist  -  [EMAIL PROTECTED]
> > > http://lists.ximian.com/mailman/listinfo/evolution-patches
> > 
> 
> 
> _______________________________________________
> evolution-hackers maillist  -  [EMAIL PROTECTED]
> http://lists.ximian.com/mailman/listinfo/evolution-hackers
> 
