>I am in no way an expert on this.  But, I won't let that stop me.

Welcome to the club!  I think we're all in the same boat in that regard.

>It seems to me that the only solution is to use Unicode internally.
>Disgusting as it seems to those of us who are old enough to hoard
>bytes, we might want to consider using something other than UTF-8
>for the internal representation.  Using UTF-16 wouldn't be horrible
>but I recall that the Unicode folks made a botch of things so that
>one really needs 24 bits now, which really means using 32 internally.

AFAICT ... there is probably no advantage in using UTF-16 or UTF-32
versus UTF-8.  People might think that you gain something because with
UTF-16 two bytes == 1 character.  But that's only true for things in the
Basic Multilingual Plane, and people are now telling us 🖕 because they
want to send emoji in email which are NOT part of the BMP, which means
we have to start dealing with 💩 like surrogate pairs.  And really, even
with just the BMP, combining characters toss that idea out of the window.

UTF-32 lets you say 4 bytes == 1 character ... but do we care about
'characters' or 'column positions'?

So given that, I think sticking with UTF-8 is preferable; it has the
nice property that we can represent text as C strings and it's just
ASCII if you're living in a 7-bit world.

>On the output side, we just have to do the best we can if characters in
>the input locale can't be represented in the output locale.  This is
>independent of the internal representation.

Well, this works great if your locale is UTF-8.  But ... what happens if
your email address contains UTF-8, and your locale setting is ISO-8859-1?

--Ken

_______________________________________________
Nmh-workers mailing list
https://lists.nongnu.org/mailman/listinfo/nmh-workers
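
To put some numbers on the "characters vs. column positions" question,
here is a minimal Python sketch.  It is not nmh code: the measure()
helper and the sample strings are made up for illustration, and the
column count is only a rough estimate (combining marks count as 0
columns, East Asian wide/fullwidth characters as 2, everything else
as 1), not what a real terminal or wcwidth() would necessarily report.

```python
import unicodedata

def measure(text):
    """Return several competing notions of the 'length' of text."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")   # -le so no BOM is counted
    utf32 = text.encode("utf-32-le")
    # Rough column estimate (an assumption, see the note above).
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                   # combining mark: no column of its own
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        columns += 2 if wide else 1
    return {
        "code_points": len(text),
        "utf8_bytes": len(utf8),
        "utf16_units": len(utf16) // 2,
        "utf32_bytes": len(utf32),
        "columns": columns,
    }

samples = {
    "ascii": "Ken",
    "combining": "e\u0301",   # 'e' + COMBINING ACUTE ACCENT, both in the BMP
    "emoji": "\U0001F4A9",    # PILE OF POO, outside the BMP
}
for name, text in samples.items():
    print(name, measure(text))
```

The combining sample is two code points (and two UTF-16 units, eight
UTF-32 bytes) but occupies a single column, while the emoji is one code
point that needs a surrogate pair in UTF-16.  None of the fixed-width
encodings gives a unit count that matches what the user sees.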
