>Ken Hornstein wrote:
>> Even if it can, I am unsure we can maintain
>> the correct column position when dealing with things like combining
>> characters.
>
>That is possible. wcwidth() returns 0 for combining characters.
As I learned the hard way, that is NOT necessarily true. Although the problem
there was in older versions of MacOS X, and newer versions have fixed it, so
that's a problem which is going away.

But ... let's put that quote into context. I was speaking of the case where
the internal data representation is UTF-8, but the user has a non-UTF-8
locale (let's say ISO-8859-1). You can't use wcwidth() in this context
because it wants to work on the current locale. Okay, that's technically not
true; it works on whatever you give to setlocale(). But as I explained
before, it's not practically possible to pick a UTF-8 locale if you're not
already in one, because a) the locale names are not standardized, and b) you
don't know whether a particular locale supports UTF-8 or not.

And it's even more confusing than that. Let's assume that we could use
xlocale, or change the whole process locale to a UTF-8 locale. We then
calculate the complete character width. But what happens when we convert
that with iconv() to the native character set? If we do the usual
substitution for invalid characters, like '?', then what happens when we run
into a combining character? Does the combining character end up as a '?' ?
If so, that messes up the length calculation.

>Do we have any specific cases where forcing a UTF-8 assumption actually
>helps? The POSIX API is clumsy but the fact that it deals in the current
>locale rather than UTF-8 doesn't make much difference. The code needs an
>API to know stuff like how wide a string is. Knowing you have a UTF-8
>encoding doesn't really gain you anything.

Well, I think it helps in two cases:

1) You have a UTF-8 locale.
2) You don't have a UTF-8 locale, but you still want to output UTF-8.

For 1) it helps a little; for 2) it helps more. But ... I think people in
case 2) are wrong to have their systems set up that way. I mean, seriously
... you're telling your operating system that you only support ASCII, but
you want us to output UTF-8 anyway?
How does that even make any sense?

>I think it'd be better to focus on real features. So if you want, for
>example, character substitution on conversion failure and libunistring
>helps then configure can check for it and disable the feature if it
>isn't found. As an aside, that particular feature only sounds useful if
>you're actually using a non-UTF-8 locale.

Well ... I am reluctant to make that optional. At this point character
conversion really isn't an optional feature in a MUA. I know, some people
foolishly disable iconv support in nmh (partially because of their lousy
settings, partially because of our bugs). But really, you're expected to be
able to handle different character sets at this point, which means you need
to be able to convert them. iconv is a POSIX API; it's not perfect by any
means, but at least it works and is widely supported.

Supporting two or three codepaths (one without any character conversion, one
with iconv, one with libunistring) seems like a bad idea to me, especially
when it's part of core functionality. I'm fine with optional things like TLS
and SASL support (well, maybe those aren't so optional anymore in practice),
but everyone needs character conversion nowadays. I'd rather pick one option
and stick with it. If it's iconv, great. If it's libicu/libunistring, great.

Now, as for the idea of focusing on features: yes, I completely agree that's
important! But the decisions we make now about internals really do matter
for how we implement those features. I don't really see broad disagreement
on the features; it's more along the lines of 'how do we get there?'.

>Given that nmh is BSD licenced, I'd probably favour libicu over
>libunistring just for its licence. Checking on a Debian system, neither
>have vast numbers of reverse dependencies.

libicu/libunistring are great if you need to manipulate UTF-8 strings. My
issue is: I am not clear that's necessary for us.

So, what was the point of all this?
I guess for once, rather than fumbling around and glomming on some MIME
support later, it would make sense to sit down, figure out how we want nmh
to work, and then make that happen.

Right now I have a SLIGHT lean toward having the format engine represent
stuff in the native character set. But this isn't perfect, and let me give
you an example why. Let's say someone sends you an email that contains UTF-8
in the real-name field, with a character that exists only in Unicode. This
email is NOT encoded using RFC-2047, but is simple bare UTF-8 (which is now
permitted as part of the new email RFCs). If your locale is ISO-8859-1, or
even worse, ASCII (seriously, WHAT THE HELL, PEOPLE?!?? It's 2015!!!), then
converting the name to the local character set means you lose characters in
their name ... and that seems terrible to me.

We could do something like convert to RFC-2047 encoding in that case. But
what if the email address itself contains UTF-8? We can't do RFC-2047
encoding in that case. Hm, I think I just talked myself into a slight lean
toward having the format engine be UTF-8 internally.

Sigh; the bottom line is that there are no good answers. It would be helpful
if people could suggest what they expect/want to happen if they receive a
message/global email while using a non-UTF-8 locale. "Shit breaks" is an
acceptable answer :-)

--Ken

_______________________________________________
Nmh-workers mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/nmh-workers
