Hello.

shwaresyst wrote in
 <1494661216.220561.1641574109...@mail.yahoo.com>:

[i resort a bit]

| On Thu, Jan 6, 2022 at 3:40 PM, Steffen Nurpmeso via austin-group-l
| at The Open Group<austin-group-l@opengroup.org> wrote:
| Hello!
|
|I wonder about POSIX.utf-?8, i tried to remember any statement
|i had read, and Mantis did not show up results.
|
|In particular i am interested in whether LC_CTYPE results will
|bring true Unicode support or not, the reason i am asking is that
|the upcoming version of my work-box GNU LibC-based (2.34) Linux
|distribution will provide it like
|
|  localedef -i POSIX -f UTF-8 $PKG/usr/lib/locale/C.UTF-8 2> /dev/null \
|    || true
|
|and then this thing is detected as an UTF-8 locale, but causes
|three test failures of the MUA i maintain because character set
|conversion behaves differently.
|
|My personal opinion was that POSIX.utf8 will bring the complete
|range of Unicode characters to at least LC_CTYPE, i wonder about
|LC_COLLATE, as language matching is, hm, very language specific.
|The rest not (maybe LC_MESSAGES going for UTF-8 though).
|
|Is that approximately correct?
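
The question quoted above (a locale that is detected as UTF-8, yet
may not bring Unicode-aware LC_CTYPE) can be probed from C.  What
follows is only a minimal sketch under assumptions: the names
`ctype_probe`, `codeset_is_utf8` and `probe_ctype` are hypothetical
helpers, not part of any standard or of the MUA mentioned, and U+00E9
is an arbitrary test character.  It uses only the standard
setlocale(), nl_langinfo(CODESET), mbrtowc() and iswalpha() calls.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>
#include <wchar.h>
#include <wctype.h>

/* Result of probing one locale name (hypothetical helper type). */
struct ctype_probe {
    bool available;        /* setlocale() accepted the name */
    bool utf8_codeset;     /* nl_langinfo(CODESET) names UTF-8 */
    bool non_ascii_alpha;  /* U+00E9 decodes and counts as a letter */
};

/* True if a codeset name denotes UTF-8, tolerating the "utf8" spelling. */
static bool codeset_is_utf8(const char *cs)
{
    return cs != NULL &&
        (strcasecmp(cs, "UTF-8") == 0 || strcasecmp(cs, "UTF8") == 0);
}

/* Switch LC_CTYPE to `name` and inspect what the locale really offers.
 * Note: this changes the process-wide LC_CTYPE setting. */
static struct ctype_probe probe_ctype(const char *name)
{
    struct ctype_probe r = { false, false, false };
    static const char e_acute[] = "\xC3\xA9"; /* U+00E9 encoded as UTF-8 */
    mbstate_t st;
    wchar_t wc;

    if (setlocale(LC_CTYPE, name) == NULL)
        return r;                      /* locale not installed */
    r.available = true;
    r.utf8_codeset = codeset_is_utf8(nl_langinfo(CODESET));

    /* Decode with the locale's own multibyte machinery, then classify. */
    memset(&st, 0, sizeof st);
    if (mbrtowc(&wc, e_acute, 2, &st) == 2)
        r.non_ascii_alpha = iswalpha((wint_t)wc) != 0;
    return r;
}
```

A caller could run `probe_ctype("C.UTF-8")` next to, say,
`probe_ctype("en_US.UTF-8")` and compare the fields.  A locale that
reports `utf8_codeset` but not `non_ascii_alpha` would be one that
names UTF-8 without classifying characters beyond ASCII; whether a
localedef-built C.UTF-8 actually behaves that way is not something
this sketch asserts.
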

|The first Issue 8 draft is focusing, afaik, on adding the C1x changes
|and Mantis Issue 8 tagged items. The changes to XBD 6, 7, etc., that
|will formally add a POSIX UTF8 locale are to be part of the second,
|maybe third, draft. This is why you don't see them yet.
|For maximum compatibility with existing practice the required base
|repertoire for this will likely be some subset of UCS-2, plus ISO-6429

16-bit characters i do not see in POSIX, going that route would make
impossible implementations which use specific bit patterns in wchar_t,
which, if i recall correctly from 2014 or when i was looking into the
issue, is used by at least the Citrus implementation of the mb* and w*
series for at least some Asian languages.  And more .. but that was
not the issue i am concerned about at the moment anyhow, i personally
would assume 8-bit aka UTF-8 character strings to be predominant in
Unix based systems, they surely are in the predominant ones.
(Even though, i have to say, UTF-16 aka 16-bit characters do have
their value for the majority of the massively declining number of
human languages, and the older i get the more i think using that as
a base is a good decision.)

|in full, not the complete range. I've hopes this will be significantly
|more than the minimal repertoire of C2x, but it may not as a matter

That made me look for and download a 2020 draft of ISO C2X, i did not
have a look until now.

|of deferral to the C standard. It should be left up to implementations
|still, in my opinion, how much of the range beyond this base they want
|to support as extensions, including UTF16 as an encoding. How the LC_*
|categories will be extended to fully support that base repertoire
|according to the Unicode requirements hasn't been determined yet either,
|but this is the nominal goal.

And from a glance i do not see anything Unicode-enabled-locale wise.
UTF-16 specifically i do not see ...
as you will have to convert on input and on output in order to use it
in your program, and then you can very well convert to the transparent
wchar_t, or use the wide I/O series which gives it to you.

Minimizing the tremendous deficiency that many traditional Unix
programs have to face because the historic string interfaces do not
provide proper functionality to deal with human languages is out of
scope, is it?  At least it seems as if ISO C2X introduces support for
UTF-8 as a native string representation ... in practice it seems Unix
people use GNU libunistring (which explicitly supports UTF-(32|16|8)
i think) as well as ICU (which i think used UTF-16 internally but
offered improved UTF-8 interface performance by then), so the ISO
standard people were able to simply ignore their responsibility and
focused on mysterious s..t decisions, and POSIX has to follow ISO C
suit for one, and then simply had not the resources to define an
entire Unicode string interface by themselves ... and so practice has
created its own Genesis.

Thank you.
And ciao from Germany,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)