Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Geoff Clare wrote in
 <20220526085434.GA19184@localhost>:
|Steffen Nurpmeso wrote, on 24 May 2022:
|>
|> I find that "setlocale() may invalidate the string" painful,
|> because many functions of the C library do not have _l() variants
|> that could work with a uselocale() object. Just think about the
|> scanf() that is used so often, or strtol(): you cannot even
|> convert a number by standard means.
|
|You are mixing up uselocale() and newlocale().
|
|The _l() functions and uselocale() are different ways to make use
|of a locale object obtained from newlocale().
|
|If there is no _l() function, you can pass the locale object to
|uselocale() to set a thread-local current locale which must then
|be used by functions that use the current locale, such as scanf()
|and strtol(). These functions only use the "global locale" (set
|by setlocale()) if there is no thread-local current locale set.

That is true.
(But i think this is one more occasion where Stroustrup's "a C++
may even be faster, because problems can be solved differently",
cited more or less correctly, C++ 98, turns out to be correct.)

|--
|Geoff Clare
|The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
--End of <20220526085434.GA19184@localhost>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Steffen Nurpmeso wrote, on 24 May 2022:
>
> I find that "setlocale() may invalidate the string" painful,
> because many functions of the C library do not have _l() variants
> that could work with a uselocale() object. Just think about the
> scanf() that is used so often, or strtol(): you cannot even
> convert a number by standard means.

You are mixing up uselocale() and newlocale().

The _l() functions and uselocale() are different ways to make use
of a locale object obtained from newlocale().

If there is no _l() function, you can pass the locale object to
uselocale() to set a thread-local current locale which must then
be used by functions that use the current locale, such as scanf()
and strtol(). These functions only use the "global locale" (set
by setlocale()) if there is no thread-local current locale set.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
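A minimal sketch of the approach described in this message, assuming a
POSIX.1-2008 system: a locale object from newlocale() is installed as the
thread-local current locale with uselocale(), so that a function without an
_l() variant follows it, and the previous locale is restored afterwards.
The helper name parse_double_c_locale() is hypothetical, and strtod() is
used here instead of strtol() only because its decimal point visibly
depends on the locale.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper (not from the thread): parse a number using the
 * rules of the "C" locale without touching the global locale that was
 * set by setlocale().  Returns 0 on success, -1 on failure. */
int parse_double_c_locale(const char *text, double *result)
{
    /* Create a locale object for the "C" locale. */
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;

    /* Install it as this thread's current locale; remember the old one
     * (which may be LC_GLOBAL_LOCALE). */
    locale_t previous = uselocale(c_loc);

    char *end;
    *result = strtod(text, &end);   /* now follows c_loc, not setlocale() */

    /* Restore the previous thread-local setting and release the object. */
    uselocale(previous);
    freelocale(c_loc);

    return (end == text) ? -1 : 0;  /* -1 if nothing was converted */
}
```

Called as parse_double_c_locale("1.5", &value), this should accept "." as
the decimal point even if the global locale uses a comma. Some systems may
need an additional header for newlocale(); that is not covered here.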
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 573
>
> In today's call we made changes along the lines you suggest. Please
> check the updated etherpad to see if they achieve what you wanted.

The change is good, from my POV. Thank you.

Bruno
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Geoff Clare wrote in
 <20220524091849.GC25920@localhost>:
|Bruno Haible wrote, on 12 May 2022:
|>
|> https://posix.rhansen.org/p/gettext_draft
|> Line 573
|>
|> "The application shall ensure that the codeset argument, if non-empty, is a
|> valid codeset name that can be used as the tocode argument of the iconv_open()
|> function."
|>
|> This is not the only requirement. We also need the requirement that the NUL
|> character of ASCII maps to a single NUL byte in the codeset. Otherwise the
|> iconv() processing inside gettext() is likely to malfunction.
|>
|> Suggestion: Change
|> "... iconv_open() function."
|> to
|> "... iconv_open() function, and that the NUL character corresponds to a
|> single NUL byte in codeset. So, the codeset may not be, for example,
|> "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
|> "UTF-32LE", "UTF-7"."
|
|In today's call we made changes along the lines you suggest. Please
|check the updated etherpad to see if they achieve what you wanted.

But can it be any more generic than that

  in the codeset it specifies, the NUL character corresponds to a
  single NUL byte.

that is the question.

I personally never liked gettext(). I just did something with
a dictionary, and used block-injecting C preprocessor macros for
calls, because the ({ static size_t gen_cnt;.. }) right-hand-side
extension never made it into a standard, and it is wasteful to
call functions for nothing, especially when the gen_cnt will be
set only once and never change in "real life".

I find that "setlocale() may invalidate the string" painful,
because many functions of the C library do not have _l() variants
that could work with a uselocale() object. Just think about the
scanf() that is used so often, or strtol(): you cannot even
convert a number by standard means.

If i were to design this, i would center on bindtextdomain(), and
just keep it going.
That is of course easier said than done, as only existing
behaviour is streamlined and standardized.

--steffen
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 573
>
> "The application shall ensure that the codeset argument, if non-empty, is a
> valid codeset name that can be used as the tocode argument of the iconv_open()
> function."
>
> This is not the only requirement. We also need the requirement that the NUL
> character of ASCII maps to a single NUL byte in the codeset. Otherwise the
> iconv() processing inside gettext() is likely to malfunction.
>
> Suggestion: Change
> "... iconv_open() function."
> to
> "... iconv_open() function, and that the NUL character corresponds to a
> single NUL byte in codeset. So, the codeset may not be, for example,
> "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
> "UTF-32LE", "UTF-7"."

In today's call we made changes along the lines you suggest. Please
check the updated etherpad to see if they achieve what you wanted.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Hello!

Steffen Nurpmeso wrote in
 <20220513135904.hhnsw%stef...@sdaoden.eu>:
|Steffen Nurpmeso wrote in
| <20220513132857.xzhqq%stef...@sdaoden.eu>:
||Harald van Dijk wrote in
|| <9aa0b43f-c5de-1698-9f34-c725a40e6...@gigawatt.nl>:
|||On 12/05/2022 23:10, Steffen Nurpmeso wrote:
|||> Harald van Dijk wrote in
|||> :
|||>|On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open
|||>|Group wrote:
|||>|> Bruno Haible wrote in
|||>|> <4298913.vrqWZg68TM@omega>:
| ...
|||> LC_ALL=C printf 'ab\0' | iconv -f iso-8859-1 -t utf-16 | od -t c
|||> 0000000 \0 \0 a \0 b \0 \0 \0
|||>
|||> Two leading NULs?
|||
|||This is not what GNU iconv prints at all, at least not on my system,
|||which just uses the GNU version unmodified. Rather, it prints
||
||Interesting. Unmodified here too. Bruno Haible contacted me in
||private, i gave him all i have.
|
|Looking at the code (iconvdata/utf-16.c) i admit i fail to see how
|this can happen, except maybe due to gcc 11.2.0 miscompilation
|(CFLAGS="-O2 -march=x86-64 -pipe", shall that be honoured). The
|above is however surely what i see here, reproducably.

Bruno Haible had the fantastic idea of checking od(1), and that
was it! When i use hexdump -C the BOM is back. Or the GNU
version of od(1).

--steffen
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
On 13/05/2022 14:28, Steffen Nurpmeso wrote:
> You again strip content of follow-up RFCs.

In my previous message, I quoted your message that I replied to in full.
I stripped absolutely nothing and do not appreciate your utterly false
claim here. Retract it.

Harald van Dijk
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Harald van Dijk wrote in
 <9aa0b43f-c5de-1698-9f34-c725a40e6...@gigawatt.nl>:
|On 12/05/2022 23:10, Steffen Nurpmeso wrote:
|> Harald van Dijk wrote in
|> :
|>|On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open
|>|Group wrote:
|>|> Bruno Haible wrote in
|>|> <4298913.vrqWZg68TM@omega>:
|>|>|Steffen Nurpmeso wrote:
|>|>|> ...
|>|>|>| [.] "UTF-7"."
|>|>|>
|>|>|> That is overshoot.
|>|>|
|>|>|No. UTF-7 is invalid here because it produces output that is not NUL
|>|>|terminated. See:
|>|>|
|>|>|$ printf 'ab\0' | iconv -t UTF-7 | od -t c
|>|>|0000000 a b + A A A -
|>|>|0000007
|>|>|
|>|>|strlen() on such a return value makes invalid memory accesses.
|>|>|You can convince yourself by running
|>|>|$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
|>|>
|>|> This is then surely bogus? UTF-7 is a normal single byte
|>|> character set and is to be terminated like anything else. Nothing
|>|> in RFC 2152 nor RFC 3501 if you want makes me think something
|>|> else.
|>|
|>|RFC 2152's rules 1 and 3 only allow specifying the listed characters as
|>|their ASCII form. All other characters, including U+0000, must be
|>|encoded using rule 2. GNU iconv is doing what the RFC specifies here.
|>
|> No really, please. And please do not strip important content,
|
|I didn't think I did. You didn't read the RFC properly, I replied to

You again strip content of follow-up RFCs.
I have implemented UTF-7, and i definitely terminate C-style
strings.

 ...
|> LC_ALL=C printf 'ab\0' | iconv -f iso-8859-1 -t utf-16 | od -t c
|> 0000000 \0 \0 a \0 b \0 \0 \0
|>
|> Two leading NULs?
|
|This is not what GNU iconv prints at all, at least not on my system,
|which just uses the GNU version unmodified. Rather, it prints

Interesting. Unmodified here too. Bruno Haible contacted me in
private, i gave him all i have.

 ...
|you may want to report this, including steps on how to get a GNU iconv

I have given up on reporting bugs on sourceware bug tracker.
The reason is on this list i think.
I skip the rest.

--steffen
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
On 12/05/2022 23:10, Steffen Nurpmeso wrote:
> Harald van Dijk wrote in
> :
>|On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open
>|Group wrote:
>|> Bruno Haible wrote in
>|> <4298913.vrqWZg68TM@omega>:
>|>|Steffen Nurpmeso wrote:
>|>|> ...
>|>|>| [.] "UTF-7"."
>|>|>
>|>|> That is overshoot.
>|>|
>|>|No. UTF-7 is invalid here because it produces output that is not NUL
>|>|terminated. See:
>|>|
>|>|$ printf 'ab\0' | iconv -t UTF-7 | od -t c
>|>|0000000 a b + A A A -
>|>|0000007
>|>|
>|>|strlen() on such a return value makes invalid memory accesses.
>|>|You can convince yourself by running
>|>|$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
>|>
>|> This is then surely bogus? UTF-7 is a normal single byte
>|> character set and is to be terminated like anything else. Nothing
>|> in RFC 2152 nor RFC 3501 if you want makes me think something
>|> else.
>|
>|RFC 2152's rules 1 and 3 only allow specifying the listed characters as
>|their ASCII form. All other characters, including U+0000, must be
>|encoded using rule 2. GNU iconv is doing what the RFC specifies here.
>
> No really, please. And please do not strip important content,

I didn't think I did. You didn't read the RFC properly, I replied to
show where and how the RFC specifies exactly what GNU iconv does, the
rest of your message looks like it's based on the false assumption that
the RFC specifies something other than what it does, which becomes
irrelevant when that assumption is corrected.

Looking in more detail, there is one thing I should have responded to.
Included here.

> UTF-7. Heck, how about that, for example:
>
> LC_ALL=C printf 'ab\0' | iconv -f iso-8859-1 -t utf-16 | od -t c
> 0000000 \0 \0 a \0 b \0 \0 \0
>
> Two leading NULs?

This is not what GNU iconv prints at all, at least not on my system,
which just uses the GNU version unmodified. Rather, it prints

  0000000 377 376 a \0 b \0 \0 \0
  0000010

That is, it includes a BOM, just like it showed in your SunOS output.
Both the GNU iconv that is shipped as part of GNU libc 2.35, and the GNU
iconv that is shipped as part of GNU libiconv 1.16, print this. Those
are the current releases. If you are testing an older release, or a
modified version, that is important information missing from your
message. If you are seeing the leading null bytes in a current version,
you may want to report this, including steps on how to get a GNU iconv
that behaves this way.

> i am neither Chinese nor Russian, and especially not one of the
> other 7 billion that do not count.
> (I said surely bogus because i alone see the shiny light of having
> found give-me-five GNU iconv errors. Or even beyond that.)

This makes absolutely zero sense. I am including it only to pre-empt you
again claiming I am stripping important content.

Cheers,
Harald van Dijk
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Harald van Dijk wrote in
 :
|On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open
|Group wrote:
|> Bruno Haible wrote in
|> <4298913.vrqWZg68TM@omega>:
|>|Steffen Nurpmeso wrote:
|>|> ...
|>|>| [.] "UTF-7"."
|>|>
|>|> That is overshoot.
|>|
|>|No. UTF-7 is invalid here because it produces output that is not NUL
|>|terminated. See:
|>|
|>|$ printf 'ab\0' | iconv -t UTF-7 | od -t c
|>|0000000 a b + A A A -
|>|0000007
|>|
|>|strlen() on such a return value makes invalid memory accesses.
|>|You can convince yourself by running
|>|$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
|>
|> This is then surely bogus? UTF-7 is a normal single byte
|> character set and is to be terminated like anything else. Nothing
|> in RFC 2152 nor RFC 3501 if you want makes me think something
|> else.
|
|RFC 2152's rules 1 and 3 only allow specifying the listed characters as
|their ASCII form. All other characters, including U+0000, must be
|encoded using rule 2. GNU iconv is doing what the RFC specifies here.

No really, please. And please do not strip important content,
i am neither Chinese nor Russian, and especially not one of the
other 7 billion that do not count.
(I said surely bogus because i alone see the shiny light of having
found give-me-five GNU iconv errors. Or even beyond that.)

--steffen
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open
Group wrote:
> Bruno Haible wrote in
> <4298913.vrqWZg68TM@omega>:
>|Steffen Nurpmeso wrote:
>|> ...
>|>| [.] "UTF-7"."
>|>
>|> That is overshoot.
>|
>|No. UTF-7 is invalid here because it produces output that is not NUL
>|terminated. See:
>|
>|$ printf 'ab\0' | iconv -t UTF-7 | od -t c
>|0000000 a b + A A A -
>|0000007
>|
>|strlen() on such a return value makes invalid memory accesses.
>|You can convince yourself by running
>|$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
>
> This is then surely bogus? UTF-7 is a normal single byte
> character set and is to be terminated like anything else. Nothing
> in RFC 2152 nor RFC 3501 if you want makes me think something
> else.

RFC 2152's rules 1 and 3 only allow specifying the listed characters as
their ASCII form. All other characters, including U+0000, must be
encoded using rule 2. GNU iconv is doing what the RFC specifies here.

Cheers,
Harald van Dijk
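To illustrate the rule 2 encoding referred to in this message, here is a
small sketch (not taken from any iconv implementation) that encodes a
single UTF-16 code unit as a standalone UTF-7 shifted sequence: "+",
followed by the modified base64 form of the 16 bits, followed by "-". The
helper name utf7_encode_unit() is hypothetical, and the sketch always
writes the closing "-", as seen in the iconv output quoted in the thread.
For the NUL character (U+0000), the 16 zero bits are padded to 18 bits,
that is three base64 digits "AAA", which gives the "+AAA-" from the od(1)
output.

```c
#include <stddef.h>

/* Modified base64 alphabet used by UTF-7 (RFC 2152, rule 2). */
static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: encode one UTF-16 code unit as a standalone
 * shifted sequence "+xxx-".  out must have room for 6 bytes. */
void utf7_encode_unit(unsigned unit, char out[6])
{
    /* 16 data bits, padded with two zero bits to 18 bits = 3 sextets. */
    unsigned long bits = ((unsigned long)(unit & 0xFFFFu)) << 2;

    out[0] = '+';                       /* shift into base64 */
    out[1] = B64[(bits >> 12) & 0x3F];
    out[2] = B64[(bits >> 6) & 0x3F];
    out[3] = B64[bits & 0x3F];
    out[4] = '-';                       /* explicit shift out */
    out[5] = '\0';                      /* terminator of this C buffer only */
}
```

Note that the output for U+0000 contains no NUL byte of its own, which is
why strlen() on such a converted string runs past the end of the data.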
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Bruno Haible wrote in
 <4298913.vrqWZg68TM@omega>:
|Steffen Nurpmeso wrote:
|> ...
|>| [.] "UTF-7"."
|>
|> That is overshoot.
|
|No. UTF-7 is invalid here because it produces output that is not NUL
|terminated. See:
|
|$ printf 'ab\0' | iconv -t UTF-7 | od -t c
|0000000 a b + A A A -
|0000007
|
|strlen() on such a return value makes invalid memory accesses.
|You can convince yourself by running
|$ OUTPUT_CHARSET=UTF-7 valgrind ls --help

This is then surely bogus? UTF-7 is a normal single byte
character set and is to be terminated like anything else. Nothing
in RFC 2152 nor RFC 3501 if you want makes me think something
else. (RFC 5092 "IMAP URL Scheme", which invents the
sane-enough-to-think-yourself "UTF-7 -> UTF-16 -> UCS-4 -> UTF-8
-> HEX" conversion scheme, and reverse, even implies the opposite,
the example functions both NUL terminate the string.)
Except Mark Davis said something like "UTF-7 was a failure" once
on the Unicode ML, if i recall correctly, and i surely added
"sadly", given the Punycode mess with domain names. But one more
ship that sailed. But a pity it is.

Why should NUL be treated differently?? No. No, i think it is
a bug in GNU iconv that no one stumbled upon because no one is
using UTF-7. Heck, how about that, for example:

  LC_ALL=C printf 'ab\0' | iconv -f iso-8859-1 -t utf-16 | od -t c
  0000000 \0 \0 a \0 b \0 \0 \0

Two leading NULs?

  LC_ALL=C printf 'ab\0' | iconv -f iso-8859-1 -t ucs-2 | od -t c
  0000000 a \0 b \0 \0 \0

That yes.

  LC_ALL=C printf 'ab\0' | iconv -f iso-8859-1 -t utf-8 | od -t c
  0000000 a b \0

Yes.

  LC_ALL=C printf 'ab\0' | iconv -f iso-8859-1 -t utf-7 | od -t c
  0000000 a b + A A A -

No. Somehow they are all bogus, take SunOS 5.10:

  LC_ALL=C printf 'ab\0' | iconv -f iso-8859-1 -t utf-16 | od -t
  0000000 376 377 \0 a \0 b \0 \0

Ooh, now it gets scary!!

Interestingly OpenBSD 7.1 behaves the same, likely it is an old
instance of GNU iconv thus, there it says "GNU libiconv 1.16",
here it says "iconv (GNU libc) 2.35".
So unless someone convinces me otherwise, you are arguing based on
buggy software. UTF-7 is just another 7-bit single byte character
set, and thus.

--steffen
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Steffen Nurpmeso wrote:
> ...
> | [.] "UTF-7"."
>
> That is overshoot.

No. UTF-7 is invalid here because it produces output that is not NUL
terminated. See:

$ printf 'ab\0' | iconv -t UTF-7 | od -t c
0000000 a b + A A A -
0000007

strlen() on such a return value makes invalid memory accesses.
You can convince yourself by running
$ OUTPUT_CHARSET=UTF-7 valgrind ls --help

Bruno
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Bruno Haible wrote in
 <24562059.ssLaC8jLEa@omega>:
 ...
| [.] "UTF-7"."

That is overshoot.
(Though i'd wish they would have used it for internationalized
domain names.)

--steffen
POSIX bind_textdomain_codeset(): some invalid codeset arguments
https://posix.rhansen.org/p/gettext_draft
Line 573

"The application shall ensure that the codeset argument, if non-empty, is a
valid codeset name that can be used as the tocode argument of the iconv_open()
function."

This is not the only requirement. We also need the requirement that the NUL
character of ASCII maps to a single NUL byte in the codeset. Otherwise the
iconv() processing inside gettext() is likely to malfunction.

Suggestion: Change
"... iconv_open() function."
to
"... iconv_open() function, and that the NUL character corresponds to a
single NUL byte in codeset. So, the codeset may not be, for example,
"UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
"UTF-32LE", "UTF-7"."
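The proposed requirement can be probed at run time with iconv() itself.
The following is only a sketch under assumptions, not part of the draft
text: the helper name nul_is_single_nul_byte() is hypothetical, and it
assumes that the iconv implementation accepts "UTF-8" as a source codeset
name (POSIX does not standardize codeset names). It converts one NUL
character to the candidate codeset, flushes any shift state, and checks
whether exactly one NUL byte came out.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the NUL character converts to exactly
 * one NUL byte in `codeset`, 0 if it does not, and -1 if iconv_open()
 * rejects the codeset name. */
int nul_is_single_nul_byte(const char *codeset)
{
    /* Assumption: "UTF-8" is a valid fromcode on this system. */
    iconv_t cd = iconv_open(codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char in[1] = { '\0' };
    char out[32];
    char *inp = in;
    char *outp = out;
    size_t inleft = sizeof in;
    size_t outleft = sizeof out;

    /* Convert the single NUL character ... */
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    /* ... then flush, so stateful encodings emit their closing bytes. */
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return 0;

    size_t produced = sizeof out - outleft;
    return produced == 1 && out[0] == '\0';
}
```

With the iconv implementations discussed in this thread, one would expect
a result of 1 for "UTF-8" and 0 for codesets such as "UCS-2", "UTF-16" or
"UTF-7", though the exact set of accepted names depends on the system.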