Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-26 Thread Steffen Nurpmeso via austin-group-l at The Open Group
Geoff Clare wrote in
 <20220526085434.GA19184@localhost>:
 |Steffen Nurpmeso wrote, on 24 May 2022:
 |>
 |>   I find that "setlocale() may invalidate the string" painful,
 |>   because many functions of the C library do not have _l() variants
 |>   that could work with a uselocale() object.  Just think about the
 |>   scanf() that is used so often, or strtol(): you cannot even
 |>   convert a number by standard means.
 |
 |You are mixing up uselocale() and newlocale().
 |
 |The _l() functions and uselocale() are different ways to make use
 |of a locale object obtained from newlocale().
 |
 |If there is no _l() function, you can pass the locale object to
 |uselocale() to set a thread-local current locale which must then
 |be used by functions that use the current locale, such as scanf()
 |and strtol().  These functions only use the "global locale" (set
 |by setlocale()) if there is no thread-local current locale set.

That is true.
(But i think this is one more occasion where Stroustrup's "a C++
program may even be faster, because problems can be solved
differently", cited more or less correctly from the C++98 era,
turns out to be correct.)

 |-- 
 |Geoff Clare 
 |The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
 --End of <20220526085434.GA19184@localhost>

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-26 Thread Geoff Clare via austin-group-l at The Open Group
Steffen Nurpmeso wrote, on 24 May 2022:
>
>   I find that "setlocale() may invalidate the string" painful,
>   because many functions of the C library do not have _l() variants
>   that could work with a uselocale() object.  Just think about the
>   scanf() that is used so often, or strtol(): you cannot even
>   convert a number by standard means.

You are mixing up uselocale() and newlocale().

The _l() functions and uselocale() are different ways to make use
of a locale object obtained from newlocale().

If there is no _l() function, you can pass the locale object to
uselocale() to set a thread-local current locale which must then
be used by functions that use the current locale, such as scanf()
and strtol().  These functions only use the "global locale" (set
by setlocale()) if there is no thread-local current locale set.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group
Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 573
> 
> In today's call we made changes along the lines you suggest. Please
> check the updated etherpad to see if they achieve what you wanted.

The change is good, from my POV. Thank you.

Bruno





Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-24 Thread Steffen Nurpmeso via austin-group-l at The Open Group
Geoff Clare wrote in
 <20220524091849.GC25920@localhost>:
 |Bruno Haible wrote, on 12 May 2022:
 |>
 |> https://posix.rhansen.org/p/gettext_draft
 |> Line 573
 |> 
 |> "The application shall ensure that the codeset argument, if non-empty, is a
 |>  valid codeset name that can be used as the tocode argument of the iconv_open()
 |>  function."
 |> 
 |> This is not the only requirement. We also need the requirement that the NUL
 |> character of ASCII maps to a single NUL byte in the codeset. Otherwise the
 |> iconv() processing inside gettext() is likely to malfunction.
 |> 
 |> Suggestion: Change
 |> "... iconv_open() function."
 |> to
 |> "... iconv_open() function, and that the NUL character corresponds to a
 |>  single NUL byte in codeset. So, the codeset may not be, for example,
 |>  "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
 |>  "UTF-32LE", "UTF-7"."
 |
 |In today's call we made changes along the lines you suggest. Please
 |check the updated etherpad to see if they achieve what you wanted.

But can it be any more generic than

  that in the codeset it specifies, the NUL character corresponds
  to a single NUL byte.

that is the question.

  I personally never liked gettext().  I just did something with
  a dictionary, and used block-injecting C preprocessor macros for
  calls, because the ({ static size_t gen_cnt;.. })
  right-hand-side extension never made it into a standard, and it
  is wasteful to call functions for nothing, especially when the
  gen_cnt will be set only once and never change in "real life".

  I find that "setlocale() may invalidate the string" painful,
  because many functions of the C library do not have _l() variants
  that could work with a uselocale() object.  Just think about the
  scanf() that is used so often, or strtol(): you cannot even
  convert a number by standard means.
  If i were to design this, i would center on bindtextdomain(),
  and just keep it going.
  That is of course easier said than done, as only existing
  behaviour is streamlined and standardized.

--steffen



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-24 Thread Geoff Clare via austin-group-l at The Open Group
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 573
> 
> "The application shall ensure that the codeset argument, if non-empty, is a
>  valid codeset name that can be used as the tocode argument of the iconv_open()
>  function."
> 
> This is not the only requirement. We also need the requirement that the NUL
> character of ASCII maps to a single NUL byte in the codeset. Otherwise the
> iconv() processing inside gettext() is likely to malfunction.
> 
> Suggestion: Change
> "... iconv_open() function."
> to
> "... iconv_open() function, and that the NUL character corresponds to a
>  single NUL byte in codeset. So, the codeset may not be, for example,
>  "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
>  "UTF-32LE", "UTF-7"."

In today's call we made changes along the lines you suggest. Please
check the updated etherpad to see if they achieve what you wanted.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-13 Thread Steffen Nurpmeso via austin-group-l at The Open Group
Hello!

Steffen Nurpmeso wrote in
 <20220513135904.hhnsw%stef...@sdaoden.eu>:
 |Steffen Nurpmeso wrote in
 | <20220513132857.xzhqq%stef...@sdaoden.eu>:
 ||Harald van Dijk wrote in
 || <9aa0b43f-c5de-1698-9f34-c725a40e6...@gigawatt.nl>:
 |||On 12/05/2022 23:10, Steffen Nurpmeso wrote:
 |||> Harald van Dijk wrote in
 |||>   :
 |||>|On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open
 |||>|Group wrote:
 |||>|> Bruno Haible wrote in
 |||>|>   <4298913.vrqWZg68TM@omega>:
 | ...
 |||>   LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-16 | od -t c
 |||>   000  \0  \0   a  \0   b  \0  \0  \0
 |||> 
 |||> Two leading NULs?
 |||
 |||This is not what GNU iconv prints at all, at least not on my system, 
 |||which just uses the GNU version unmodified. Rather, it prints
 ||
 ||Interesting.  Unmodified here too.  Bruno Haible contacted me in
 ||private, i gave him all i have.
 |
 |Looking at the code (iconvdata/utf-16.c) i admit i fail to see how
 |this can happen, except maybe due to gcc 11.2.0 miscompilation
 |(CFLAGS="-O2 -march=x86-64 -pipe", shall that be honoured).  The
 |above is however surely what i see here, reproducibly.

Bruno Haible had the fantastic idea of checking od(1), and that
was it!  When i use hexdump -C the BOM is back.  Or the GNU
version of od(1).

--steffen



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-13 Thread Harald van Dijk via austin-group-l at The Open Group

On 13/05/2022 14:28, Steffen Nurpmeso wrote:

> You again strip content of follow-up RFCs.


In my previous message, I quoted your message that I replied to in full. 
I stripped absolutely nothing and do not appreciate your utterly false 
claim here. Retract it.


Harald van Dijk



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-13 Thread Steffen Nurpmeso via austin-group-l at The Open Group
Harald van Dijk wrote in
 <9aa0b43f-c5de-1698-9f34-c725a40e6...@gigawatt.nl>:
 |On 12/05/2022 23:10, Steffen Nurpmeso wrote:
 |> Harald van Dijk wrote in
 |>   :
 |>|On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open
 |>|Group wrote:
 |>|> Bruno Haible wrote in
 |>|>   <4298913.vrqWZg68TM@omega>:
 |>|>|Steffen Nurpmeso wrote:
 |>|>|>  ...
 |>|>|>| [.] "UTF-7"."
 |>|>|>
 |>|>|> That is overshoot.
 |>|>|
 |>|>|No. UTF-7 is invalid here because it produces output that is not NUL
 |>|>|terminated. See:
 |>|>|
 |>|>|$ printf 'ab\0' | iconv -t UTF-7 | od -t c
 |>|>|000   a   b   +   A   A   A   -
 |>|>|007
 |>|>|
 |>|>|strlen() on such a return value makes invalid memory accesses.
 |>|>|You can convince yourself by running
 |>|>|$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
 |>|>
 |>|> This is then surely bogus?  UTF-7 is a normal single byte
 |>|> character set and is to be terminated like anything else.  Nothing
 |>|> in RFC 2152 nor RFC 3501 if you want makes me think something
 |>|> else.
 |>|
 |>|RFC 2152's rules 1 and 3 only allow specifying the listed characters as
 |>|their ASCII form. All other characters, including U+0000, must be
 |>|encoded using rule 2. GNU iconv is doing what the RFC specifies here.
 |> 
 |> No really, please.  And please do not strip important content,
 |
 |I didn't think I did. You didn't read the RFC properly, I replied to 

You again strip content of follow-up RFCs.
I have implemented UTF-7, and i definitely terminate C-style
strings.

  ...
 |>   LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-16 | od -t c
 |>   000  \0  \0   a  \0   b  \0  \0  \0
 |> 
 |> Two leading NULs?
 |
 |This is not what GNU iconv prints at all, at least not on my system, 
 |which just uses the GNU version unmodified. Rather, it prints

Interesting.  Unmodified here too.  Bruno Haible contacted me in
private, i gave him all i have.

  ...
 |you may want to report this, including steps on how to get a GNU iconv 

I have given up on reporting bugs on sourceware bug tracker.
The reason is on this list i think.

I skip the rest.

--steffen



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-13 Thread Harald van Dijk via austin-group-l at The Open Group

On 12/05/2022 23:10, Steffen Nurpmeso wrote:

> Harald van Dijk wrote in
>   :
>   |On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open
>   |Group wrote:
>   |> Bruno Haible wrote in
>   |>   <4298913.vrqWZg68TM@omega>:
>   |>|Steffen Nurpmeso wrote:
>   |>|>  ...
>   |>|>| [.] "UTF-7"."
>   |>|>
>   |>|> That is overshoot.
>   |>|
>   |>|No. UTF-7 is invalid here because it produces output that is not NUL
>   |>|terminated. See:
>   |>|
>   |>|$ printf 'ab\0' | iconv -t UTF-7 | od -t c
>   |>|000   a   b   +   A   A   A   -
>   |>|007
>   |>|
>   |>|strlen() on such a return value makes invalid memory accesses.
>   |>|You can convince yourself by running
>   |>|$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
>   |>
>   |> This is then surely bogus?  UTF-7 is a normal single byte
>   |> character set and is to be terminated like anything else.  Nothing
>   |> in RFC 2152 nor RFC 3501 if you want makes me think something
>   |> else.
>   |
>   |RFC 2152's rules 1 and 3 only allow specifying the listed characters as
>   |their ASCII form. All other characters, including U+0000, must be
>   |encoded using rule 2. GNU iconv is doing what the RFC specifies here.
>
> No really, please.  And please do not strip important content,


I didn't think I did. You didn't read the RFC properly, I replied to 
show where and how the RFC specifies exactly what GNU iconv does, the 
rest of your message looks like it's based on the false assumption that 
the RFC specifies something other than what it does, which becomes 
irrelevant when that assumption is corrected. Looking in more detail, 
there is one thing I should have responded to. Included here.



> UTF-7.  Heck, how about that, for example:
>
>   LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-16 | od -t c
>   000  \0  \0   a  \0   b  \0  \0  \0
>
> Two leading NULs?


This is not what GNU iconv prints at all, at least not on my system, 
which just uses the GNU version unmodified. Rather, it prints


000 377 376   a  \0   b  \0  \0  \0
010

That is, it includes a BOM, just like it showed in your SunOS output. 
Both the GNU iconv that is shipped as part of GNU libc 2.35, and the GNU 
iconv that is shipped as part of GNU libiconv 1.16, print this. Those 
are the current releases. If you are testing an older release, or a 
modified version, that is important information missing from your 
message. If you are seeing the leading null bytes in a current version, 
you may want to report this, including steps on how to get a GNU iconv 
that behaves this way.
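
The two leading cells of that dump can be reproduced without iconv. A minimal sketch in Python, assuming a little-endian UTF-16 stream with a BOM as in the output above; `od_c` is a hypothetical helper that only approximates the `od -t c` rendering for the bytes seen here:

```python
import codecs

def od_c(data: bytes) -> list:
    """Approximate how `od -t c` renders each byte (sketch, not a full od)."""
    cells = []
    for byte in data:
        if byte == 0:
            cells.append("\\0")            # NUL is shown as \0
        elif 0x20 <= byte < 0x7F:
            cells.append(chr(byte))        # printable ASCII is shown as itself
        else:
            cells.append(f"{byte:03o}")    # everything else as three octal digits
    return cells

# "ab\0" as little-endian UTF-16, preceded by the byte order mark U+FEFF.
stream = codecs.BOM_UTF16_LE + "ab\0".encode("utf-16-le")
print(" ".join(od_c(stream)))
```

The first two cells are the BOM bytes 0xFF 0xFE, which are 377 376 in octal.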



> i am neither Chinese nor Russian, and especially not one of the
> other 7 billion that do not count.
> (I said surely bogus because i alone see the shiny light of having
> found give-me-five GNU iconv errors.  Or even beyond that.)


This makes absolutely zero sense. I am including it only to pre-empt you 
again claiming I am stripping important content.


Cheers,
Harald van Dijk



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-12 Thread Steffen Nurpmeso via austin-group-l at The Open Group
Harald van Dijk wrote in
 :
 |On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open 
 |Group wrote:
 |> Bruno Haible wrote in
 |>   <4298913.vrqWZg68TM@omega>:
 |>|Steffen Nurpmeso wrote:
 |>|>  ...
 |>|>| [.] "UTF-7"."
 |>|>
 |>|> That is overshoot.
 |>|
 |>|No. UTF-7 is invalid here because it produces output that is not NUL
 |>|terminated. See:
 |>|
 |>|$ printf 'ab\0' | iconv -t UTF-7 | od -t c
 |>|000   a   b   +   A   A   A   -
 |>|007
 |>|
 |>|strlen() on such a return value makes invalid memory accesses.
 |>|You can convince yourself by running
 |>|$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
 |> 
 |> This is then surely bogus?  UTF-7 is a normal single byte
 |> character set and is to be terminated like anything else.  Nothing
 |> in RFC 2152 nor RFC 3501 if you want makes me think something
 |> else.
 |
 |RFC 2152's rules 1 and 3 only allow specifying the listed characters as 
 |their ASCII form. All other characters, including U+0000, must be 
 |encoded using rule 2. GNU iconv is doing what the RFC specifies here.

No really, please.  And please do not strip important content,
i am neither Chinese nor Russian, and especially not one of the
other 7 billion that do not count.
(I said surely bogus because i alone see the shiny light of having
found give-me-five GNU iconv errors.  Or even beyond that.)

--steffen



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-12 Thread Harald van Dijk via austin-group-l at The Open Group
On 12/05/2022 18:19, Steffen Nurpmeso via austin-group-l at The Open 
Group wrote:

> Bruno Haible wrote in
>   <4298913.vrqWZg68TM@omega>:
>   |Steffen Nurpmeso wrote:
>   |>  ...
>   |>| [.] "UTF-7"."
>   |>
>   |> That is overshoot.
>   |
>   |No. UTF-7 is invalid here because it produces output that is not NUL
>   |terminated. See:
>   |
>   |$ printf 'ab\0' | iconv -t UTF-7 | od -t c
>   |000   a   b   +   A   A   A   -
>   |007
>   |
>   |strlen() on such a return value makes invalid memory accesses.
>   |You can convince yourself by running
>   |$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
>
> This is then surely bogus?  UTF-7 is a normal single byte
> character set and is to be terminated like anything else.  Nothing
> in RFC 2152 nor RFC 3501 if you want makes me think something
> else.


RFC 2152's rules 1 and 3 only allow specifying the listed characters as 
their ASCII form. All other characters, including U+0000, must be 
encoded using rule 2. GNU iconv is doing what the RFC specifies here.
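
A minimal sketch of those three rules in Python, for illustration only. The helper name `utf7_encode_char` is hypothetical, the character sets are transcribed from RFC 2152, and each character is encoded on its own, which is valid UTF-7 but not the most compact form:

```python
import base64
import string

# RFC 2152 Set D: letters, digits and nine punctuation characters (rule 1).
SET_D = set(string.ascii_letters + string.digits + "'(),-./:?")
# RFC 2152 Set O: optionally direct characters (rule 1).
SET_O = set('!"#$%&*;<=>@[]^_`{|}')
# Rule 3: space, tab, carriage return and line feed may be sent directly.
RULE_3 = set(" \t\r\n")

def utf7_encode_char(ch: str) -> bytes:
    """Encode one character as UTF-7 (illustrative sketch)."""
    if ch in SET_D or ch in SET_O or ch in RULE_3:
        return ch.encode("ascii")
    if ch == "+":
        return b"+-"                       # special case for a literal '+'
    # Rule 2: modified base64 (no '=' padding) of the UTF-16BE code units.
    b64 = base64.b64encode(ch.encode("utf-16-be")).rstrip(b"=")
    return b"+" + b64 + b"-"

print(utf7_encode_char("a"))     # directly encoded
print(utf7_encode_char("\0"))    # NUL is in none of the sets, so rule 2 applies
```

For the NUL character this yields `+AAA-`, the same bytes as in the iconv output quoted above.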


Cheers,
Harald van Dijk



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-12 Thread Steffen Nurpmeso via austin-group-l at The Open Group
Bruno Haible wrote in
 <4298913.vrqWZg68TM@omega>:
 |Steffen Nurpmeso wrote:
 |>  ...
 |>| [.] "UTF-7"."
 |> 
 |> That is overshoot.
 |
 |No. UTF-7 is invalid here because it produces output that is not NUL
 |terminated. See:
 |
 |$ printf 'ab\0' | iconv -t UTF-7 | od -t c
 |000   a   b   +   A   A   A   -
 |007
 |
 |strlen() on such a return value makes invalid memory accesses.
 |You can convince yourself by running
 |$ OUTPUT_CHARSET=UTF-7 valgrind ls --help

This is then surely bogus?  UTF-7 is a normal single byte
character set and is to be terminated like anything else.  Nothing
in RFC 2152 nor RFC 3501 if you want makes me think something
else.  (RFC 5092 "IMAP URL Scheme", which invents the sane-enough-
to-think-yourself "UTF-7 -> UTF-16 -> UCS-4 -> UTF-8 -> HEX"
conversion scheme, and reverse, even implies the opposite, the
example functions both NUL terminate the string.)
Except Mark Davis said something like "UTF-7 was a failure"
once on the Unicode ML, if i recall correctly, and i surely added
"sadly", given the Punycode mess with domain names.
But one more ship that sailed.  But a pity it is.
Why should NUL be treated differently??  No.  No, i think it is
a bug in GNU iconv that no one stumbled upon because no one is using
UTF-7.  Heck, how about that, for example:

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-16 | od -t c
  000  \0  \0   a  \0   b  \0  \0  \0

Two leading NULs?

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t ucs-2 | od -t c
  000   a  \0   b  \0  \0  \0

That yes.

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-8 | od -t c
  000   a   b  \0

Yes.

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-7 | od -t c
  000   a   b   +   A   A   A   -

No.  Somehow they are all bogus, take SunOS 5.10:

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-16 | od -t c
  000 376 377  \0   a  \0   b  \0  \0

Ooh, now it gets scary!!  Interestingly OpenBSD 7.1 behaves the
same, likely it is an old instance of GNU iconv thus, there it
says "GNU libiconv 1.16", here it says "iconv (GNU libc) 2.35".

So unless someone convinces me otherwise, you are arguing based on
buggy software.  UTF-7 is just another 7-bit single byte character
set, and thus is to be terminated like anything else.

--steffen



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group
Steffen Nurpmeso wrote:
>  ...
>  | [.] "UTF-7"."
> 
> That is overshoot.

No. UTF-7 is invalid here because it produces output that is not NUL
terminated. See:

$ printf 'ab\0' | iconv -t UTF-7 | od -t c
000   a   b   +   A   A   A   -
007

strlen() on such a return value makes invalid memory accesses.
You can convince yourself by running
$ OUTPUT_CHARSET=UTF-7 valgrind ls --help
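
The same effect can be seen in a few lines of Python, offered as a sketch only: Python's "utf-7" codec stands in for iconv here, and `c_strlen` is a hypothetical helper that mimics what strlen() looks for.

```python
def c_strlen(buf: bytes):
    """Return the offset of the first NUL byte, as strlen() would.

    Returns None when the buffer holds no NUL byte at all; real C code
    would keep reading past the end of the buffer in that case.
    """
    offset = buf.find(b"\x00")
    return offset if offset >= 0 else None

for codeset in ("utf-8", "utf-7"):
    converted = "ab\0".encode(codeset)
    print(codeset, converted, c_strlen(converted))
```

With "utf-8" the terminator survives the conversion; with "utf-7" the converted buffer contains no NUL byte, so there is nothing for strlen() to stop at.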

Bruno





Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-11 Thread Steffen Nurpmeso via austin-group-l at The Open Group
Bruno Haible wrote in
 <24562059.ssLaC8jLEa@omega>:
 ...
 | [.] "UTF-7"."

That is overshoot.
(Though i'd wish they would have used it for internationalized
domain names.)

--steffen



POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-11 Thread Bruno Haible via austin-group-l at The Open Group
https://posix.rhansen.org/p/gettext_draft
Line 573

"The application shall ensure that the codeset argument, if non-empty, is a
 valid codeset name that can be used as the tocode argument of the iconv_open()
 function."

This is not the only requirement. We also need the requirement that the NUL
character of ASCII maps to a single NUL byte in the codeset. Otherwise the
iconv() processing inside gettext() is likely to malfunction.

Suggestion: Change
"... iconv_open() function."
to
"... iconv_open() function, and that the NUL character corresponds to a
 single NUL byte in codeset. So, the codeset may not be, for example,
 "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
 "UTF-32LE", "UTF-7"."
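
A minimal sketch of how the proposed condition could be checked, written in Python rather than against iconv_open(). The function name `nul_is_single_nul_byte` is hypothetical, Python's codec registry stands in for the iconv codeset names, and only names that registry accepts are used:

```python
def nul_is_single_nul_byte(codeset: str) -> bool:
    """True if U+0000 converts to exactly one 0x00 byte in `codeset`."""
    return "\0".encode(codeset) == b"\x00"

candidates = ["utf-8", "iso-8859-1", "utf-16", "utf-16-be", "utf-16-le",
              "utf-32", "utf-32-be", "utf-32-le", "utf-7"]
for name in candidates:
    verdict = "ok" if nul_is_single_nul_byte(name) else "rejected"
    print(f"{name:12} {verdict}")
```

An unknown codeset name raises LookupError in this sketch, which loosely corresponds to iconv_open() failing.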