Re: POSIX gettext(): lifetime of returned values

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group
Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 357
> > ...
> > If temporarily switching a thread's locale through uselocale()
> > invalidates the gettext functions' results (even if only those from
> > the same thread), it effectively disallows uselocale() as a helper
> > function.
> 
> This was discussed in today's call, but we did not reach a conclusion.
> 
> Can you explain how glibc manages not to invalidate strings returned
> by gettext() when uselocale() is used to change the locale (without
> leaking memory - or does it leak memory?), in particular if codeset
> translation was needed.

First, let me clarify the term "memory leak". It means [1] that a piece
of memory is allocated and held for the rest of the runtime of the process.

IMO, it's useful to distinguish bounded and unbounded memory leaks:
  - A _bounded_ memory leak is one where the amount of leaked memory is
    bounded by an a-priori computable constant.
  - An _unbounded_ memory leak is one where such a bound does not exist.

Bounded memory leaks are noticeable when a program is run with memory
instrumentation, but do not make the program crash (assuming the bound
is smaller than the machine's available memory size).

Whereas unbounded memory leaks increase the memory size of the process,
typically linearly over time, and in the end make the process crash.

Bounded memory leaks already exist in a number of places in POSIX:
  - Most statically allocated caches are bounded memory leaks.
  - An application that calls setenv() a fixed number of times has a
    bounded memory leak.
  - An application that calls dlopen() a fixed number of times has a
    bounded memory leak.
  - An application that creates a fixed number of background threads
    (= threads which persist until exit()) has a bounded memory leak,
    because each thread consumes memory.

The gettext() implementation in glibc, when used with a fixed number of
domains, is a *bounded* memory leak. It's bounded, because there are only
a certain number of message catalogs (.mo files) that can be loaded, and
only a fixed number of possible locale encodings (UTF-8, ISO-8859-1, etc.).
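
To make the bounded/unbounded distinction concrete, here is a minimal C sketch. The names lookup_table(), scratch_buffer() and bytes_never_freed are hypothetical and exist only for this illustration; nothing here is taken from glibc. The first function allocates once and reuses the block, the second allocates on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Running total of memory that is allocated and never released. */
static size_t bytes_never_freed;

/* Bounded leak: the table is allocated on the first call only and then
   reused, so the leaked amount never exceeds 256 * sizeof(int).
   (Not thread-safe; a real cache would need a lock or once-init.) */
int *lookup_table(void)
{
    static int *table;

    if (table == NULL) {
        table = calloc(256, sizeof *table);
        if (table != NULL)
            bytes_never_freed += 256 * sizeof *table;
    }
    return table;
}

/* Unbounded leak: every call allocates a fresh block that the caller is
   (in this illustration) never going to free, so the total grows
   linearly with the number of calls. */
char *scratch_buffer(size_t size)
{
    assert(size > 0);
    char *p = malloc(size);

    if (p != NULL)
        bytes_never_freed += size;
    return p;
}
```

Calling lookup_table() a million times leaves bytes_never_freed unchanged after the first call, whereas each scratch_buffer() call adds to it.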

Now to your question:

> Can you explain how glibc manages not to invalidate strings returned
> by gettext() when uselocale() is used to change the locale (without
> leaking memory - or does it leak memory?), in particular if codeset
> translation was needed.

Glibc uses a cache of loaded message objects:

  GettextCache = Map ( .mo file name --> LoadedMoFile )

  LoadedMoFile = {
    contents of .mo file;
    MapOfConvertedContents;
    other data
  }

  MapOfConvertedContents = Map ( encoding --> ConvertedContents )

  ConvertedContents = {
    iconv_t iconv_descriptor;
    hash table/map ( msgid --> converted msgstr );
  }
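
A rough C rendering of the LoadedMoFile / ConvertedContents part of this structure might look as follows. This is a sketch under assumptions, not glibc's source: the struct and function names are hypothetical, the map is reduced to a linked list, and the iconv descriptor, the msgid table and all locking are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One converted view of a catalog, for one target encoding. */
struct converted_contents {
    char *encoding;                       /* e.g. "UTF-8" */
    /* iconv descriptor and msgid -> converted msgstr table omitted */
    struct converted_contents *next;
};

/* One loaded .mo file; raw contents and other data omitted. */
struct loaded_mo_file {
    struct converted_contents *converted; /* one node per encoding */
};

/* Find the node for `encoding`, creating it on first use. Nodes are never
   removed, so a pointer handed out earlier stays valid when another
   encoding is requested later. Returns NULL on allocation failure. */
struct converted_contents *
converted_for(struct loaded_mo_file *mo, const char *encoding)
{
    assert(mo != NULL && encoding != NULL);

    for (struct converted_contents *c = mo->converted; c != NULL; c = c->next)
        if (strcmp(c->encoding, encoding) == 0)
            return c;

    struct converted_contents *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->encoding = malloc(strlen(encoding) + 1);
    if (c->encoding == NULL) {
        free(c);
        return NULL;
    }
    strcpy(c->encoding, encoding);
    c->next = mo->converted;
    mo->converted = c;
    return c;
}
```

Because the number of nodes is limited by the number of encodings actually requested, the memory held by one loaded_mo_file stays bounded.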

Since this GettextCache does not have thread-dependent elements,

  * Lookups made in one thread also speed up lookups in other threads.
    (This is important for speed in multi-threaded applications.)

  * The result of gettext() in one thread can be used in other threads,
    with indefinite lifetime.

    For example, in a GUI application with a "main" thread and an
    event-handling thread, the main thread can prepare GUI elements
    with strings returned from gettext() (without copying them through
    strdup()), and the event-handling thread can access these GUI
    elements at any time.

  * uselocale() has no effect on the GettextCache. This has several
    consequences:

    + When the application does
        s1 = gettext (msgid);
        uselocale (...);
        s2 = gettext (msgid);
      in a way that the locale change also changes the locale encoding,
      s1 and s2 will be different (because looked up from different
      ConvertedContents objects from the same LoadedMoFile).

    + When the application does
        s1 = gettext (msgid);
        locale_t old_locale = uselocale (...);
        ... strtod() / sscanf() calls ...
        uselocale (old_locale);
        s2 = gettext (msgid);
      then, since the locale at the two gettext() calls is the same,
      s1 and s2 will be the same.

    + When an application's thread does
        s1 = gettext (msgid);
      and another thread does
        locale_t old_locale = uselocale (...);
        ... strtod() / sscanf() calls ...
        uselocale (old_locale);
      then the first thread can use s1 without caring about the second
      thread.
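
The second scenario can be written out as a small C sketch. The helper names parse_c_double() and translation_survives_helper() are hypothetical; the sketch assumes a system that provides <libintl.h> and the POSIX.1-2008 locale functions, and it relies on the behaviour described above rather than on anything the draft guarantees.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stdlib.h>

/* Parse a number using "." as the decimal separator, whatever the
   thread's current locale is, by switching to the "C" locale and back. */
double parse_c_double(const char *text)
{
    assert(text != NULL);

    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_locale == (locale_t)0)
        return strtod(text, NULL);      /* fall back to the current locale */

    locale_t old_locale = uselocale(c_locale);
    double value = strtod(text, NULL);
    uselocale(old_locale);              /* restore before returning */
    freelocale(c_locale);
    return value;
}

/* Returns 1 if a string obtained before the helper ran is the same
   pointer that gettext() returns after it. With the cache design
   described above this is expected to hold. */
int translation_survives_helper(const char *msgid)
{
    assert(msgid != NULL);

    const char *s1 = gettext(msgid);
    double value = parse_c_double("1.5");
    const char *s2 = gettext(msgid);

    return value == 1.5 && s1 == s2;
}
```

If uselocale() were allowed to invalidate s1, every helper written like parse_c_double() would silently break its callers.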

I hope this explains it: how gettext() can be implemented in a reasonable
way, without limiting the use of uselocale().

Bruno

[1] https://en.wikipedia.org/wiki/Memory_leak






Re: POSIX gettext(): behaviour if iconv() produces a replacement character

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group
Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 350
>
> In today's call we made changes along the lines you suggest. Please
> check the updated etherpad to see if they achieve what you wanted.

The new text achieves what I wanted; thank you.
There is a typo, though: a missing closing parenthesis after
``replacement-character''.

Bruno





Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group
Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 573
> 
> In today's call we made changes along the lines you suggest. Please
> check the updated etherpad to see if they achieve what you wanted.

The change is good, from my POV. Thank you.

Bruno





Re: POSIX gettext with option -s: handling of \c escape sequence

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group
Thanks for the reply.

Geoff Clare wrote:
> > This is NOT entirely how the gettext program from GNU gettext behaves. Namely,
> > it also looks whether some of the strings contain a '\c' sequence, in order to
> > emulate what BSD 'echo' does:
> > 
> > $ gettext -s -e 'ab\c' | od -t c
> > 000   a   b
> > 002
> > 
> > Whereas on Solaris, \c is not interpreted:
> > 
> > $ gettext -s -e 'ab\c' | od -t c
> > 000   a   b   c  \n
> > 004
> > 
> > How to resolve this?
> 
> In today's call we made changes to allow this handling of \c (using "may",
> so it is an implementation option).  Please check the updated etherpad to
> see if the way it is described there matches how GNU gettext behaves.

The updated text is good. GNU gettext will need a small change, in order
to accommodate the specified behaviour for the characters that follow '\c',
but that is OK since it is rare for users to add more characters after '\c'.

Bruno





Re: POSIX gettext(): multithread-safe or not?

2022-05-24 Thread Bruno Haible via austin-group-l at The Open Group
Thank you for the reply.

https://posix.rhansen.org/p/gettext_draft
Line 357

Geoff Clare wrote:
> However, we have rearranged the wording in a way that
> we hope makes it clearer it is a requirement on implementations.

Thank you; it is clearer now. It would be even clearer if there were a
paragraph break between the "The application shall ensure …" sentence
and the "A subsequent call …" sentence.

Bruno






Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-24 Thread Steffen Nurpmeso via austin-group-l at The Open Group
Geoff Clare wrote in
 <20220524091849.GC25920@localhost>:
 |Bruno Haible wrote, on 12 May 2022:
 |>
 |> https://posix.rhansen.org/p/gettext_draft
 |> Line 573
 |> 
 |> "The application shall ensure that the codeset argument, if non-empty, \
 |> is a
 |>  valid codeset name that can be used as the tocode argument of the \
 |>  iconv_open()
 |>  function."
 |> 
 |> This is not the only requirement. We also need the requirement that \
 |> the NUL
 |> character of ASCII maps to a single NUL byte in the codeset. Otherwise \
 |> the
 |> iconv() processing inside gettext() is likely to malfunction.
 |> 
 |> Suggestion: Change
 |> "... iconv_open() function."
 |> to
 |> "... iconv_open() function, and that the NUL character corresponds to a
 |>  single NUL byte in codeset. So, the codeset may not be, for example,
 |>  "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE"\
 |>  ,
 |>  "UTF-32LE", "UTF-7"."
 |
 |In today's call we made changes along the lines you suggest. Please
 |check the updated etherpad to see if they achieve what you wanted.

But can it be any more generic than

  that in the codeset it specifies, the NUL character corresponds
  to a single NUL byte.

that is the question.
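
One way to look at the "single NUL byte" property for a given codeset is to ask iconv() directly. The following C sketch does that; the function name nul_is_single_byte() is hypothetical, and the set of codeset names that iconv_open() accepts is implementation-specific.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert a single NUL character into `codeset` and inspect the result.
   Returns 1 if it comes out as exactly one NUL byte, 0 if it does not,
   and -1 if the codeset is unknown or the conversion fails.
   The source encoding is UTF-8, where NUL is the same byte as in ASCII. */
int nul_is_single_byte(const char *codeset)
{
    assert(codeset != NULL);

    iconv_t cd = iconv_open(codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char in[1] = { '\0' };
    char out[16];
    char *inp = in;
    char *outp = out;
    size_t inleft = sizeof in;
    size_t outleft = sizeof out;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1 || inleft != 0)
        return -1;

    size_t produced = sizeof out - outleft;
    return produced == 1 && out[0] == '\0';
}
```

Two-byte and four-byte encodings such as UCS-2 or UTF-32 produce more than one byte for NUL and therefore fail the check, which matches the list of excluded codesets in the suggestion quoted above.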

  I personally never liked gettext().  I just did something with
  a dictionary, and used block-injecting C preprocessor macros for
  calls, because the ({ static size_t gen_cnt;.. })
  right-hand-side extension never made it into a standard, and it
  is wasteful to call functions for nothing, especially when the
  gen_cnt will be set only once and never change in "real life".

  I find that "setlocale() may invalidate the string" painful,
  because many functions of the C library do not have _l() variants
  that could work with a uselocale() object.  Just think about the
  scanf() that is used so often, or strtol(): you cannot even
  convert a number by standard means.
  If i were to design this, i would center on bindtextdomain(),
  and just keep it going.
  That is of course easier said than done, as only existing
  behaviour is streamlined and standardized.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: POSIX and restrict

2022-05-24 Thread Geoff Clare via austin-group-l at The Open Group
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Lines 163..230, 538..543
> 
> The 'restrict' keywords in these declarations are useless and - worse -
> forbid some valid, useful calls. For example, there is nothing wrong
> with
>     dgettext("hello", "hello")
> which will attempt to search for a translation of "hello" in a catalog
> named hello.mo. There is also no imaginable optimization that can be done
> in the implementation of dgettext() by assuming that the two arguments
> were different.
> 
> 'restrict' is meaningful when at least one of the parameters is a
> writable pointer type. Here, all parameters are either non-pointers
> or read-only pointers.
> 
> Suggestion: Remove every 'restrict' in these declarations.

In yesterday's call we removed "restrict" everywhere in the etherpad.
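
For readers less familiar with the keyword, here is a short C sketch of the point made in the quoted mail. add_into() and lookup_hello() are hypothetical names used only for illustration; the sketch assumes a system that provides <libintl.h>.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>

/* Here 'restrict' is meaningful: dst is written through, so promising
   that dst and src do not overlap lets the compiler reorder or vectorize
   the loads and stores. */
void add_into(size_t n, double *restrict dst, const double *restrict src)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* dgettext() only reads both of its arguments, so passing the very same
   pointer as domain name and as msgid is harmless. With 'restrict' on
   the parameters, this call would have been outside what the prototype
   permits. */
const char *lookup_hello(void)
{
    static const char name[] = "hello";

    return dgettext(name, name);
}
```

When no translation is found, dgettext() returns its msgid argument, so lookup_hello() yields "hello" on a system without a hello.mo catalog for the current locale.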

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: POSIX gettext(): choosing the domain name

2022-05-24 Thread Geoff Clare via austin-group-l at The Open Group
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 50
> "often named after the application that provides the collection"
> 
> Issue: On my system, in /usr/share/locale/de/LC_MESSAGES/ there are
> 55 .mo files for libraries.
> 
> Suggestion: Change
> "after the application"
> ->
> "after the application or library"

In yesterday's call we made this change and also added "or libraries" at
the end of the sentence.

[I mistakenly said "today's call" in some earlier mails.]

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: POSIX gettext(): behaviour if iconv() produces a replacement character

2022-05-24 Thread Geoff Clare via austin-group-l at The Open Group
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 350
> 
> "If a significant proportion of the converted message string would consist
>  of characters resulting from non-identical conversions ..."
> 
> The term "significant proportion" is undefined.
> 
> Suggestion: Change
> "If a significant proportion of the converted message string would consist
>  of characters resulting from non-identical conversions that do not provide
>  any information about the character they were converted from (for example,
>  if the converted message string would be mostly  or
>   characters)"
> to
> "If at least one of the non-identical conversions produces a fallback
>  character (such as  or , depending
>  on implementation)"
> 
> Rationale: There is no point in forcing gettext() to accept the converted
> string when it has low quality.

In today's call we made changes along the lines you suggest. Please
check the updated etherpad to see if they achieve what you wanted.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments

2022-05-24 Thread Geoff Clare via austin-group-l at The Open Group
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 573
> 
> "The application shall ensure that the codeset argument, if non-empty, is a
>  valid codeset name that can be used as the tocode argument of the iconv_open()
>  function."
> 
> This is not the only requirement. We also need the requirement that the NUL
> character of ASCII maps to a single NUL byte in the codeset. Otherwise the
> iconv() processing inside gettext() is likely to malfunction.
> 
> Suggestion: Change
> "... iconv_open() function."
> to
> "... iconv_open() function, and that the NUL character corresponds to a
>  single NUL byte in codeset. So, the codeset may not be, for example,
>  "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
>  "UTF-32LE", "UTF-7"."

In today's call we made changes along the lines you suggest. Please
check the updated etherpad to see if they achieve what you wanted.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: POSIX gettext with option -s: handling of \c escape sequence

2022-05-24 Thread Geoff Clare via austin-group-l at The Open Group
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Lines 699, 721
> 
> "if the -n option is not specified, a <newline> shall be written after the
>  last message string"
> "(if -n is not also specified) append a <newline> to the output."
> 
> This is NOT entirely how the gettext program from GNU gettext behaves. Namely,
> it also looks whether some of the strings contain a '\c' sequence, in order to
> emulate what BSD 'echo' does:
> 
> $ gettext -s -e 'ab\c' | od -t c
> 000   a   b
> 002
> 
> Whereas on Solaris, \c is not interpreted:
> 
> $ gettext -s -e 'ab\c' | od -t c
> 000   a   b   c  \n
> 004
> 
> How to resolve this?

In today's call we made changes to allow this handling of \c (using "may",
so it is an implementation option).  Please check the updated etherpad to
see if the way it is described there matches how GNU gettext behaves.

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England



Re: POSIX gettext(): multithread-safe or not?

2022-05-24 Thread Geoff Clare via austin-group-l at The Open Group
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 357
> 
> "The returned string shall not be ... invalidated by a subsequent call
>  to a gettext family function."

This was discussed in yesterday's call.

> It is not clear whether this sentence is an assertion (regarding how the
> gettext() implementation behaves) or a requirement/restriction w.r.t. the
> application.

If it was a requirement on the application it would be worded as
"The application shall ensure ..." like the first sentence in that
paragraph.  However, we have rearranged the wording in a way that
we hope makes it clearer it is a requirement on implementations.

> In the latter case, the consequences of this restriction would be:
>   1) Multithreaded applications cannot use gettext, except during
>  initialization when only one thread exists.
>   2) Libraries cannot use gettext, otherwise multithreaded applications
>  cannot make use of them. And *many* applications are multithreaded
>  nowadays.

The requirement has nothing to do with multithreading. All functions
in POSIX.1 are required to be thread-safe except where explicitly
stated otherwise, and there is no exception for gettext stated in the
etherpad.

The requirement is intended to forbid the use of a thread-local
static buffer to store the returned string.
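
The kind of implementation this rules out can be sketched in C as follows. Both functions and the two-entry catalog are hypothetical and exist only to contrast the two storage strategies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The pattern the requirement forbids: the result lives in a per-thread
   static buffer, so the next call in the same thread overwrites the
   string that an earlier call returned. */
const char *lookup_overwriting(const char *msgid)
{
    static _Thread_local char buffer[64];

    assert(msgid != NULL);
    snprintf(buffer, sizeof buffer, "<%s>", msgid);
    return buffer;
}

/* A conforming pattern: each result points into storage that is never
   reused, so earlier results stay valid across later calls. */
const char *lookup_stable(const char *msgid)
{
    static const char *const catalog[][2] = {
        { "open",  "<open>"  },
        { "close", "<close>" },
    };

    assert(msgid != NULL);
    for (size_t i = 0; i < sizeof catalog / sizeof catalog[0]; i++)
        if (strcmp(catalog[i][0], msgid) == 0)
            return catalog[i][1];
    return msgid;           /* untranslated: hand back the msgid itself */
}
```

With lookup_overwriting(), a caller that keeps the result for "open" and then looks up "close" finds that the first pointer now reads "<close>".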

-- 
Geoff Clare 
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England