Re: POSIX gettext(): lifetime of returned values
Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 357
> > ...
> > If temporarily switching a thread's locale through uselocale()
> > invalidates the gettext functions' results (even if only those from
> > the same thread), it effectively disallows uselocale() as a helper
> > function.
>
> This was discussed in today's call, but we did not reach a conclusion.
>
> Can you explain how glibc manages not to invalidate strings returned
> by gettext() when uselocale() is used to change the locale (without
> leaking memory - or does it leak memory?), in particular if codeset
> translation was needed.

First, let me clarify the term "memory leak". It means [1] that a piece of
memory is allocated and held for the rest of the runtime of the process.

IMO, it's useful to distinguish bounded and unbounded memory leaks:
  - A _bounded_ memory leak is one where the amount of leaked memory is
    bounded by an a-priori computable constant.
  - An _unbounded_ memory leak is one where such a bound does not exist.

Bounded memory leaks are noticeable when a program is run with memory
instrumentation, but do not make the program crash (assuming the bound is
smaller than the machine's available memory size). Whereas unbounded memory
leaks increase the memory size of the process, typically linearly over time,
and in the end make the process crash.

Bounded memory leaks already exist in a number of places in POSIX:
  - Most statically allocated caches are bounded memory leaks.
  - An application that calls setenv() a fixed number of times has a
    bounded memory leak.
  - An application that calls dlopen() a fixed number of times has a
    bounded memory leak.
  - An application that creates a fixed number of background threads
    (= threads which persist until exit()) has a bounded memory leak,
    because each thread consumes memory.

The gettext() implementation in glibc, when used with a fixed number of
domains, is a *bounded* memory leak. It's bounded, because there are only a
certain number of message catalogs (.mo files) that can be loaded, and only
a fixed number of possible locale encodings (UTF-8, ISO-8859-1, etc.).

Now to your question:

> Can you explain how glibc manages not to invalidate strings returned
> by gettext() when uselocale() is used to change the locale (without
> leaking memory - or does it leak memory?), in particular if codeset
> translation was needed.

Glibc uses a cache of loaded message objects:

  GettextCache = Map ( .mo file name --> LoadedMoFile )

  LoadedMoFile = { contents of .mo file;
                   MapOfConvertedContents;
                   other data }

  MapOfConvertedContents = Map ( encoding --> ConvertedContents )

  ConvertedContents = { iconv_t iconv_descriptor;
                        hash table/map ( msgid --> converted msgstr );
                      }

Since this GettextCache does not have thread-dependent elements,

  * Lookups made in one thread speed up also the lookups in other threads.
    (This is important for speed in multi-threaded applications.)

  * The result of gettext() in one thread can be used in other threads,
    with indefinite lifetime. For example, in a GUI application with a
    "main" thread and an event-handling thread, the main thread can
    prepare GUI elements with strings returned from gettext() (without
    copying them through strdup()), and the event-handling thread can
    access these GUI elements at any time.

  * uselocale() has no effect on the GettextCache.

This has several consequences:

  + When the application does

        s1 = gettext (msgid);
        uselocale (...);
        s2 = gettext (msgid);

    in a way that the locale change also changes the locale encoding,
    s1 and s2 will be different (because looked up from different
    ConvertedContents objects from the same LoadedMoFile).

  + When the application does

        s1 = gettext (msgid);
        locale_t old_locale = uselocale (...);
        ... strtod() / sscanf() calls ...
        uselocale (old_locale);
        s2 = gettext (msgid);

    then, since the locale at the two gettext() calls is the same,
    s1 and s2 will be the same.

  + When an application's thread does

        s1 = gettext (msgid);

    and another thread does

        locale_t old_locale = uselocale (...);
        ... strtod() / sscanf() calls ...
        uselocale (old_locale);

    then the first thread can use s1 without caring about the second
    thread.

I hope this explains it: how gettext() can be implemented in a reasonable
way, without limiting the use of uselocale().

Bruno

[1] https://en.wikipedia.org/wiki/Memory_leak
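
[Editorial illustration, not part of the original mail.] The "uselocale() as a helper function" pattern described above (switch the calling thread's locale, call strtod(), switch back) could look roughly like the following sketch. The function name parse_double_in_c_locale is hypothetical; the sketch assumes a POSIX.1-2008 system that provides newlocale(), uselocale() and freelocale(), such as glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: parse a number that uses "." as the radix
   character, whatever locale the calling thread currently uses.
   Returns 0 on success, -1 on failure. */
int parse_double_in_c_locale(const char *text, double *result)
{
    assert(text != NULL && result != NULL);

    /* Create a locale object for the "C" locale. */
    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_locale == (locale_t)0)
        return -1;

    /* Switch the locale of the calling thread only. */
    locale_t old_locale = uselocale(c_locale);

    char *end;
    double value = strtod(text, &end);

    /* Restore the previous locale, then release the temporary one. */
    uselocale(old_locale);
    freelocale(c_locale);

    if (end == text)
        return -1;              /* strtod() consumed no characters */

    *result = value;
    return 0;
}
```

With the cache design described in the mail, a string obtained from gettext() before such a call would remain usable afterwards, because the helper never touches the gettext cache; the sketch itself does not demonstrate that property, it only shows the locale switch.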
Re: POSIX gettext(): behaviour if iconv() produces a replacement character
Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 350
>
> In today's call we made changes along the lines you suggest. Please
> check the updated etherpad to see if they achieve what you wanted.

The new text achieves what I wanted; thank you.

There is a typo, though: a missing closing parenthesis after
``replacement-character''.

Bruno
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Thank you for the reply.

Geoff Clare wrote:
> > https://posix.rhansen.org/p/gettext_draft
> > Line 573
>
> In today's call we made changes along the lines you suggest. Please
> check the updated etherpad to see if they achieve what you wanted.

The change is good, from my POV. Thank you.

Bruno
Re: POSIX gettext with option -s: handling of \c escape sequence
Thanks for the reply.

Geoff Clare wrote:
> > This is NOT entirely how the gettext program from GNU gettext behaves.
> > Namely, it also looks whether some of the strings contain a '\c'
> > sequence, in order to emulate what BSD 'echo' does:
> >
> > $ gettext -s -e 'ab\c' | od -t c
> > 000 a b
> > 002
> >
> > Whereas on Solaris, \c is not interpreted:
> >
> > $ gettext -s -e 'ab\c' | od -t c
> > 000 a b c \n
> > 004
> >
> > How to resolve this?
>
> In today's call we made changes to allow this handling of \c (using "may",
> so it is an implementation option). Please check the updated etherpad to
> see if the way it is described there matches how GNU gettext behaves.

The updated text is good.

GNU gettext will need a small change, in order to accommodate the specified
behaviour for the characters that follow '\c', but that is OK since it is
rare for users to add more characters after '\c'.

Bruno
Re: POSIX gettext(): multithread-safe or not?
Thank you for the reply.

https://posix.rhansen.org/p/gettext_draft
Line 357

Geoff Clare wrote:
> However, we have rearranged the wording in a way that
> we hope makes it clearer it is a requirement on implementations.

Thank you; it is clearer now.

It would be even clearer if there was a paragraph break between the
"The application shall ensure …" sentence and the "A subsequent call …"
sentence.

Bruno
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Geoff Clare wrote in <20220524091849.GC25920@localhost>:
 |Bruno Haible wrote, on 12 May 2022:
 |>
 |> https://posix.rhansen.org/p/gettext_draft
 |> Line 573
 |>
 |> "The application shall ensure that the codeset argument, if non-empty,
 |> is a valid codeset name that can be used as the tocode argument of the
 |> iconv_open() function."
 |>
 |> This is not the only requirement. We also need the requirement that
 |> the NUL character of ASCII maps to a single NUL byte in the codeset.
 |> Otherwise the iconv() processing inside gettext() is likely to
 |> malfunction.
 |>
 |> Suggestion: Change
 |>   "... iconv_open() function."
 |> to
 |>   "... iconv_open() function, and that the NUL character corresponds to a
 |>    single NUL byte in codeset. So, the codeset may not be, for example,
 |>    "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32",
 |>    "UTF-32BE", "UTF-32LE", "UTF-7"."
 |
 |In today's call we made changes along the lines you suggest. Please
 |check the updated etherpad to see if they achieve what you wanted.

But can it be any more generic than that: in the codeset it specifies, the
NUL character corresponds to a single NUL byte? That is the question.

I personally never liked gettext(). I just did something with a dictionary,
and used block-injecting C preprocessor macros for calls, because the
({ static size_t gen_cnt;.. }) right-hand-side extension never made it into
a standard, and it is wasteful to call functions for nothing, especially
when the gen_cnt will be set only once and never change in "real life".

I find that "setlocale() may invalidate the string" painful, because many
functions of the C library do not have _l() variants that could work with a
uselocale() object. Just think about the scanf() that is used so often, or
strtol(): you cannot even convert a number by standard means.

If I were to design this, I would center on bindtextdomain(), and just keep
it going. That is of course easier said than done, as only existing
behaviour is streamlined and standardized.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: POSIX and restrict
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Lines 163..230, 538..543
>
> The 'restrict' keywords in these declarations are useless and - worse -
> forbid some valid, useful calls. For example, there is nothing wrong
> with
>     dgettext("hello", "hello")
> which will attempt to search for a translation of "hello" in a catalog
> name hello.mo. There is also no imaginable optimization that can be done
> in the implementation of dgettext() by assuming that the two arguments
> were different.
>
> 'restrict' is meaningful when at least one of the parameters is a
> writable pointer type. Here, all parameters are either non-pointers
> or read-only pointers.
>
> Suggestion: Remove every 'restrict' in these declarations.

In yesterday's call we removed "restrict" everywhere in the etherpad.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: POSIX gettext(): choosing the domain name
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 50
> "often named after the application that provides the collection"
>
> Issue: On my system, in /usr/share/locale/de/LC_MESSAGES/ there are
> 55 .mo files for libraries.
>
> Suggestion: Change
>   "after the application"
> ->
>   "after the application or library"

In yesterday's call we made this change and also added "or libraries"
at the end of the sentence.

[I mistakenly said "today's call" in some earlier mails.]

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: POSIX gettext(): behaviour if iconv() produces a replacement character
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 350
>
> "If a significant proportion of the converted message string would consist
> of characters resulting from non-identical conversions ..."
>
> The term "significant proportion" is undefined.
>
> Suggestion: Change
>   "If a significant proportion of the converted message string would consist
>    of characters resulting from non-identical conversions that do not provide
>    any information about the character they were converted from (for example,
>    if the converted message string would be mostly or
>    characters)"
> to
>   "If at least one of the non-identical conversions produces a fallback
>    character (such as or , depending
>    on implementation)"
>
> Rationale: There is no point in forcing gettext() to accept the converted
> string when it has low quality.

In today's call we made changes along the lines you suggest. Please
check the updated etherpad to see if they achieve what you wanted.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: POSIX bind_textdomain_codeset(): some invalid codeset arguments
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 573
>
> "The application shall ensure that the codeset argument, if non-empty, is a
> valid codeset name that can be used as the tocode argument of the
> iconv_open() function."
>
> This is not the only requirement. We also need the requirement that the NUL
> character of ASCII maps to a single NUL byte in the codeset. Otherwise the
> iconv() processing inside gettext() is likely to malfunction.
>
> Suggestion: Change
>   "... iconv_open() function."
> to
>   "... iconv_open() function, and that the NUL character corresponds to a
>    single NUL byte in codeset. So, the codeset may not be, for example,
>    "UCS-2", "UTF-16", "UTF-16BE", "UTF-16LE", "UCS-4", "UTF-32", "UTF-32BE",
>    "UTF-32LE", "UTF-7"."

In today's call we made changes along the lines you suggest. Please
check the updated etherpad to see if they achieve what you wanted.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
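
[Editorial illustration, not part of the original mail.] The NUL-byte requirement discussed above can be made concrete with a small sketch. It converts an ASCII string with iconv() and returns the number of bytes produced, so that the caller can look for zero bytes in the result. The helper name convert_from_ascii is hypothetical, and the sketch assumes an iconv implementation that accepts the codeset names "ASCII", "UTF-8" and "UTF-16LE" (as glibc does).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the ASCII string `text` to `codeset`,
   store the converted bytes in `out`, and return how many bytes were
   written, or (size_t)-1 on failure. No terminator is appended. */
size_t convert_from_ascii(const char *codeset, const char *text,
                          char *out, size_t out_size)
{
    assert(codeset != NULL && text != NULL && out != NULL);

    iconv_t cd = iconv_open(codeset, "ASCII");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes char **, but it does not modify the input bytes. */
    char *in_ptr = (char *)text;
    size_t in_left = strlen(text);
    char *out_ptr = out;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    return out_size - out_left;
}
```

Converting "ab" to "UTF-16LE" this way yields the four bytes 'a', 0, 'b', 0, whereas converting to "UTF-8" yields two bytes without any zero byte. Since gettext() returns a NUL-terminated char string, a caller would see the UTF-16LE result as ending after its first byte, which is the kind of malfunction the proposed wording is meant to rule out.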
Re: POSIX gettext with option -s: handling of \c escape sequence
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Lines 699, 721
>
> "if the -n option is not specified, a <newline> shall be written after the
> last message string"
> "(if -n is not also specified) append a <newline> to the output."
>
> This is NOT entirely how the gettext program from GNU gettext behaves.
> Namely, it also looks whether some of the strings contain a '\c' sequence,
> in order to emulate what BSD 'echo' does:
>
> $ gettext -s -e 'ab\c' | od -t c
> 000 a b
> 002
>
> Whereas on Solaris, \c is not interpreted:
>
> $ gettext -s -e 'ab\c' | od -t c
> 000 a b c \n
> 004
>
> How to resolve this?

In today's call we made changes to allow this handling of \c (using "may",
so it is an implementation option). Please check the updated etherpad to
see if the way it is described there matches how GNU gettext behaves.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
Re: POSIX gettext(): multithread-safe or not?
Bruno Haible wrote, on 12 May 2022:
>
> https://posix.rhansen.org/p/gettext_draft
> Line 357
>
> "The returned string shall not be ... invalidated by a subsequent call
> to a gettext family function."

This was discussed in yesterday's call.

> It is not clear whether this sentence is an assertion (regarding how the
> gettext() implementation behaves) or a requirement/restriction w.r.t. the
> application.

If it was a requirement on the application it would be worded as
"The application shall ensure ..." like the first sentence in that
paragraph. However, we have rearranged the wording in a way that
we hope makes it clearer it is a requirement on implementations.

> In the latter case, the consequences of this restriction would be:
>   1) Multithreaded applications cannot use gettext, except during
>      initialization when only one thread exists.
>   2) Libraries cannot use gettext, otherwise multithreaded applications
>      cannot make use of them. And *many* applications are multithreaded
>      nowadays.

The requirement has nothing to do with multithreading. All functions
in POSIX.1 are required to be thread-safe except where explicitly
stated otherwise, and there is no exception for gettext stated in the
etherpad.

The requirement is intended to forbid the use of a thread-local static
buffer to store the returned string.

--
Geoff Clare
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England