Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF

Tomas Kalibera Tue, 31 Jan 2023 03:52:20 -0800


On 1/31/23 11:50, Martin Maechler wrote:

Tomas Kalibera
     on Tue, 31 Jan 2023 10:53:21 +0100 writes:

     > On 1/31/23 09:48, Ivan Krylov wrote:
     >> Can we use the "bytes" encoding for such environment variables invalid
     >> in the current locale? The following patch preserves CE_NATIVE for
     >> strings valid in the current UTF-8 or multibyte locale (or
     >> non-multibyte strings) but sets CE_BYTES for those that are invalid:
     >>
     >> Index: src/main/sysutils.c
     >> ===================================================================
     >> --- src/main/sysutils.c   (revision 83731)
.....
     >>
     >> Here are the potential problems with this approach:
     >>
     >> * I don't know whether known_to_be_utf8 can disagree with utf8locale.
     >> known_to_be_utf8 was the original condition for setting CE_UTF8 on
     >> the string. I also need to detect non-UTF-8 multibyte locales, so
     >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more
     >> careful and test for (enc == CE_UTF8) || (utf8locale && enc ==
     >> CE_NATIVE) instead of just utf8locale.
     >>
     >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid
     >> strings in the environment with this patch applied, but now
     >> print.Dlist does, because formatDL wants to compute the width of the
     >> string which has the 'bytes' encoding. If this is a good way to
     >> solve the problem, I can work on suggesting a fix for formatDL to
     >> avoid the error.

     > Thanks, indeed, type instability is a big problem of the approach "turn
     > invalid strings to bytes". It is something what is historically being
     > done in regular expression operations, but it is brittle and not user
     > friendly: writing code to be agnostic to whether we are dealing with
     > "bytes" or a regular string is very tedious. Pieces of type instability
     > come also from that ASCII strings are always flagged "native" (never
     > "bytes"). Last year I had to revert a big change which broke existing
     > code by introducing some more of this instability due to better dealing
     > with invalid strings in regular expressions. I've made some additions to
     > R-devel allowing to better deal with such instability but it is still a
     > pain and existing code has to be changed (and made more complicated).

     > So, I don't think this is the way to go.

     > Tomas

hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
but of course you are right that type stability is really a
valid goal....

In general, what about behaving close to "old R" and replace all such
strings by  NA_character_  (and typically raising one warning)?
This would keep the result a valid character vector, just with some NA entries.

Specifically for  Sys.getenv(),  I still think Simon has a very
valid point of "requiring" (of our design) that
Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
Sys.getenv("BOOM")

Also, as typical R user, I'd definitely want to be able to get all the valid
environment variables, even if there are one or more invalid
ones. ... and similarly in other cases, it may be a cheap
strategy to replace invalid strings ("string" in the sense of
length 1 STRSXP, i.e., in R, a "character" of length 1) by
NA_character_  and keep all valid parts of the character vector
in a valid encoding.

In case of specifically getenv(), yes, we could return NA for variablescontaining invalid strings, both when obtaining a value for a singlevariable and for multiple, partially matching undocumented andunintentional behavior R had before 4.1, and making getenv(var) andgetenv()[[var]] consistent even with invalid strings. Once we decide onhow to deal with invalid strings in general, we can change this againaccordingly, breaking code for people who depend on these things (but sofar I only know about this one case). Perhaps this would be a logicalalternative to the Python approach that would be more suitable for R(given we have NAs and given that we happened to have that somewhatsimilar alternative before). Conceptually it is about the same thing asomitting the variable in Python: R users would not be able to use suchvariables, but if they don't touch them, they could be inherited tochild processes, etc.

- - - - - -

In addition to the above cheap "replace by NA"-strategy,
and at least once R is "all UTF-8", we could
also consider a more expensive strategy that would try to
replace invalid characters/byte-sequences by one specific valid UTF-8
character, i.e., glyph (think of a special version of "?") that
we would designate as replacement for "invalid-in-current-encoding".

Probably such a glyph already exists and we have seen it used in
some console's output when having to print such things.

This is part of the big discussion where we couldn't find a consensus,and I think this happened so recently that it is too early for openingit again. Also, we are not yet to where we could assume UTF-8 is alwaysthe native encoding. As you know my currently preferred solution is toallow invalid portions of UTF-8 and treat them in individual functions:often one can proceed with invalid UTF-8 bytes as usual (e.g. copying,concatenation, search), in some function one has to use a specialalgorithm (e.g. regular expressions, PCRE2 supports that, in some casesone would substitute characters, in some cases one would throw anerror). As strings are immutable in R, validity checks could be doneonce for each string. This would in many cases allow for UTF-8 stringswith bits of errors to "go through" R without triggering errors. Wedidn't have such options with other multi-byte encodings, but UTF-8 isdesigned so that this is possible.

Substituting by various escapes "<U+..>", "??", etc, is something wealready do in R in some cases, but it cannot be the default mechanismfor dealing with invalid strings, e.g. file names are a prime example ofwhere it doesn't work and indeed there are libraries for dealing withfile names as special entities.

As you probably know part of the problem is also that invalid stringsthat are "tolerated" by OSes, file names and environment variables, can"originally" be in different "encodings". Sorry for these many quotes,but simplifying a lot on Unix file systems often uses byte-sequences forfile names (even though applications often assume it is UTF-8) and onWindows sequences of 16-bit words (sometimes understood as UCS-2, butnot necessarily valid UTF-16). With environment variables, this came sofar that you can have a narrow and wide version which may not directlycorrespond (and I've not investigated the details yet, see documentationfor _wenviron etc). In R, particularly on Windows, we can never getcompletely rid of the UTF-16 and the wide character interface, so insome cases, we would want to "convert invalid strings" betweenencodings, which of course doesn't make sense and most likely will bethe case where invalid strings would cause errors. These things are socomplicated that, while it is not my currently preferred solution, Iunderstand the position that we ban invalid strings fully and keep thembanned in R even when UTF-8 is the only supported native encoding.


Tomas

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF

Reply via email to