>>>>> Tomas Kalibera >>>>> on Tue, 31 Jan 2023 10:53:21 +0100 writes:
> On 1/31/23 09:48, Ivan Krylov wrote: >> Can we use the "bytes" encoding for such environment variables invalid >> in the current locale? The following patch preserves CE_NATIVE for >> strings valid in the current UTF-8 or multibyte locale (or >> non-multibyte strings) but sets CE_BYTES for those that are invalid: >> >> Index: src/main/sysutils.c >> =================================================================== >> --- src/main/sysutils.c (revision 83731) ..... >> >> Here are the potential problems with this approach: >> >> * I don't know whether known_to_be_utf8 can disagree with utf8locale. >> known_to_be_utf8 was the original condition for setting CE_UTF8 on >> the string. I also need to detect non-UTF-8 multibyte locales, so >> I'm checking for utf8locale and mbcslocale. Perhaps I should be more >> careful and test for (enc == CE_UTF8) || (utf8locale && enc == >> CE_NATIVE) instead of just utf8locale. >> >> * I have verified that Sys.getenv() doesn't crash with UTF-8-invalid >> strings in the environment with this patch applied, but now >> print.Dlist does, because formatDL wants to compute the width of the >> string which has the 'bytes' encoding. If this is a good way to >> solve the problem, I can work on suggesting a fix for formatDL to >> avoid the error. > Thanks, indeed, type instability is a big problem of the approach "turn > invalid strings to bytes". It is something what is historically being > done in regular expression operations, but it is brittle and not user > friendly: writing code to be agnostic to whether we are dealing with > "bytes" or a regular string is very tedious. Pieces of type instability > come also from that ASCII strings are always flagged "native" (never > "bytes"). Last year I had to revert a big change which broke existing > code by introducing some more of this instability due to better dealing > with invalid strings in regular expressions. I've made some additions to > R-devel allowing to better deal with such instability but it is still a > pain and existing code has to be changed (and made more complicated). > So, I don't think this is the way to go. > Tomas hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy, but of course you are right that type stability is really a valid goal.... In general, what about behaving close to "old R" and replace all such strings by NA_character_ (and typically raising one warning)? This would keep the result a valid character vector, just with some NA entries. Specifically for Sys.getenv(), I still think Simon has a very valid point of "requiring" (of our design) that Sys.getenv()[["BOOM"]] {double `[[`} should be the same as Sys.getenv("BOOM") Also, as typical R user, I'd definitely want to be able to get all the valid environment variables, even if there are one or more invalid ones. ... and similarly in other cases, it may be a cheap strategy to replace invalid strings ("string" in the sense of length 1 STRSXP, i.e., in R, a "character" of length 1) by NA_character_ and keep all valid parts of the character vector in a valid encoding. - - - - - - In addition to the above cheap "replace by NA"-strategy, and at least once R is "all UTF-8", we could also consider a more expensive strategy that would try to replace invalid characters/byte-sequences by one specific valid UTF-8 character, i.e., glyph (think of a special version of "?") that we would designate as replacement for "invalid-in-current-encoding". Probably such a glyph already exists and we have seen it used in some console's output when having to print such things. ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel