Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF

peter dalgaard Tue, 31 Jan 2023 05:37:36 -0800

> On 31 Jan 2023, at 12:51 , Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
> 
> 
> On 1/31/23 11:50, Martin Maechler wrote:
<snippage>
>> hmm.., that's a pity; I had hoped it was a pragmatic and valid strategy,
>> but of course you are right that type stability is really a
>> valid goal....
>> 
>> In general, what about behaving close to "old R" and replace all such
>> strings by  NA_character_  (and typically raising one warning)?
>> This would keep the result a valid character vector, just with some NA 
>> entries.
>> 
>> Specifically for  Sys.getenv(),  I still think Simon has a very
>> valid point of "requiring" (of our design) that
>> Sys.getenv()[["BOOM"]]  {double `[[`} should be the same as
>> Sys.getenv("BOOM")
>> 
>> Also, as typical R user, I'd definitely want to be able to get all the valid
>> environment variables, even if there are one or more invalid
>> ones. ... and similarly in other cases, it may be a cheap
>> strategy to replace invalid strings ("string" in the sense of
>> length 1 STRSXP, i.e., in R, a "character" of length 1) by
>> NA_character_  and keep all valid parts of the character vector
>> in a valid encoding.
> In case of specifically getenv(), yes, we could return NA for variables 
> containing invalid strings, both when obtaining a value for a single variable 
> and for multiple, partially matching undocumented and unintentional behavior 
> R had before 4.1, and making getenv(var) and getenv()[[var]] consistent even 
> with invalid strings.  Once we decide on how to deal with invalid strings in 
> general, we can change this again accordingly, breaking code for people who 
> depend on these things (but so far I only know about this one case). Perhaps 
> this would be a logical alternative to the Python approach that would be more 
> suitable for R (given we have NAs and given that we happened to have that 
> somewhat similar alternative before). Conceptually it is about the same thing 
> as omitting the variable in Python: R users would not be able to use such 
> variables, but if they don't touch them, they could be inherited to child 
> processes, etc.
<more snippage>

Hum, I'm out of my depth here, but offhand I would be wary about approaches 
that lead to loss of information. Presumably someone will sooner or later 
actually want to deal with the content of an environment variable with invalid 
bytes inside. I.e. it would be preferable to keep the content and mark the 
encoding as something not-multibyte.

In fact this is almost what happens (for me...) if I just add Encoding(x) <- 
"bytes" for the return value of .Internal(Sys.getenv(character(), "")):

> Sys.getenv()[["BOOM"]]
[1] "\\xff"
> Encoding(Sys.getenv())
 [1] "unknown" "unknown" "bytes"   "unknown" "unknown" "unknown" "unknown"
 [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
...

but I suppose that breaks if I have environment variables that actually _are_ 
UTF-8, because only plain ASCII stays "unknown" and everything else gets 
marked "bytes"? And nchar(Sys.getenv()) also does not work.
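
A minimal sketch of a more selective variant, which marks only the invalid
entries as "bytes" and leaves valid UTF-8 alone. It assumes a UTF-8 session
and uses validUTF8() as the validity test (validEnc() would be the
locale-aware alternative); the vector `vals` is a made-up stand-in for the
environment values, not what Sys.getenv() does internally:

```r
# Hypothetical stand-in for environment values: plain ASCII, valid UTF-8,
# and a lone 0xff byte, which is not valid UTF-8.
vals <- c(ok = "plain", utf8 = "caf\u00e9", bad = rawToChar(as.raw(0xff)))

# Flag only the entries whose bytes are not valid UTF-8.
bad <- !validUTF8(vals)

# Mark just those entries as "bytes"; the content is kept, not replaced by NA.
Encoding(vals[bad]) <- "bytes"

Encoding(vals)

# Byte counts remain available for "bytes" strings.
nchar(vals[["bad"]], type = "bytes")
```

This keeps the information in the invalid entry while the valid UTF-8 entry
retains its encoding mark.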

(And of course I agree that the QRSH thing is Just Wrong; people using 0xff as 
a separator between utf8 strings deserve the same fate as those who used comma 
separation between numbers with decimal commas.)

-pd

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk  Priv: pda...@gmail.com

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel