Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF

Tomas Kalibera Tue, 31 Jan 2023 01:45:27 -0800

On 1/31/23 01:27, Simon Urbanek wrote:

Tomas,


I think you're not addressing the actual issue which is a clear regression in 
Sys.getenv() [because it used to work and still works for single env var, but 
not a list] and the cryptic error due to that regression (caused by changes in 
R-devel). So in either case, Sys.getenv needs fixing (i.e., this should really 
go to the bugzilla). Its behavior is currently inconsistent.

The quoted Python discussion diverges very quickly into file name discussions, but I 
think that is not relevant here - environment variables are essentially data 
"blobs" that have application-specific meaning. They just chose to drop them 
because they didn't want to make a decision.

Well, yes, dropping such variables is a hack/work-around, but I wouldnot rule that out from consideration. It would be good to have a lookhat even other language runtimes do, but I can understand dropping themfrom getenv() a work-around allowing such environment variables to existand be inherited by child processes, without breaking what happens in R(the language runtime at hand), makes some sense.

The times when we could have (almost) arbitrary bytes in blobs are over,one has to choose and separate the two. Going back to that we allowblobs in multi-byte strings without special treatment is not possible.That's a common thing, not just for R, we can't just "stop checking" asbefore.

R has also the "bytes" encoding which can be used to work withnon-strings (encoding agnostic operations on non-ASCII data), and therehave been a number of improvements that are now ready in R-devel. Intheory, of course, getenv() could allow also to return the results as"bytes" for those who insist they want to treat environment variableswith invalid strings. That would be technically possible, but could notbe made the default: people would have to opt-in for that by an argumentto the function. I am not sure it is worth it, it would not do whatHenrik is asking for, but it is possible and would be a conceptuallysound solution allowing people to use such variables (with newly writtencode for it that cannot just use the values as "normal" strings), or tosave the whole environment profile.

Some of us have spent very long time debating how to go further dealingwith invalid strings, and it will still take time to decide and reach aconsensus. It is a very hard problem with serious consequences andlimitations. In principle one possible solution is to ban invalidstrings fully, completely, don't allow them to be created. Anothertechnically valid position is to allow invalid sequences in UTF-8strings and support a subset of string operations in them, whilethrowing errors in other (probably only after/when/if R can rely on thatUTF-8 is always the native encoding). Another approach discussed waswhat some of the regular expressions do, when invalid stringsautomatically become "bytes", to some extent it is the current behaviorof some regex operations. The improvements for "bytes" encoding handlingso far in R-devel a consequence of what we have already agreed on.

So, it is not hard today to find inconsistencies in R in that somefunctions check for string validity while others don't. It is simplybecause we have not yet gone all the way to a solution to this bigproblem. getenv() is just one of the many.

In case of getenv(), indeed, it is because of how it is implemented, itwraps OS functions and the OS don't require string validity. Btw Windowshave both a single-byte and multi-byte variant of the environmentprofile which may disagree. R is not at the source of this problem, itis the operating systems which couldn't (yet?) find the decision andwhere the ambiguity remains.

In practice, so far, the problem is small, because [1] (reported byHenrik) is behavior due to an obvious design bug in other software, andluckily this is rare.

I don't have a strong opinion on this, but Sys.getenv("FOO") and 
Sys.getenv()[["FOO"]] should not yield two different results. I would argue that if we 
want to make specific checks, we should make them conditional - even if the default is to add them. 
Again, the error is due to the implementation of Sys.getenv() breaking in R-devel, not due to any 
design decision.

The key design decision (and common one) behind is that environmentvariables are strings.


Tomas

Cheers,
Simon

On Jan 31, 2023, at 1:04 PM, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:


On 1/30/23 23:01, Henrik Bengtsson wrote:

/Hello.

SUMMARY:
$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'

$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
[1] "\xff"

BACKGROUND:
I launch R through an Son of Grid Engine (SGE) scheduler, where the R
process is launched on a compute host via  'qrsh', which part of SGE.
Without going into details, 'mpirun' is also involved. Regardless, in
this process, an 'qrsh'-specific environment variable 'QRSH_COMMAND'
is set automatically.  The value of this variable comprise of a string
with \xff (ASCII 255) injected between the words.  This is by design
of SGE [1].  Here is an example of what this environment variable may
look like:

QRSH_COMMAND= 
orted\xff--hnp-topo-sig\xff2N:2S:32L3:128L2:128L1:128C:256H:x86_64\xff-mca\xffess\xff\"env\"\xff-mca\xfforte_ess_jobid\xff\"3473342464\"\xff-mca\xfforte_ess_vpid\xff1\xff-mca\xfforte_ess_num_procs\xff\"3\"\xff-mca\xfforte_hnp_uri\xff\"3473342464.0;tcp://192.168.1.13:50847\"\xff-mca\xffplm\xff\"rsh\"\xff-mca\xfforte_tag_output\xff\"1\"\xff--tree-spawn"

where each \xff is a single byte 255=0xFF=\xFF.


ISSUE:
An environment variable with embedded 0xFF bytes in its value causes
calls to Sys.getenv() to produce an error when running R in a UTF-8
locale. Here is a minimal example on Linux:

$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
Calls: Sys.getenv -> substring
In addition: Warning message:
In regexpr("=", x, fixed = TRUE) :
   input string 134 is invalid in this locale
Execution halted


WORKAROUND:
The workaround is to (1) identify any environment variables with
invalid UTF-8 symbols, and (2) prune or unset those variables before
launching R, e.g. in my SGE case, launching R using:

QRSH_COMMAND= Rscript --vanilla -e "Sys.getenv()"

avoid the problem.  Having to unset/modify environment variables
because R doesn't like them, see a bit of an ad-hoc hack to me.  Also,
if you are not aware of this problem, or not a savvy R user, it can be
quite tricky to track down the above error message, especially if
Sys.getenv() is called deep down in some package dependency.


DISCUSSION/SUGGESTION/ASK:
My suggestion would be to make Sys.getenv() robust against any type of
byte values in environment variable strings.

The error occurs in Sys.getenv() from:

         x <- .Internal(Sys.getenv(character(), ""))
         m <- regexpr("=", x, fixed = TRUE)  ## produces a warning
         n <- substring(x, 1L, m - 1L)
         v <- substring(x, m + 1L)  ## produces the error

I know too little about string encodings, so I'm not sure what the
best approach would be here, but maybe falling back to parsing strings
that are invalid in the current locale using the C locale would be
reasonable?  Maybe Sys.getenv() should always use the C locale for
this. It looks like Sys.getenv(name) does this, e.g.

$ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
[1] "\xff"

I'd appreciate any comments and suggestions. I'm happy to file a bug
report on BugZilla, if this is a bug.

This discussion comes from Python: https://bugs.python.org/issue4006
(it says Python skips such environment variables)

The problem of invalid strings in environment variables is a similar to the 
problem of invalid strings in file names. Both variables and file names are 
something people want to use as strings in programs, scripts, texts, but at the 
same time these may in theory not be valid strings. Working with (potentially) 
invalid strings (almost) transparently is much harder than with valid strings; 
even if R decided to do that, it would be hard to implement and take long and 
only work for some operations, most will still throw errors. In addition, in 
practice invalid strings are almost always due to an error, particularly so in 
file names or environment variables. Such errors are often worth catching 
(wrong encoding declaration, etc), even though perhaps not always.

In practice, this instance can only be properly fixed at the source, [1] should 
not do this. split_command() will run into problems with different software, 
not just R.

There should be a way to split the commands in ASCII (using some sort of 
quoting/escaping). Using \xFF is flawed also simply because it may be present 
in the commands, if we followed the same logic of that every byte is fine. So 
the code is buggy even regardless of multi-byte encodings.

Re difficulty to debug, I think the error message is clear and if packages 
catch and hide errors, that'd be bad design of such packages, R couldn't really 
do much about that. This needs to be fixed at [1].

Tomas

Henrik

[1] 
https://github.com/gridengine/gridengine/blob/master/source/clients/qrsh/qrsh_starter.c#L462-L466

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Sys.getenv(): Error in substring(x, m + 1L) : invalid multibyte string at '' if an environment variable contains \xFF

Reply via email to