Tomas,

I think you're not addressing the actual issue which is a clear regression in 
Sys.getenv() [because it used to work and still works for single env var, but 
not a list] and the cryptic error due to that regression (caused by changes in 
R-devel). So in either case, Sys.getenv needs fixing (i.e., this should really 
go to the bugzilla). Its behavior is currently inconsistent.

The quoted Python discussion diverges very quickly into file name discussions, 
but I think that is not relevant here - environment variables are essentially 
data "blobs" that have application-specific meaning. They just chose to drop 
them because they didn't want to make a decision.

I don't have a strong opinion on this, but Sys.getenv("FOO") and 
Sys.getenv()[["FOO"]] should not yield two different results. I would argue 
that if we want to make specific checks, we should make them conditional - even 
if the default is to add them. Again, the error is due to the implementation of 
Sys.getenv() breaking in R-devel, not due to any design decision.

Cheers,
Simon


> On Jan 31, 2023, at 1:04 PM, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
> 
> 
> On 1/30/23 23:01, Henrik Bengtsson wrote:
>> /Hello.
>> 
>> SUMMARY:
>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
>> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
>> 
>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
>> [1] "\xff"
>> 
>> BACKGROUND:
>> I launch R through an Son of Grid Engine (SGE) scheduler, where the R
>> process is launched on a compute host via  'qrsh', which part of SGE.
>> Without going into details, 'mpirun' is also involved. Regardless, in
>> this process, an 'qrsh'-specific environment variable 'QRSH_COMMAND'
>> is set automatically.  The value of this variable comprise of a string
>> with \xff (ASCII 255) injected between the words.  This is by design
>> of SGE [1].  Here is an example of what this environment variable may
>> look like:
>> 
>> QRSH_COMMAND= 
>> orted\xff--hnp-topo-sig\xff2N:2S:32L3:128L2:128L1:128C:256H:x86_64\xff-mca\xffess\xff\"env\"\xff-mca\xfforte_ess_jobid\xff\"3473342464\"\xff-mca\xfforte_ess_vpid\xff1\xff-mca\xfforte_ess_num_procs\xff\"3\"\xff-mca\xfforte_hnp_uri\xff\"3473342464.0;tcp://192.168.1.13:50847\"\xff-mca\xffplm\xff\"rsh\"\xff-mca\xfforte_tag_output\xff\"1\"\xff--tree-spawn"
>> 
>> where each \xff is a single byte 255=0xFF=\xFF.
>> 
>> 
>> ISSUE:
>> An environment variable with embedded 0xFF bytes in its value causes
>> calls to Sys.getenv() to produce an error when running R in a UTF-8
>> locale. Here is a minimal example on Linux:
>> 
>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()"
>> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>'
>> Calls: Sys.getenv -> substring
>> In addition: Warning message:
>> In regexpr("=", x, fixed = TRUE) :
>>   input string 134 is invalid in this locale
>> Execution halted
>> 
>> 
>> WORKAROUND:
>> The workaround is to (1) identify any environment variables with
>> invalid UTF-8 symbols, and (2) prune or unset those variables before
>> launching R, e.g. in my SGE case, launching R using:
>> 
>> QRSH_COMMAND= Rscript --vanilla -e "Sys.getenv()"
>> 
>> avoid the problem.  Having to unset/modify environment variables
>> because R doesn't like them, see a bit of an ad-hoc hack to me.  Also,
>> if you are not aware of this problem, or not a savvy R user, it can be
>> quite tricky to track down the above error message, especially if
>> Sys.getenv() is called deep down in some package dependency.
>> 
>> 
>> DISCUSSION/SUGGESTION/ASK:
>> My suggestion would be to make Sys.getenv() robust against any type of
>> byte values in environment variable strings.
>> 
>> The error occurs in Sys.getenv() from:
>> 
>>         x <- .Internal(Sys.getenv(character(), ""))
>>         m <- regexpr("=", x, fixed = TRUE)  ## produces a warning
>>         n <- substring(x, 1L, m - 1L)
>>         v <- substring(x, m + 1L)  ## produces the error
>> 
>> I know too little about string encodings, so I'm not sure what the
>> best approach would be here, but maybe falling back to parsing strings
>> that are invalid in the current locale using the C locale would be
>> reasonable?  Maybe Sys.getenv() should always use the C locale for
>> this. It looks like Sys.getenv(name) does this, e.g.
>> 
>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')"
>> [1] "\xff"
>> 
>> I'd appreciate any comments and suggestions. I'm happy to file a bug
>> report on BugZilla, if this is a bug.
> This discussion comes from Python: https://bugs.python.org/issue4006
> (it says Python skips such environment variables)
> 
> The problem of invalid strings in environment variables is a similar to the 
> problem of invalid strings in file names. Both variables and file names are 
> something people want to use as strings in programs, scripts, texts, but at 
> the same time these may in theory not be valid strings. Working with 
> (potentially) invalid strings (almost) transparently is much harder than with 
> valid strings; even if R decided to do that, it would be hard to implement 
> and take long and only work for some operations, most will still throw 
> errors. In addition, in practice invalid strings are almost always due to an 
> error, particularly so in file names or environment variables. Such errors 
> are often worth catching (wrong encoding declaration, etc), even though 
> perhaps not always.
> 
> In practice, this instance can only be properly fixed at the source, [1] 
> should not do this. split_command() will run into problems with different 
> software, not just R.
> 
> There should be a way to split the commands in ASCII (using some sort of 
> quoting/escaping). Using \xFF is flawed also simply because it may be present 
> in the commands, if we followed the same logic of that every byte is fine. So 
> the code is buggy even regardless of multi-byte encodings.
> 
> Re difficulty to debug, I think the error message is clear and if packages 
> catch and hide errors, that'd be bad design of such packages, R couldn't 
> really do much about that. This needs to be fixed at [1].
> 
> Tomas
> 
>> 
>> Henrik
>> 
>> [1] 
>> https://github.com/gridengine/gridengine/blob/master/source/clients/qrsh/qrsh_starter.c#L462-L466
>> 
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to