Ah, sorry, this is my bad - I was trying on an older version of R and it seemed to work, but it was apparently my fault for not setting the environment variable correctly in that test, so I take the regression comment back (I had the recent encoding changes in mind so). I still think it may be worth thinking about what we want it to do - the inconsistency is still true:
> Sys.getenv("BOOM") [1] "\xff" > Sys.getenv()[["BOOM"]] Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' In addition: Warning message: In regexpr("=", x, fixed = TRUE) : input string 2 is invalid in this locale so the rest of my comments stand ;). Thanks, Simon > On Jan 31, 2023, at 2:11 PM, Henrik Bengtsson <henrik.bengts...@gmail.com> > wrote: > > Thanks for the quick replies. > > For me, the main issue is that the value of an environment variable > can causes errors in R. Say that there's an environment variable that > is used as a raw data blob (as Simon mentions) and where that data is > flavor of the day. Then, this would affect R in a random, > non-predictable way. Unless it can be argued that environment > variables must not contain random bytes, I argue that R needs to be > robust against this. If not, we're effectively asking everyone to use > tryCatch() whenever calling Sys.getenv(). Even so, they wouldn't not > be able to query *all* environment variables, or even know which they > are, because Sys.getenv() gives an error. So, I think the only > reasonable thing to do is to make Sys.getenv() robust against this. > > One alternative is to have Sys.getenv() (1) detect for which > environment variables this happens, (2) produce an informative warning > that R cannot represent the value correctly, and (3) either drop it, > or return an adjusted value (e.g. NA_character_). Something like: > >> envs <- Sys.getenv() > Warning message: > Environment variable 'BOOM' was dropped, because its > value cannot be represented as a string in R > > As you see from my examples below, R used to drop such environment > variables in R (< 4.1.0). > > Simon, your comment saying it used to work, maybe me look back at > older versions of R. Although I didn't say so, I used R 4.2.2 in my > original report. It looks like it was in R 4.1.0 that this started to > fail. I get: > > # R 2.15.0 - R 3.1.0 > $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']" > <NA> > NA > Warning message: > In strsplit(.Internal(Sys.getenv(character(), "")), "=", fixed = TRUE) : > input string 8 is invalid in this locale > > # R 3.2.0 - R 4.0.4 > $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']" > NA NA > Warning message: > In strsplit(.Internal(Sys.getenv(character(), "")), "=", fixed = TRUE) : > input string 137 is invalid in this locale > > # R 4.1.0 - ... > $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()['BOOM']" > Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' > Calls: Sys.getenv -> substring > In addition: Warning message: > In regexpr("=", x, fixed = TRUE) : > input string 137 is invalid in this locale > Execution halted > > /Henrik > > On Mon, Jan 30, 2023 at 4:27 PM Simon Urbanek > <simon.urba...@r-project.org> wrote: >> >> Tomas, >> >> I think you're not addressing the actual issue which is a clear regression >> in Sys.getenv() [because it used to work and still works for single env var, >> but not a list] and the cryptic error due to that regression (caused by >> changes in R-devel). So in either case, Sys.getenv needs fixing (i.e., this >> should really go to the bugzilla). Its behavior is currently inconsistent. >> >> The quoted Python discussion diverges very quickly into file name >> discussions, but I think that is not relevant here - environment variables >> are essentially data "blobs" that have application-specific meaning. They >> just chose to drop them because they didn't want to make a decision. >> >> I don't have a strong opinion on this, but Sys.getenv("FOO") and >> Sys.getenv()[["FOO"]] should not yield two different results. I would argue >> that if we want to make specific checks, we should make them conditional - >> even if the default is to add them. Again, the error is due to the >> implementation of Sys.getenv() breaking in R-devel, not due to any design >> decision. >> >> Cheers, >> Simon >> >> >>> On Jan 31, 2023, at 1:04 PM, Tomas Kalibera <tomas.kalib...@gmail.com> >>> wrote: >>> >>> >>> On 1/30/23 23:01, Henrik Bengtsson wrote: >>>> /Hello. >>>> >>>> SUMMARY: >>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()" >>>> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' >>>> >>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')" >>>> [1] "\xff" >>>> >>>> BACKGROUND: >>>> I launch R through an Son of Grid Engine (SGE) scheduler, where the R >>>> process is launched on a compute host via 'qrsh', which part of SGE. >>>> Without going into details, 'mpirun' is also involved. Regardless, in >>>> this process, an 'qrsh'-specific environment variable 'QRSH_COMMAND' >>>> is set automatically. The value of this variable comprise of a string >>>> with \xff (ASCII 255) injected between the words. This is by design >>>> of SGE [1]. Here is an example of what this environment variable may >>>> look like: >>>> >>>> QRSH_COMMAND= >>>> orted\xff--hnp-topo-sig\xff2N:2S:32L3:128L2:128L1:128C:256H:x86_64\xff-mca\xffess\xff\"env\"\xff-mca\xfforte_ess_jobid\xff\"3473342464\"\xff-mca\xfforte_ess_vpid\xff1\xff-mca\xfforte_ess_num_procs\xff\"3\"\xff-mca\xfforte_hnp_uri\xff\"3473342464.0;tcp://192.168.1.13:50847\"\xff-mca\xffplm\xff\"rsh\"\xff-mca\xfforte_tag_output\xff\"1\"\xff--tree-spawn" >>>> >>>> where each \xff is a single byte 255=0xFF=\xFF. >>>> >>>> >>>> ISSUE: >>>> An environment variable with embedded 0xFF bytes in its value causes >>>> calls to Sys.getenv() to produce an error when running R in a UTF-8 >>>> locale. Here is a minimal example on Linux: >>>> >>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv()" >>>> Error in substring(x, m + 1L) : invalid multibyte string at '<ff>' >>>> Calls: Sys.getenv -> substring >>>> In addition: Warning message: >>>> In regexpr("=", x, fixed = TRUE) : >>>> input string 134 is invalid in this locale >>>> Execution halted >>>> >>>> >>>> WORKAROUND: >>>> The workaround is to (1) identify any environment variables with >>>> invalid UTF-8 symbols, and (2) prune or unset those variables before >>>> launching R, e.g. in my SGE case, launching R using: >>>> >>>> QRSH_COMMAND= Rscript --vanilla -e "Sys.getenv()" >>>> >>>> avoid the problem. Having to unset/modify environment variables >>>> because R doesn't like them, see a bit of an ad-hoc hack to me. Also, >>>> if you are not aware of this problem, or not a savvy R user, it can be >>>> quite tricky to track down the above error message, especially if >>>> Sys.getenv() is called deep down in some package dependency. >>>> >>>> >>>> DISCUSSION/SUGGESTION/ASK: >>>> My suggestion would be to make Sys.getenv() robust against any type of >>>> byte values in environment variable strings. >>>> >>>> The error occurs in Sys.getenv() from: >>>> >>>> x <- .Internal(Sys.getenv(character(), "")) >>>> m <- regexpr("=", x, fixed = TRUE) ## produces a warning >>>> n <- substring(x, 1L, m - 1L) >>>> v <- substring(x, m + 1L) ## produces the error >>>> >>>> I know too little about string encodings, so I'm not sure what the >>>> best approach would be here, but maybe falling back to parsing strings >>>> that are invalid in the current locale using the C locale would be >>>> reasonable? Maybe Sys.getenv() should always use the C locale for >>>> this. It looks like Sys.getenv(name) does this, e.g. >>>> >>>> $ BOOM=$'\xFF' LC_ALL=en_US.UTF-8 Rscript --vanilla -e "Sys.getenv('BOOM')" >>>> [1] "\xff" >>>> >>>> I'd appreciate any comments and suggestions. I'm happy to file a bug >>>> report on BugZilla, if this is a bug. >>> This discussion comes from Python: https://bugs.python.org/issue4006 >>> (it says Python skips such environment variables) >>> >>> The problem of invalid strings in environment variables is a similar to the >>> problem of invalid strings in file names. Both variables and file names are >>> something people want to use as strings in programs, scripts, texts, but at >>> the same time these may in theory not be valid strings. Working with >>> (potentially) invalid strings (almost) transparently is much harder than >>> with valid strings; even if R decided to do that, it would be hard to >>> implement and take long and only work for some operations, most will still >>> throw errors. In addition, in practice invalid strings are almost always >>> due to an error, particularly so in file names or environment variables. >>> Such errors are often worth catching (wrong encoding declaration, etc), >>> even though perhaps not always. >>> >>> In practice, this instance can only be properly fixed at the source, [1] >>> should not do this. split_command() will run into problems with different >>> software, not just R. >>> >>> There should be a way to split the commands in ASCII (using some sort of >>> quoting/escaping). Using \xFF is flawed also simply because it may be >>> present in the commands, if we followed the same logic of that every byte >>> is fine. So the code is buggy even regardless of multi-byte encodings. >>> >>> Re difficulty to debug, I think the error message is clear and if packages >>> catch and hide errors, that'd be bad design of such packages, R couldn't >>> really do much about that. This needs to be fixed at [1]. >>> >>> Tomas >>> >>>> >>>> Henrik >>>> >>>> [1] >>>> https://github.com/gridengine/gridengine/blob/master/source/clients/qrsh/qrsh_starter.c#L462-L466 >>>> >>>> ______________________________________________ >>>> R-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >>> ______________________________________________ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >> > ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel