On Tue, 09 Dec 2025, Rob Browning <[email protected]> wrote:

[...]

> It may be that if Guile were to encode/decode all "system data" using
> that 'noncharacters strategy, then as (eventually) with python, many
> Guile programs, both existing and future, would "just work" without the
> authors needing to be aware of all of the related concerns/complexity
> --- and those who do care would have options.
>
> For example, you'd end up with "foo\ufddb\ufdd5" for a $'foo\xb5' on the
> command line for a UTF-8 locale, instead of just losing the information
> and receiving "foo?" via 'substitute, as you do now. And were you to
> write that argument to stdout, 'noncharacters would just automatically
> reverse it back to $'foo\xb5'.
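For comparison, Python's surrogateescape error handler does the same
kind of round trip, except that it maps each undecodable byte to a lone
surrogate rather than to noncharacters.  A minimal illustration, in
Python because that is where this handler exists; the variable names
are only for the example:

```python
# The bytes behind $'foo\xb5'; 0xb5 on its own is not valid UTF-8.
raw = b"foo\xb5"

# Lossy decoding, comparable in spirit to 'substitute: the byte is
# replaced by U+FFFD and cannot be recovered afterwards.
lossy = raw.decode("utf-8", errors="replace")

# Lossless decoding: the undecodable byte 0xb5 becomes the lone
# surrogate U+DCB5 inside the string.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into 0xb5.
restored = escaped.encode("utf-8", errors="surrogateescape")

print(ascii(lossy), ascii(escaped), restored == raw)
```

The important property is the last one: what goes back to the OS is
byte-for-byte what came from it.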
This is a concern for Unix platforms and for system programming like
GASH:

  $ LC_ALL=C mkdir $'foo\xb5'
  $ cd $'foo\xb5'
  $ gash -c pwd
  /home/old/tmp/foo?

With the advent of Guile Pre-Scheme as a system programming language, I
think this concern is real and should be addressed properly.

Regarding the surrogate-escape, I think it would be possible for a
malicious actor to somehow inject bytes that would go undetected by a
sanitization procedure.  The malicious string would then get converted
back to bytes and sent to the OS.  In the context of system
programming, especially with setuid programs, the surrogate-escape
approach seems somewhat dangerous unless it is used only to convert OS
strings given by the OS itself.

That is a problem of its own: the conversion between runtime strings
and OS strings needs to be done at every boundary to be transparent to
users.

> But even if noncharacters were plausible, it raises questions. For
> example, ports currently have a %default-port-encoding and
> %default-port-conversion-strategy which are fluids that default to the
> locale encoding and 'substitute respectively. (And some string
> functions borrow these defaults. For example scm_to_locale_string(n)
> uses the %default-port-conversion-strategy.)
>
>   - Can/should we eventually change the default port strategy from
>     'substitute to 'error (or 'noncharacters if we go that route), so
>     that you have to explicitly request a strategy that loses
>     information? (I'm still inclined to think so.)

Since this is only a concern for system programming, I would argue that
only such programs would need to change the default port strategy.

Also, I personally would not want the default to be `error' and prefer
`substitute'.  The former would break any program that prints a string
containing a non-ASCII Latin character such as `é' when run on a CI
that has `LANG=C'.

[...]

> ...and again, I *think* I'm mostly wondering about an incremental
> improvement.
> I could imagine that with sufficient resources, some might
> also want a way to work with all of the system data as bytevectors. But
> I currently view that as "nice to have" since I suspect it's a lot more
> work for something that won't be all that much more efficient for common
> cases (if we do switch to UTF-8 internally), and something that would
> require a lot more work (design and code) to be anywhere near as
> convenient. For example, we have far more support for manipulating
> paths as strings than we do for manipulating them as bytevectors.

If the concern is only about paths, then what we may want is a better
path abstraction than strings.  Opaque path objects with a set of
operations would hide all the details internally, and we could expose
accessors that return the underlying bytevector or decode it to a
string with the desired conversion strategy, including
surrogate-escape.

[...]

Thanks,
Olivier
-- 
Olivier Dion
oldiob.ca
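
To make the opaque path idea above a bit more concrete, here is a
minimal sketch, again in Python only because its error handlers were
brought up in this thread.  The class `OSPath' and its methods
`to_bytes', `to_string' and `join' are hypothetical names for
illustration, not an existing or proposed Guile API:

```python
import posixpath


class OSPath:
    """Hypothetical opaque path; the raw bytes are the source of truth."""

    def __init__(self, raw: bytes):
        self._raw = bytes(raw)

    def to_bytes(self) -> bytes:
        # Exactly what the OS gave us, and what we hand back to it.
        return self._raw

    def to_string(self, encoding: str = "utf-8",
                  errors: str = "replace") -> str:
        # The caller picks the conversion strategy explicitly; the
        # default here is lossy, "surrogateescape" is reversible.
        return self._raw.decode(encoding, errors)

    def join(self, other: "OSPath") -> "OSPath":
        # Path operations work on bytes, so no decoding ever happens
        # between reading a path and passing it back to the OS.
        return OSPath(posixpath.join(self._raw, other._raw))


p = OSPath(b"/home/old/tmp").join(OSPath(b"foo\xb5"))
print(ascii(p.to_string()))
print(ascii(p.to_string(errors="surrogateescape")))
```

With this shape, a string is only ever a view for display or logging,
so a lossy strategy can stay the default without affecting what is
eventually sent to the OS.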
