On Fri, Jan 18, 2019 at 1:48 PM Ben Coman via Pharo-dev <pharo-dev@lists.pharo.org> wrote:
>
> On Wed, 16 Jan 2019 at 18:37, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>
>> Still, one of the conclusions of previous discussions about the encoding
>> of environment variables was/is that there is no single correct solution.
>> OS's are not consistent in how the encoding is done in all (historical)
>> contexts (like sometimes,
>
>> 1 env var defines the encoding to use for others,
>
> ouch. That one point nearly made me retract my comment in the next paragraph,
> but is there much more complexity?
> or just a case of utf8<==>appSpecificEncoding rather than
> ascii<==>appSpecificEncoding ?

It's not muuuuch more complex. The problem is that the bugs that arise from mishandling such conversions can be super obscure.

> Sorry if I'm rehashing past discussion (do you have a link?), but
> considering...
> * 92% of web pages are UTF8 encoded[1] such that pragmatically UTF8 *is*
>   the standard for text
> * Strings are so pervasive in a system
> ...would there be an overall benefit to adopting UTF8 as the encoding for
> Strings,
> consistently provided across the cross-platform vm interface?
> (i.e. fixing platforms that don't comply with the standard due to their
> historical baggage)
>
> And I found it interesting that Microsoft are making some moves towards UTF8
> [2]...
> "With insider build 17035 and the April 2018 update (nominal build 17134)
> for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support"
> checkbox appeared for setting the locale code page to UTF-8. This allows
> for calling "narrow" functions, including fopen and SetWindowTextA, with
> UTF-8 strings."
>
> The approach vm-side could be similar to Section 10, "How to do text on
> Windows" [3],
> with the philosophy of "performing the [conversions] as close to API calls
> as possible,
> and never holding the [converted] data."
>
> [1] https://w3techs.com/technologies/history_overview/character_encoding/ms/y
> [2] https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
> [3] http://utf8everywhere.org/
>
>> different applications do different things, and other such nice stuff),
>> and certainly not across platforms.
>>
>> So this is really complex.
>>
>> Do we want to hide this in some obscure VM C code that very few people
>> can see, read, let alone help with ?
>>
>> The image side is perfectly capable of dealing with platform differences
>> in a clean/clear way, and at least we can then use the full power of our
>> language and our tools.
>
> Big question... Do we currently have primitives of the same name returning
> different encodings on different platforms? I presume that would be
> awkward.
> If the image is to handle encoding differences, should separate primitives be
> used? e.g. utf8GetEnv & utf16getEnv
>
> Could I get some feedback on [4] saying...
> "**The Single Most Important Fact About Encodings**
> If you completely forget everything I just explained, please remember one
> extremely important fact.
> It does not make sense to have a string without knowing what encoding it
> uses."
>
> And so... does our String nowadays require an 'encoding' instance variable
> such that this is *always* associated?
> This might remove any need for separate utf8GetEnv & utf16getEnv (if that
> was even a reasonable idea).

I think that will just overcomplicate things. Right now, all Strings in Pharo are Unicode strings. Characters are represented by their corresponding Unicode code point. If all characters in a string have code points < 256, the string is stored as a ByteString. Otherwise it is a WideString.

I think assuming a single representation for strings, and then encoding only when interacting with external apps/APIs, is MUCH simpler.
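To make the idea concrete, here is a rough sketch in Python rather than Pharo, purely as an illustration. The function names (`storage_kind`, `to_external`, `from_external`) are made up for this example and are not Pharo or VM API. It assumes strings are sequences of code points internally, and that an encoding is only chosen at the moment bytes cross to or from an external API.

```python
def storage_kind(text: str) -> str:
    """Mimic the rule described above: narrow storage when every
    code point is below 256, wide storage otherwise.
    The returned labels are just the Pharo class names, used as tags."""
    if all(ord(ch) < 256 for ch in text):
        return "ByteString"
    return "WideString"


def to_external(text: str, encoding: str = "utf-8") -> bytes:
    """Encode at the boundary, right before calling an external API.
    The encoding is a property of the call site, not of the string."""
    return text.encode(encoding)


def from_external(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode at the boundary, right after an external API returns bytes."""
    return raw.decode(encoding)


if __name__ == "__main__":
    for sample in ("cafe", "caf\u00e9", "price: 5 \u20ac"):
        # Same internal string, different external byte sequences.
        print(repr(sample), storage_kind(sample),
              to_external(sample, "utf-8"),
              to_external(sample, "utf-16-le"))
```

The point of the sketch is that the string itself never carries an encoding: the same value can be handed to a UTF-8 API or a UTF-16 API, and only the boundary functions differ.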
> cheers -ben
>
> [4] https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
>
>> > On 16 Jan 2019, at 10:59, Guillermo Polito <guillermopol...@gmail.com> wrote:
>> >
>> > Hi Nicolas,
>> >
>> > On Wed, Jan 16, 2019 at 10:25 AM Nicolas Cellier <nicolas.cellier.aka.n...@gmail.com> wrote:
>> > IMO, Windows VM (and plugins) should do the UCS2 -> UTF8 conversion
>> > because the purpose of a VM is to provide an OS independent façade.
>> > I made progress recently in this area, but we should finish the
>> > job/test/consolidate.
>> >
>> > I'm following your changes for Windows from the shadows and I think
>> > they are awesome :).
>> >
>> > If someone bypasses the VM and uses the Windows API directly through FFI,
>> > then he takes the responsibility, but uniformity doesn't hurt.
>> >
>> > So far we are using FFI for this: as you say, we first create
>> > Win32WideStrings from utf8 strings and then we use ffi calls to the *W
>> > functions.
>> > I don't think we can make it for Pharo7.0.0. The cycle to build, do
>> > some acceptance tests, and then bless a new VM as stable is far too long
>> > for our imminent release :).
>> >
>> > But this could be for a 7.1.0, and if you like I can surely give a hand
>> > on this.
>> >
>> > Guille
>>

--
Guille Polito
Research Engineer
Centre de Recherche en Informatique, Signal et Automatique de Lille
CRIStAL - UMR 9189
French National Center for Scientific Research - http://www.cnrs.fr
Web: http://guillep.github.io
Phone: +33 06 52 70 66 13