Re: [GRASS-dev] Python 3 porting and unicode
Vaclav Petras wrote:

> * There is no way around the unicode when using Python 3. Unicode is
> inherent part of the language even things such as os.environ or
> sys.stdout.write() work only with unicode. I'm not sure what exactly the
> rule is here, but it seems to be everywhere.

Python 3 has os.environb on Unix. You can use the .detach() method on text streams to get the underlying binary stream.

> * In relation to the previous point, one of the reasons why unicode is used
> is that things like text[:10] actually return 10 characters to display.

Although some of those characters may be combining characters or control codes. Unicode characters don't necessarily map 1:1 with glyphs.

> * Users of the Python API who are using Python 3 will expect unicode
> strings to work, i.e. expect run_command('g.region', flags='p') to work
> (not just run_command(b'g.region', flags=b'p')).

Even if you automatically encode unicode strings, there's no guarantee that it will work (e.g. if the string is a filename, then the encoded string must produce the correct sequence of bytes).

I can't think of any significant cases where it's likely to be necessary to pass "binary" data via arguments, although it should be trivial to simply accept data which is already a byte string.

The bigger issue is with output: the output from GRASS commands isn't guaranteed to be in the locale's encoding (if it's extracted from a file, it's going to be in whatever encoding the file uses). Returning bytes allows the user to deal with this; automatically decoding the data will either raise an exception or return mojibake if the encoding doesn't match.

> * It seems hard to predict when we will know the right encoding of the
> text.

Which is why byte-oriented interfaces still exist and still matter, and will do so for the foreseeable future. Python's solution is to accelerate standardisation on Unicode by making the alternatives as painful as possible. Yet legacy encodings remain widespread.

-- 
Glynn Clements
___
grass-dev mailing list
grass-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/grass-dev
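Three of the points in the reply above (combining characters, .detach(), and decoding output whose encoding does not match) can be tried out in a few lines. This is only a minimal sketch: the sample strings and variable names are made up for illustration, and an in-memory stream stands in for sys.stdout so that the example does not depend on the terminal.

```python
import io
import unicodedata

# Combining characters: slicing counts code points, not displayed glyphs.
decomposed = unicodedata.normalize("NFD", "\u00e9")  # 'e' + U+0301 (combining acute)
first = decomposed[:1]  # one code point: a bare 'e', the accent is dropped

# A text stream wraps a binary stream; detach() hands the binary one back.
# newline="\n" disables newline translation so the bytes are platform independent.
text_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", newline="\n")
text_stream.write("abc\n")
text_stream.flush()
binary_stream = text_stream.detach()
binary_stream.write(b"\xff\xfe raw bytes\n")  # written as-is, no encoding step
raw = binary_stream.getvalue()

# Output in an encoding other than the expected one (illustrative data).
latin1_output = "Z\u00fcrich".encode("latin-1")
try:
    latin1_output.decode("utf-8")  # strict decoding of mismatched bytes
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The opposite mismatch does not fail, it silently produces mojibake.
mojibake = "Z\u00fcrich".encode("utf-8").decode("latin-1")
```

Returning the bytes unchanged, as suggested above, leaves the choice between these two failure modes to the caller, who may know the real encoding.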
Re: [GRASS-dev] Python 3 porting and unicode
Hi Vaclav,

I think that it would make much more sense to have the GRASS Python libraries use unicode, and to add an interface that manages the translation to/from bytes when dealing with C code. Python programmers using the GRASS libraries will expect unicode strings.

Laurent

2017-11-26 21:21 GMT-06:00 Vaclav Petras:
> Dear all,
>
> after looking at different Python 2 to 3 porting issues, doing r71849, and
> reading #3392, I understand the following:
> [...]
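The translation layer proposed in this reply could look roughly like the sketch below. The helper names to_c_string and from_c_string and the UTF-8 default are assumptions made for illustration; they are not part of the GRASS Python library, and the right encoding in practice would depend on the data.

```python
import ctypes

ENCODING = "utf-8"  # assumption: the real encoding depends on locale and data


def to_c_string(text, encoding=ENCODING):
    """Hypothetical helper: turn str (or bytes) into a ctypes char pointer."""
    if isinstance(text, bytes):
        # Already bytes: pass through unchanged.
        return ctypes.c_char_p(text)
    return ctypes.c_char_p(text.encode(encoding))


def from_c_string(c_value, encoding=ENCODING):
    """Hypothetical helper: turn a ctypes char pointer back into str."""
    return c_value.value.decode(encoding)


# Unicode on the Python side, bytes at the C boundary.
name = "r\u00e9gion"
c_name = to_c_string(name)
round_tripped = from_c_string(c_name)
```

With helpers of this kind the rest of the library would handle only str, and the encode/decode step would sit in one place next to the ctypes calls.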
[GRASS-dev] Python 3 porting and unicode
Dear all,

after looking at different Python 2 to 3 porting issues, doing r71849, and reading #3392, I understand the following:

* Several solutions for porting exist. The most recent one is the python-future project, but only from __future__ import ... is part of the standard library and thus guaranteed with recent Python 2.7. (We can discuss concrete steps separately.)

* However, the most challenging part of the porting will be unicode.

* There is no way around unicode when using Python 3. Unicode is an inherent part of the language; even things such as os.environ or sys.stdout.write() work only with unicode. I'm not sure what exactly the rule is here, but it seems to be everywhere.

* I haven't seen any simple fix which would limit the changes in the code in the way in which, e.g., the print statement can be fixed.

* The GUI will always use unicode because that's how the libraries and interfaces are set.

* In relation to the previous point, one of the reasons why unicode is used is that things like text[:10] actually return 10 characters to display.

* The C library will not use unicode for now.

* Users of the Python API who are using Python 3 will expect unicode strings to work, i.e. expect run_command('g.region', flags='p') to work (not just run_command(b'g.region', flags=b'p')).

* If the Python libraries are unicode, there will need to be an interface to work with ctypes, which would add to the existing code for transferring from the C world to Python and back.

* If the Python libraries are bytes, there will need to be an interface to work with the GUI in unicode as well as with users of the API who will expect unicode to work. In other words, internally it would use bytes, but the interface must be both bytes (for modules and internal use) and unicode (for GUI and users).

* Having a unicode-based library means encoding and decoding on any "external" interface such as file reading or ctypes.

* Having a bytes-based library means encoding and decoding on any Python 3 interface such as os.environ, and additionally rewriting all string literals ("abc") to bytes (b"abc").

* It seems hard to predict when we will know the right encoding of the text. It seems that we will need it with any solution, since garbage-in, garbage-out stops working when you need to use some system interface function in Python 3 which requires unicode. Although e.g. sys.stdout.write() has a (less generic) sys.stdout.buffer.write() alternative, os.environb does not work on MS Windows.

An example fix in r71849 is done using a (custom) decode function which creates unicode (the standard string in Python 3) when file content is read. The alternative to this change would be changing all the strings in the file to bytes (b'abc' as opposed to 'abc').

Please comment or link other related discussions.

Thanks,
Vaclav

python3 -c "import os; os.environ[b'abc'] = b'def'"
python3 -c "import os; os.environb[b'abc'] = b'def'"
python3 -c "import sys; sys.stdout.write(b'abc\n')"
python3 -c "import sys; sys.stdout.buffer.write(b'abc\n')"
python3 -c "import os; print(type(os.name))"

https://trac.osgeo.org/grass/changeset/71849
https://trac.osgeo.org/grass/ticket/2708
https://trac.osgeo.org/grass/ticket/3392
https://trac.osgeo.org/grass/query?status=!closed=~python3
https://trac.osgeo.org/grass/query?status=!closed=~encoding
https://trac.osgeo.org/grass/query?status=!closed=~unicode
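The os.environ behaviour and the decode-on-read approach described in this message can be sketched as follows. The decode and encode helpers here are hypothetical and are not the code from r71849; the UTF-8 default is an assumption, and the abc/def variable is the one from the one-liners above.

```python
import os


def decode(data, encoding="utf-8"):
    """Hypothetical helper: return str, decoding only if given bytes."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def encode(text, encoding="utf-8"):
    """Hypothetical helper: return bytes, encoding only if given str."""
    if isinstance(text, str):
        return text.encode(encoding)
    return text


# os.environ accepts only str keys and values in Python 3.
try:
    os.environ[b"abc"] = b"def"
    environ_accepts_bytes = True
except TypeError:
    environ_accepts_bytes = False

# os.environb is the bytes view; it exists only where the platform supports it.
if os.supports_bytes_environ:
    os.environb[b"abc"] = b"def"
    env_value = os.environ["abc"]  # the two mappings are kept in sync
else:
    env_value = None
```

Because both helpers pass through values that are already of the target type, an interface built on them could accept both 'g.region' and b'g.region'.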