Re: [GRASS-dev] Moving GRASS Python parts to Unicode
Maris Nartiss wrote:

> as you might already have noticed, there is a constant stream of
> issues containing keywords "encoding" or more often
> "UnicodeDecodeError". The main reason behind this is Python 2.x two
> types of text strings - byte sequence (one you get with str()) and
> Unicode (unicode()). Python 3.x will have only one - Unicode (byte
> sequence is not a string any more) thus fixing this frustrating source
> of errors.

Both versions have both types of string. In 2.x, str() and "plain" string literals create byte strings, while unicode() and u"..." create unicode strings. In 3.x, str() and plain string literals create unicode strings, while bytes() and b"..." create byte strings.

The biggest differences between the two are:

a) 2.x allows implicit conversions. If you pass a byte string where a unicode string is expected (or vice versa), the string is implicitly converted using the default encoding (which can't be set by a script). 3.x doesn't do this; you get an exception.

b) 3.x tries quite hard to maintain the fiction that everything is unicode. E.g. sys.argv contains unicode strings, os.environ uses unicode strings for both keys and values, sys.stdin/stdout/stderr are text streams which return Unicode data.

> Moving GRASS Python code to use Unicode internally will make it closer
> to Python 3 ready and solve largest part of errors caused by implicit
> conversion from encoded text strings to Unicode text strings.

I don't particularly care what happens with wxGUI, and using unicode consistently would make sense there, as wx itself uses Unicode.

But if you're planning on doing this to grass.script, I'm strongly opposed. It achieves nothing beyond making what should be wxGUI's problem into everyone else's problem.

Pretending that everything is unicode only works so long as the rest of the world makes sure not to dispel the illusion. Otherwise, it fails hard. Something as simple as e.g. copying stdin to stdout fails just because the data isn't in the assumed encoding.

Bear in mind that the C portion of GRASS (i.e. most of it) doesn't pay any attention to encodings unless it has to. It just passes bytes around. It doesn't care whether the bytes are in any particular encoding, and certainly won't attempt to ensure that data written to stdout or to files is in any particular encoding.

--
Glynn Clements
___
grass-dev mailing list
grass-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-dev
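To illustrate the point about copying stdin to stdout, here is a minimal Python 3 sketch. The helper name copy_bytes and the sample data are made up for this example, and it uses in-memory binary streams as stand-ins for the real standard streams. It only shows that a byte-level copy does not care about the encoding, while decoding the same data as UTF-8 fails, and that Python 3 refuses to mix str and bytes implicitly.

```python
import io
import shutil


def copy_bytes(src, dst):
    """Copy a binary stream to another one without decoding anything."""
    shutil.copyfileobj(src, dst)


# "café" encoded as Latin-1: the byte 0xE9 on its own is not valid UTF-8.
data = b"caf\xe9\n"

# A byte-level copy works regardless of the encoding of the data.
out = io.BytesIO()
copy_bytes(io.BytesIO(data), out)
copied = out.getvalue()

# Treating the same data as UTF-8 text fails as soon as it is decoded.
try:
    data.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Python 3 does not convert implicitly between str and bytes.
try:
    "abc" + b"def"
    mixed_failed = False
except TypeError:
    mixed_failed = True
```

With the real standard streams, the byte-level equivalent would be copy_bytes(sys.stdin.buffer, sys.stdout.buffer), since sys.stdin and sys.stdout themselves are text streams in Python 3.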
Re: [GRASS-dev] Moving GRASS Python parts to Unicode
2016-02-10 17:03 GMT+02:00 Anna Petrášová:
> On Wed, Feb 10, 2016 at 9:42 AM, Moritz Lennert wrote:
>>
>> Hi Maris,
>>
>> I would be very happy if we could find a structural solution to this which
>> would avoid having to deal with so many individual errors all the time.

That is my proposal: get it right, plus a policy to enforce it, to avoid breakdown in the future.

>> I am no expert on this question, and thus do not have a clear opinion on
>> your proposal, except for the fact that I'm very happy that it exists, but
>> here are my intuitive ideas & questions on your topics:

Neither am I. I just got fed up with UnicodeDecodeError.

> I don't have a clear opinion either but I hoped Glynn could state his
> opinion here, because I understood he has a different view on some of these
> things. AFAIR, one of the problems is possibly different needs of Python
> scripting library vs. GUI.
>
> Anna

Anna, there should be no other "special" way of treating some parts of Python code. If it is Python, it should follow Python idioms. That's the whole point of using Python in the first place - to provide Pythonic access to the power of GRASS. I do not see the Python community moving away from Unicode strings to raw byte strings for text in any near future, thus either we adopt the Pythonic approach or we continue to fight an uphill battle with Python. So far we are not doing too well with it.

>>> Topic to discuss:
>>> 1) Implementation plan:
>>> a) should it be done before 7.1?
>>
>> I think the sooner, the better, so 7.1 should be our latest milestone
>> (7.0.x should be in 'bugfix only mode).

Depends on how far off 7.1 is. I would prefer to have GRASS releases more often; then it should go to 7.2.

>>> b) should separate bugs be opened for parts of migration?
>>
>> To what point can different issues be delimited into +/- autonomous issues
>> ?

Good question.

>>> c) how big / long breakage is acceptable?
>>
>> How complete would breakage be: for all encodings, or would LANG=C always
>> work ?

Only partially. There are no UnicodeEncodingErrors for LANG=C, but there will be UnicodeUnequalError instead when comparing a Unicode string to a byte string.

>> Is this something which could be done for most part in a concentrated
>> manner during a code sprint (e.g. FOSS4G 2016) ?

I am not so familiar with the whole codebase.

>>> 2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
>>> pushing the encode/decode "boundary" further. Upside - most of
>>> existing data is UTF-8 ready (parts supporting only ASCII) [4].
>>
>> What do you mean with "text in GRASS location" ? How about files on the
>> filesystem that some users might want to access via other tools ? Shouldn't
>> they be in the system-wide encoding ?

I meant any text strings (raster categories, metadata entries, etc.). A system-wide encoding makes a GRASS location non-portable: I can not just copy it to another system and expect it to work. UTF-8 would be a natural choice, as it is backwards compatible with ASCII (existing data does not need to be changed) and at the same time would allow us to accept any characters in the future. Besides, it is used by 86% of the Web [1]. If we introduce such a policy, the same principle would apply - decode early, encode late.

On the bright side, legacy systems are dying out, MacOS uses UTF-8 for all locales by default, and Linux has nice UTF-8 support (my guess is that it is the most popular encoding after plain ASCII). The current situation, where data is in an unknown encoding, is the worst. Either we adopt this approach, or we start to store metadata on the encoding in use. I assume anyone who has played the game "guess the encoding of the Shapefile" will agree on the downsides of the latter approach. Anyway, this is a discussion about GRASS 8.

>> Thank you very much for bringing up this discussion in such a structured
>> manner. I hope that others will show some interest in the matter...
>>
>> Moritz

I hope so.

Thank you,
Māris.

1. http://w3techs.com/technologies/overview/character_encoding/all
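The claim that UTF-8 is backwards compatible with ASCII, and the pain of guessing an unknown encoding, can be shown with a small Python 3 sketch. The sample strings are arbitrary and chosen only for illustration.

```python
# Pure ASCII text produces identical bytes in ASCII and in UTF-8,
# so existing ASCII-only data is already valid UTF-8.
ascii_text = "elevation"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII text cannot be represented in ASCII at all,
# while UTF-8 uses more than one byte for such characters.
name = "M\u0101ris"  # "Māris", written with an escape to keep this source ASCII
utf8_bytes = name.encode("utf-8")
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with a wrong guess (here Latin-1) does not raise an error,
# it silently produces different text.
garbled = utf8_bytes.decode("latin-1")
```

The last step is the "guess the encoding" problem in miniature: nothing fails, but the text is no longer the same, which is why storing or fixing the encoding is safer than guessing it.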
Re: [GRASS-dev] Moving GRASS Python parts to Unicode
On Sat, Feb 13, 2016 at 11:30 AM, Maris Nartiss wrote:
...
> Depends on how far is 7.1. I would prefer to have GRASS releases more
> often, then it should go to 7.2.

Agreed. We could change this in trunk (current 7.1.svn) and then release it as stable 7.2.

Markus
Re: [GRASS-dev] Moving GRASS Python parts to Unicode
Hi,

2016-02-13 12:13 GMT+01:00 Markus Neteler:
> Agreed. We could change this in trunk (current 7.1.svn) and then
> release it as stable 7.2.

I am confused. I thought that the next stable release would be 7.1 and not 7.2. In other words, some months (let's say two) before the planned release of 7.1 we create releasebranch_7_1 from trunk, and trunk becomes 7.2.svn. Or do we want to create releasebranch_7_2, with trunk turning into 7.3svn (so that version 7.1.x will never be released)?

My vote would be for the first option - release 7.1 as stable.

Ma

--
Martin Landa
http://geo.fsv.cvut.cz/gwiki/Landa
http://gismentors.cz/mentors/landa
Re: [GRASS-dev] Moving GRASS Python parts to Unicode
On Wed, Feb 10, 2016 at 9:42 AM, Moritz Lennert <mlenn...@club.worldonline.be> wrote:

> Hi Maris,
>
> On 07/02/16 11:56, Maris Nartiss wrote:
>
>> Hello devs,
>> as you might already have noticed, there is a constant stream of
>> issues containing keywords "encoding" or more often
>> "UnicodeDecodeError". The main reason behind this is Python 2.x two
>> types of text strings - byte sequence (one you get with str()) and
>> Unicode (unicode()). Python 3.x will have only one - Unicode (byte
>> sequence is not a string any more) thus fixing this frustrating source
>> of errors.
>> Moving GRASS Python code to use Unicode internally will make it closer
>> to Python 3 ready and solve largest part of errors caused by implicit
>> conversion from encoded text strings to Unicode text strings.
>
> I would be very happy if we could find a structural solution to this which
> would avoid having to deal with so many individual errors all the time.
>
>> The proposal is to make GRASS GIS Python code compliant with Unicode
>> best practice [1] following principle "decode early, encode late".
>> Things to change:
>> 1) Any text string entering Python part of code should be decoded at
>> its entry point and encoded back to byte sequence at its exit point.
>> It also applies to all calls to GRASS modules passing around text;
>> 2) Replace all text strings with Unicode literals (u'text'). No
>> exceptions. Note - "text strings" - thus byte sequences should not be
>> touched;
>> 3) Ensure text file reading / writing is done via codecs.open;
>> 4) Pass only Unicode to Python file handling calls (this is important
>> for running on MS-Windows);
>> 5) Use Unicode in tests to ensure correctness of code;
>> 6) Introduce information on Unicode usage into Python submitting
>> guidelines [2],[3].
>>
>> Things to change outside of Python code:
>> 1) Store attribute table encoding information along with connection
>> parameters;
>> 2) Ensure storage of correct encoding information on data import and
>> correct use on export (especially painful for ESRI Shapefiles);
>> 3) Ensure correct encoding information in headers of all PO and XML files.
>>
>> Expected problems:
>> 1) When moving to Python 3, all explicit Unicode literal definitions
>> will need to be removed (u'text' -> 'text');
>> 2) Introduction of "decode early" principle will break all of the
>> band-aids currently in place - a major breakage of code for a short
>> time is expected;
>> 3) Guessing correct encoding can be a problem. One of solutions could
>> be checking early for correctness of system configuration and refusing
>> to operate on improperly configured systems. Fatal error is better
>> than silent data corruption (as it is happening at the moment for
>> certain scenarios).
>
> I am no expert on this question, and thus do not have a clear opinion on
> your proposal, except for the fact that I'm very happy that it exists, but
> here are my intuitive ideas & questions on your topics:

I don't have a clear opinion either, but I hoped Glynn could state his opinion here, because I understood he has a different view on some of these things. AFAIR, one of the problems is the possibly different needs of the Python scripting library vs. the GUI.

Anna

>> Topic to discuss:
>> 1) Implementation plan:
>> a) should it be done before 7.1?
>
> I think the sooner, the better, so 7.1 should be our latest milestone
> (7.0.x should be in 'bugfix only mode).
>
>> b) should separate bugs be opened for parts of migration?
>
> To what point can different issues be delimited into +/- autonomous issues
> ?
>
>> c) how big / long breakage is acceptable?
>
> How complete would breakage be: for all encodings, or would LANG=C always
> work ?
>
> Is this something which could be done for most part in a concentrated
> manner during a code sprint (e.g. FOSS4G 2016) ?
>
>> 2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
>> pushing the encode/decode "boundary" further. Upside - most of
>> existing data is UTF-8 ready (parts supporting only ASCII) [4].
>
> What do you mean with "text in GRASS location" ? How about files on the
> filesystem that some users might want to access via other tools ? Shouldn't
> they be in the system-wide encoding ?
>
> Thank you very much for bringing up this discussion in such a structured
> manner. I hope that others will show some interest in the matter...
>
> Moritz
Re: [GRASS-dev] Moving GRASS Python parts to Unicode
Hi Maris,

On 07/02/16 11:56, Maris Nartiss wrote:
> Hello devs,
> as you might already have noticed, there is a constant stream of
> issues containing keywords "encoding" or more often
> "UnicodeDecodeError". The main reason behind this is Python 2.x two
> types of text strings - byte sequence (one you get with str()) and
> Unicode (unicode()). Python 3.x will have only one - Unicode (byte
> sequence is not a string any more) thus fixing this frustrating source
> of errors.
> Moving GRASS Python code to use Unicode internally will make it closer
> to Python 3 ready and solve largest part of errors caused by implicit
> conversion from encoded text strings to Unicode text strings.

I would be very happy if we could find a structural solution to this which would avoid having to deal with so many individual errors all the time.

> The proposal is to make GRASS GIS Python code compliant with Unicode
> best practice [1] following principle "decode early, encode late".
> Things to change:
> 1) Any text string entering Python part of code should be decoded at
> its entry point and encoded back to byte sequence at its exit point.
> It also applies to all calls to GRASS modules passing around text;
> 2) Replace all text strings with Unicode literals (u'text'). No
> exceptions. Note - "text strings" - thus byte sequences should not be
> touched;
> 3) Ensure text file reading / writing is done via codecs.open;
> 4) Pass only Unicode to Python file handling calls (this is important
> for running on MS-Windows);
> 5) Use Unicode in tests to ensure correctness of code;
> 6) Introduce information on Unicode usage into Python submitting
> guidelines [2],[3].
>
> Things to change outside of Python code:
> 1) Store attribute table encoding information along with connection
> parameters;
> 2) Ensure storage of correct encoding information on data import and
> correct use on export (especially painful for ESRI Shapefiles);
> 3) Ensure correct encoding information in headers of all PO and XML files.
>
> Expected problems:
> 1) When moving to Python 3, all explicit Unicode literal definitions
> will need to be removed (u'text' -> 'text');
> 2) Introduction of "decode early" principle will break all of the
> band-aids currently in place - a major breakage of code for a short
> time is expected;
> 3) Guessing correct encoding can be a problem. One of solutions could
> be checking early for correctness of system configuration and refusing
> to operate on improperly configured systems. Fatal error is better
> than silent data corruption (as it is happening at the moment for
> certain scenarios).

I am no expert on this question, and thus do not have a clear opinion on your proposal, except for the fact that I'm very happy that it exists, but here are my intuitive ideas & questions on your topics:

> Topic to discuss:
> 1) Implementation plan:
> a) should it be done before 7.1?

I think the sooner, the better, so 7.1 should be our latest milestone (7.0.x should be in 'bugfix only' mode).

> b) should separate bugs be opened for parts of migration?

To what point can different issues be delimited into +/- autonomous issues?

> c) how big / long breakage is acceptable?

How complete would breakage be: for all encodings, or would LANG=C always work?

Is this something which could be done for the most part in a concentrated manner during a code sprint (e.g. FOSS4G 2016)?

> 2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
> pushing the encode/decode "boundary" further. Upside - most of
> existing data is UTF-8 ready (parts supporting only ASCII) [4].

What do you mean by "text in GRASS location"? How about files on the filesystem that some users might want to access via other tools? Shouldn't they be in the system-wide encoding?

Thank you very much for bringing up this discussion in such a structured manner. I hope that others will show some interest in the matter...

Moritz
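The "check early and refuse to operate" idea from expected problem 3 in the quoted proposal could look roughly like the following Python sketch. The function name check_encoding is hypothetical and not part of GRASS; the sketch only assumes that the encoding of interest is the one reported by the current locale, and it fails loudly when Python has no codec for it.

```python
import codecs
import locale


def check_encoding(name=None):
    """Return the canonical codec name, or raise if the encoding is unusable.

    With no argument, the encoding is taken from the current locale.
    """
    if name is None:
        name = locale.getpreferredencoding(False)
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Fatal error instead of silently continuing with a wrong guess.
        raise RuntimeError("Unsupported or misconfigured encoding: %r" % name)
```

Calling check_encoding() once at startup would turn a misconfigured system into an immediate, explicit failure rather than corrupted text later on. It does not verify that the data actually is in that encoding, only that the configured encoding is one Python can handle.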
[GRASS-dev] Moving GRASS Python parts to Unicode
Hello devs,

as you might already have noticed, there is a constant stream of issues containing the keywords "encoding" or, more often, "UnicodeDecodeError". The main reason behind this is that Python 2.x has two types of text strings - byte sequence (the one you get with str()) and Unicode (unicode()). Python 3.x will have only one - Unicode (a byte sequence is not a string any more), thus fixing this frustrating source of errors.

Moving GRASS Python code to use Unicode internally will make it closer to Python 3 ready and solve the largest part of the errors caused by implicit conversion from encoded text strings to Unicode text strings.

The proposal is to make GRASS GIS Python code compliant with Unicode best practice [1], following the principle "decode early, encode late".

Things to change:
1) Any text string entering the Python part of the code should be decoded at its entry point and encoded back to a byte sequence at its exit point. This also applies to all calls to GRASS modules passing around text;
2) Replace all text strings with Unicode literals (u'text'). No exceptions. Note - "text strings" - thus byte sequences should not be touched;
3) Ensure text file reading / writing is done via codecs.open;
4) Pass only Unicode to Python file handling calls (this is important for running on MS-Windows);
5) Use Unicode in tests to ensure correctness of code;
6) Introduce information on Unicode usage into the Python submitting guidelines [2],[3].

Things to change outside of Python code:
1) Store attribute table encoding information along with connection parameters;
2) Ensure storage of correct encoding information on data import and correct use on export (especially painful for ESRI Shapefiles);
3) Ensure correct encoding information in headers of all PO and XML files.

Expected problems:
1) When moving to Python 3, all explicit Unicode literal definitions will need to be removed (u'text' -> 'text');
2) Introduction of the "decode early" principle will break all of the band-aids currently in place - a major breakage of code for a short time is expected;
3) Guessing the correct encoding can be a problem. One solution could be checking early for correctness of the system configuration and refusing to operate on improperly configured systems. A fatal error is better than silent data corruption (as is happening at the moment in certain scenarios).

Topics to discuss:
1) Implementation plan:
a) should it be done before 7.1?
b) should separate bugs be opened for parts of the migration?
c) how big / long a breakage is acceptable?
2) Moving all text in a GRASS location to UTF-8 encoding (GRASS 8), thus pushing the encode/decode "boundary" further. Upside - most of the existing data is UTF-8 ready (parts supporting only ASCII) [4].

1. http://unicodebook.readthedocs.org/good_practices.html
2. http://www.azavea.com/blogs/labs/2014/03/solving-unicode-problems-in-python-2-7/
3. https://docs.python.org/2/howto/unicode.html
4. http://utf8everywhere.org/

Have a nice day,
Māris.

Moved from trac ticket https://trac.osgeo.org/grass/ticket/2885
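A minimal sketch of "decode early, encode late" for file handling might look like this. It uses io.open rather than the codecs.open named in the proposal, because io.open takes the same encoding argument and behaves the same way in Python 2.6+ and Python 3. The helper names (read_text, write_text, normalize_title) and the UTF-8 default are assumptions made for this illustration, not existing GRASS functions.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Decode early: bytes on disk become text as soon as they are read."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def write_text(path, text, encoding="utf-8"):
    """Encode late: text becomes bytes only when it leaves the program."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


def normalize_title(title):
    """Hypothetical processing step that only ever sees Unicode text."""
    return title.strip().upper()


# Round trip through temporary files.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.txt")
dst = os.path.join(tmpdir, "out.txt")

# Raw UTF-8 bytes, as an external program might have written them.
with io.open(src, "wb") as f:
    f.write(u"  \u0101bols\n".encode("utf-8"))

write_text(dst, normalize_title(read_text(src)))

with io.open(dst, "rb") as f:
    result_bytes = f.read()
```

Everything between read_text and write_text works on Unicode text only, so the encoding has to be decided in exactly two places. The sketch does not address where the correct encoding comes from, which is the harder part raised under "Expected problems".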