Hello devs,

as you might already have noticed, there is a constant stream of issues containing the keywords "encoding" or, more often, "UnicodeDecodeError". The main reason behind this is that Python 2.x has two types of text strings: byte sequences (the one you get with str()) and Unicode (unicode()). Python 3.x has only one, Unicode (a byte sequence is no longer a string), which fixes this frustrating source of errors. Moving GRASS Python code to use Unicode internally will bring it closer to being Python 3 ready and will solve the largest part of the errors caused by implicit conversion from encoded text strings to Unicode text strings.
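To make the failure mode concrete, here is a minimal sketch (not GRASS code) that runs on both Python 2.7 and Python 3. The function names and the sample value are hypothetical, and UTF-8 is only an assumed input encoding. The first function imitates the implicit ASCII decode that Python 2 performs when a byte string meets a Unicode string; the second does the conversion explicitly at the boundaries and keeps Unicode inside.

```python
from __future__ import unicode_literals

# Hypothetical input: UTF-8 encoded bytes, e.g. as read from a module's output.
raw = "M\u0101ris".encode("utf-8")


def implicit_mix(raw_bytes):
    """Imitate Python 2's implicit conversion: bytes are decoded as ASCII."""
    return "name: " + raw_bytes.decode("ascii")


def explicit_boundary(raw_bytes, encoding="utf-8"):
    """Decode on entry, work in Unicode, encode on exit.

    The encoding is an assumption here; real code has to know or detect it.
    """
    text = raw_bytes.decode(encoding)   # entry point: bytes -> Unicode
    label = "name: " + text             # internal work on Unicode only
    return label.encode(encoding)       # exit point: Unicode -> bytes


try:
    implicit_mix(raw)
    implicit_failed = False
except UnicodeDecodeError:
    # Non-ASCII bytes cannot be decoded with the ASCII codec.
    implicit_failed = True

result = explicit_boundary(raw)
```

The point of the second function is that the encoding is named exactly once on the way in and once on the way out, so a wrong guess fails at a known place instead of somewhere deep inside the code.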
The proposal is to make GRASS GIS Python code compliant with Unicode best practice [1], following the principle "decode early, encode late".

Things to change:
1) Any text string entering the Python part of the code should be decoded at its entry point and encoded back to a byte sequence at its exit point. This also applies to all calls to GRASS modules that pass text around;
2) Replace all text strings with Unicode literals (u'text'). No exceptions. Note: this applies to text strings only; byte sequences should not be touched;
3) Ensure text file reading / writing is done via codecs.open;
4) Pass only Unicode to Python file handling calls (this is important for running on MS-Windows);
5) Use Unicode in tests to ensure correctness of the code;
6) Introduce information on Unicode usage into the Python submitting guidelines [2],[3].

Things to change outside of Python code:
1) Store attribute table encoding information along with the connection parameters;
2) Ensure that correct encoding information is stored on data import and used correctly on export (especially painful for ESRI Shapefiles);
3) Ensure correct encoding information in the headers of all PO and XML files.

Expected problems:
1) When moving to Python 3, all explicit Unicode literal definitions will need to be removed (u'text' -> 'text');
2) Introducing the "decode early" principle will break all of the band-aids currently in place. A major breakage of code for a short time is expected;
3) Guessing the correct encoding can be a problem. One solution could be to check early for a correct system configuration and to refuse to operate on improperly configured systems. A fatal error is better than silent data corruption (which is what happens at the moment in certain scenarios).

Topics to discuss:
1) Implementation plan:
   a) should it be done before 7.1?
   b) should separate bugs be opened for parts of the migration?
   c) how big / long a breakage is acceptable?
2) Moving all text in a GRASS location to UTF-8 encoding (GRASS 8), thus pushing the encode/decode "boundary" further out. Upside: most of the existing data is already UTF-8 ready (the parts that support only ASCII are already valid UTF-8) [4].

1. http://unicodebook.readthedocs.org/good_practices.html
2. http://www.azavea.com/blogs/labs/2014/03/solving-unicode-problems-in-python-2-7/
3. https://docs.python.org/2/howto/unicode.html
4. http://utf8everywhere.org/

Jauku dienu; miłego dnia; хорошего дня,
Māris.

Moved from trac ticket https://trac.osgeo.org/grass/ticket/2885
