On Wed, Feb 10, 2016 at 9:42 AM, Moritz Lennert < [email protected]> wrote:
> Hi Maris,
>
> On 07/02/16 11:56, Maris Nartiss wrote:
>
>> Hello devs,
>> as you might already have noticed, there is a constant stream of
>> issues containing the keywords "encoding" or, more often,
>> "UnicodeDecodeError". The main reason behind this is that Python 2.x has
>> two types of text strings: byte sequence (the one you get with str()) and
>> Unicode (unicode()). Python 3.x will have only one, Unicode (a byte
>> sequence is not a string any more), thus fixing this frustrating source
>> of errors.
>> Moving GRASS Python code to use Unicode internally will make it closer
>> to Python 3 ready and solve the largest part of the errors caused by
>> implicit conversion from encoded text strings to Unicode text strings.
>>
>
> I would be very happy if we could find a structural solution to this which
> would avoid having to deal with so many individual errors all the time.
>
>
>> The proposal is to make GRASS GIS Python code compliant with Unicode
>> best practice [1], following the principle "decode early, encode late".
>> Things to change:
>> 1) Any text string entering the Python part of the code should be decoded
>> at its entry point and encoded back to a byte sequence at its exit point.
>> This also applies to all calls to GRASS modules passing around text;
>> 2) Replace all text strings with Unicode literals (u'text'). No
>> exceptions. Note: this applies to "text strings" only, so byte sequences
>> should not be touched;
>> 3) Ensure text file reading / writing is done via codecs.open;
>> 4) Pass only Unicode to Python file handling calls (this is important
>> for running on MS-Windows);
>> 5) Use Unicode in tests to ensure correctness of code;
>> 6) Introduce information on Unicode usage into Python submitting
>> guidelines [2],[3].
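To make points 1) to 4) a bit more concrete, here is a minimal sketch of how "decode early, encode late" could look. It is only an illustration under assumptions: the fixed UTF-8 encoding and the names decode(), encode(), make_title() and roundtrip_file() are hypothetical and not existing GRASS functions, and io.open is used in place of the codecs.open named in the proposal because it behaves the same way on Python 2 and 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Assumption for this sketch only: all data is UTF-8. Real code would take
# the encoding from the locale or from stored metadata.
ENCODING = "utf-8"


def decode(raw, encoding=ENCODING):
    """Entry point: turn a byte sequence into a Unicode string."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw  # already Unicode


def encode(text, encoding=ENCODING):
    """Exit point: turn a Unicode string back into a byte sequence."""
    if isinstance(text, bytes):
        return text  # already a byte sequence
    return text.encode(encoding)


def make_title(raw_name):
    """Hypothetical helper: bytes in, bytes out, Unicode in between."""
    name = decode(raw_name)       # decode early
    title = u"Map: " + name       # internal work uses Unicode literals only
    return encode(title)          # encode late


def roundtrip_file(text):
    """Write and read back a text file with an explicit encoding."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding=ENCODING) as f:
            f.write(text)         # accepts Unicode only
        with io.open(path, "r", encoding=ENCODING) as f:
            return f.read()       # returns Unicode
    finally:
        os.remove(path)
```

The point of the sketch is that only decode() and encode() ever see an encoding name; everything between them handles Unicode, so no implicit conversion can happen.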
>>
>> Things to change outside of Python code:
>> 1) Store attribute table encoding information along with connection
>> parameters;
>> 2) Ensure storage of correct encoding information on data import and
>> correct use on export (especially painful for ESRI Shapefiles);
>> 3) Ensure correct encoding information in headers of all PO and XML files.
>>
>> Expected problems:
>> 1) When moving to Python 3, all explicit Unicode literal definitions
>> will need to be removed (u'text' -> 'text');
>> 2) Introduction of the "decode early" principle will break all of the
>> band-aids currently in place - a major breakage of code for a short
>> time is expected;
>> 3) Guessing the correct encoding can be a problem. One solution could
>> be checking early for correctness of system configuration and refusing
>> to operate on improperly configured systems. A fatal error is better
>> than silent data corruption (as is happening at the moment in
>> certain scenarios).
>>
>>
> I am no expert on this question, and thus do not have a clear opinion on
> your proposal, except for the fact that I'm very happy that it exists, but
> here are my intuitive ideas & questions on your topics:

I don't have a clear opinion either, but I hoped Glynn could state his
opinion here, because I understood he has a different view on some of these
things. AFAIR, one of the problems is the possibly different needs of the
Python scripting library vs. the GUI.

Anna

>
>> Topic to discuss:
>> 1) Implementation plan:
>> a) should it be done before 7.1?
>>
>
> I think the sooner, the better, so 7.1 should be our latest milestone
> (7.0.x should be in 'bugfix only' mode).
>
>> b) should separate bugs be opened for parts of migration?
>>
>
> To what point can different issues be delimited into +/- autonomous
> issues?
>
>> c) how big / long breakage is acceptable?
>>
>
> How complete would breakage be: for all encodings, or would LANG=C always
> work?
>
> Is this something which could be done for the most part in a concentrated
> manner during a code sprint (e.g. FOSS4G 2016)?
>
>> 2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
>> pushing the encode/decode "boundary" further. Upside - most of
>> existing data is UTF-8 ready (parts supporting only ASCII) [4].
>>
>
> What do you mean by "text in GRASS location"? How about files on the
> filesystem that some users might want to access via other tools? Shouldn't
> they be in the system-wide encoding?
>
> Thank you very much for bringing up this discussion in such a structured
> manner. I hope that others will show some interest in the matter...
>
> Moritz
>
> _______________________________________________
> grass-dev mailing list
> [email protected]
> http://lists.osgeo.org/mailman/listinfo/grass-dev
>
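As a small illustration of the "check early and refuse to operate" idea from expected problem 3), something along these lines might work. This is only a sketch: require_usable_encoding() is a hypothetical name, and it only verifies that Python knows the configured encoding; whether a plain ASCII / LANG=C setup should be accepted is a policy question it deliberately leaves open.

```python
import codecs
import locale


def require_usable_encoding(encoding=None):
    """Return the normalized codec name, or raise if it cannot be used.

    Hypothetical helper: with no argument it inspects the encoding
    reported by the current locale settings.
    """
    if encoding is None:
        # False: do not change the process locale as a side effect
        encoding = locale.getpreferredencoding(False)
    if not encoding:
        raise RuntimeError("No text encoding is configured")
    try:
        return codecs.lookup(encoding).name
    except LookupError:
        # Fail loudly instead of silently corrupting data later on
        raise RuntimeError("Unknown text encoding: %r" % (encoding,))
```

Called once at startup, this turns a misconfigured system into an immediate fatal error, before any text has been read or written.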
