Hi Maris,

On 07/02/16 11:56, Maris Nartiss wrote:
Hello devs,
as you might already have noticed, there is a constant stream of
issues containing the keyword "encoding" or, more often,
"UnicodeDecodeError". The main reason behind this is that Python 2.x
has two types of text strings: byte sequences (what you get with
str()) and Unicode (unicode()). Python 3.x has only one, Unicode (a
byte sequence is not a string any more), thus fixing this frustrating
source of errors.
Moving GRASS Python code to use Unicode internally will bring it closer
to being Python 3 ready and solve most of the errors caused by implicit
conversion from encoded text strings to Unicode text strings.
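
As an illustration of the failure mode described above (a sketch only, not taken from GRASS code; the sample place name is made up): when Python 2 mixes a byte string with a Unicode string, it implicitly decodes the bytes with the ASCII codec. Written out explicitly, that step looks roughly like this and fails on the first non-ASCII byte:

```python
# -*- coding: utf-8 -*-
# Bytes as they might arrive from a module's output or a file (UTF-8 encoded).
raw = u"Rīga".encode("utf-8")

# Python 2 performs this decode implicitly for expressions such as
# u"name: " + raw; done explicitly, it fails on any non-ASCII byte.
try:
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding explicitly with the codec the data was written in avoids the error.
text = raw.decode("utf-8")
```

The point of the sketch is only that the error comes from the ASCII default of the implicit step, not from the data itself.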

I would be very happy if we could find a structural solution to this which would avoid having to deal with so many individual errors all the time.


The proposal is to make GRASS GIS Python code compliant with Unicode
best practice [1], following the principle "decode early, encode late".
Things to change:
1) Any text string entering the Python part of the code should be decoded
at its entry point and encoded back to a byte sequence at its exit point.
This also applies to all calls to GRASS modules passing around text;
2) Replace all text strings with Unicode literals (u'text'). No
exceptions. Note - "text strings" - thus byte sequences should not be
touched;
3) Ensure text file reading / writing is done via codecs.open;
4) Pass only Unicode to Python file handling calls (this is important
for running on MS-Windows);
5) Use Unicode in tests to ensure correctness of code;
6) Introduce information on Unicode usage into Python submitting
guidelines [2],[3].
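
The "decode early, encode late" items above could be sketched roughly as follows. This is only an illustration under stated assumptions: decode_input and encode_output are hypothetical helper names, UTF-8 is assumed as the external encoding, and io.open is used in place of the codecs.open named in the proposal because it behaves the same way on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def decode_input(raw, encoding="utf-8"):
    """Entry point: turn incoming bytes into Unicode text (hypothetical helper)."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def encode_output(text, encoding="utf-8"):
    """Exit point: turn Unicode text back into bytes (hypothetical helper)."""
    return text.encode(encoding)


# Decode early: bytes arriving from outside become Unicode at once.
name = decode_input(u"Zemgale ā".encode("utf-8"))

# Inside the program, work with Unicode only.
label = u"Region: " + name

# Text file I/O with an explicit encoding; the file name is Unicode as well.
path = os.path.join(tempfile.mkdtemp(), u"label.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(label)
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Encode late: bytes are produced only when data leaves the program.
payload = encode_output(read_back)
```

Keeping the conversion in two small functions means the encoding assumption lives in one place and can be changed without touching the code in between.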

Things to change outside of Python code:
1) Store attribute table encoding information along with connection parameters;
2) Ensure storage of correct encoding information on data import and
correct use on export (especially painful for ESRI Shapefiles);
3) Ensure correct encoding information in headers of all PO and XML files.

Expected problems:
1) When moving to Python 3, all explicit Unicode literal definitions
will need to be removed (u'text' -> 'text');
2) Introduction of the "decode early" principle will break all of the
band-aids currently in place - a major breakage of code for a short
time is expected;
3) Guessing the correct encoding can be a problem. One solution could
be to check early for correct system configuration and to refuse
to operate on improperly configured systems. A fatal error is better
than silent data corruption (as is happening at the moment in
certain scenarios).
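
The early configuration check mentioned in point 3 might look something like the sketch below. It is an assumption-laden illustration: require_usable_encoding is a hypothetical name, and the only condition tested is that Python knows a codec for the reported encoding. What should count as "improperly configured" beyond that is left open here.

```python
import codecs
import locale


def require_usable_encoding(encoding=None):
    """Return the normalized codec name or raise (hypothetical helper).

    By default the encoding reported by the system locale is checked.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    try:
        # codecs.lookup raises LookupError for encodings Python does not know.
        return codecs.lookup(encoding).name
    except LookupError:
        # Fail loudly at startup instead of corrupting data later.
        raise RuntimeError("Unusable system encoding: %r" % (encoding,))
```

Calling such a function once at startup would turn a misconfigured environment into an immediate, visible error.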


I am no expert on this question, and thus do not have a clear opinion on your proposal, except for the fact that I'm very happy that it exists, but here are my intuitive ideas & questions on your topics:


Topic to discuss:
1) Implementation plan:
a) should it be done before 7.1?

I think the sooner, the better, so 7.1 should be our latest milestone (7.0.x should be in 'bugfix only' mode).

b) should separate bugs be opened for parts of migration?

To what extent can the work be split into more or less autonomous issues?

c) how big / long breakage is acceptable?

How complete would breakage be: for all encodings, or would LANG=C always work ?

Is this something which could be done for most part in a concentrated manner during a code sprint (e.g. FOSS4G 2016) ?

2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
pushing the encode/decode "boundary" further. Upside - most of
existing data is UTF-8 ready (parts supporting only ASCII) [4].

What do you mean by "text in GRASS location"? What about files on the filesystem that some users might want to access via other tools? Shouldn't they be in the system-wide encoding?

Thank you very much for bringing up this discussion in such a structured manner. I hope that others will show some interest in the matter...

Moritz
_______________________________________________
grass-dev mailing list
[email protected]
http://lists.osgeo.org/mailman/listinfo/grass-dev