Hi Marko (and anyone else who is interested),

I'm happy to take note of your comments on ANSI and UTF-16; however, I
disagree with your assessment that the code is good at present. The OSM data
files may well be in UTF-8 format, but they are not (usually) generated by the
user, and they have an XML header indicating the encoding. The copyright and
license files are text files provided as inputs by the user and hence carry no
encoding information. I can see no reason why user-generated text files should
be expected to be UTF-8 encoded just because the OSM data files are encoded
that way.

There is no mention of UTF-8 anywhere in the documentation, and when a
copyright or license file is not UTF-8 encoded we get an error message that
doesn't explain why the file failed to load. This is a recipe for frustrated
users and questions about why files won't load.

It seems to me that if only a single encoding is to be supported, then mkgmap
should expect the file to use the default code page, which is how a text editor
will normally save the file unless specifically asked to do otherwise. Of
course, some systems may well be set up with UTF-8 as the default. Until very
recently, the copyright file was not expected to be in UTF-8 format. I suggest
that one of the following options might be the way to go:

1:
Load the two files using the default code page.
If there is a failure, include the reason for the failure in the exit exception 
message.

2:
Update the documentation to indicate that UTF-8 must be used for the license
and copyright files.
If there is a failure, include the reason for the failure in the exit exception 
message.

3:
Use the existing --code-page option to also determine the code page for the 
copyright and license files. If not specified, use the default code page.
If there is a failure, include the reason for the failure in the exit exception 
message.
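
As a rough illustration of option 3, here is a minimal sketch of what I have in mind. The class and method names (TextFileLoader, load) are hypothetical and are not taken from the mkgmap source or from my patch; the mapping from the --code-page value to a Java charset name is left out and the charset name is simply passed in. A null name falls back to the default code page, and a decoding failure is reported with its reason.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextFileLoader {
    /**
     * Reads a text file strictly in the given charset.
     * charsetName == null means "use the platform default code page".
     * (Hypothetical helper for illustration only.)
     */
    static String load(Path file, String charsetName) throws IOException {
        Charset charset = (charsetName == null)
                ? Charset.defaultCharset()
                : Charset.forName(charsetName);
        byte[] bytes = Files.readAllBytes(file);
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting replacement characters for bad bytes.
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Name the file, the charset that was tried and the decoder's reason.
            throw new IOException("Cannot read " + file + " as " + charset.name()
                    + " (" + e.getMessage() + "); check the file's encoding", e);
        }
    }
}
```

With something like this, a license file saved in a Latin-1 style code page but read as UTF-8 would fail with a message naming both the file and the encoding that was tried, rather than a bare load failure.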

I am happy to rework the patch for any of the above, but will wait for further 
comments/feedback before proceeding.

Regards,
Mike

-----Original Message-----
From: Marko Mäkelä [mailto:[email protected]] 
Sent: 27 December 2016 20:30
To: Development list for mkgmap <[email protected]>
Subject: Re: [mkgmap-dev] Copyright & License file reader improvements

On Tue, Dec 27, 2016 at 06:07:11PM -0000, Mike Baggaley wrote:
>Hi Gerd, please find attached a small patch that improves the loading of
>copyright and license data when the --copyright-file and --license-file
>options are used. It will attempt to load the data using ANSI, UTF-8, UTF-16
>and the default code page. If it fails, more information is provided as to
>the reason why.

I am not Gerd, and I am not that active with mkgmap any more, but I have 
some interest in character encodings.

I had a quick look at the patch. It first tries ASCII (which is a proper 
subset of UTF-8), then UTF-8, UTF-16 and the default code page.

I do not think that there is any need to try ASCII separately. Any valid 
ASCII input is also valid UTF-8.

If the input is not valid UTF-8, things get tricky. I am not sure if 
UTF-16 is a good thing to try. Here is an example where 6 ASCII 
characters (which could be part of a non-ASCII, non-UTF-8 input) get 
misinterpreted as 3 Chinese glyphs in UTF-16:

$ echo -n foobar|recode utf16..utf8;echo
景潢慲

Because of this, I would omit the UTF-16 pass altogether. If UTF-16 
input is truly needed, the default code page could be set to it.
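
The same misreading can be reproduced in Java. This is only a sketch with a hypothetical class name (Utf16Misread); it relies on Java's UTF-16 decoder assuming big-endian byte order when there is no byte order mark, which matches the recode result above.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Misread {
    /** Decodes raw bytes as UTF-16 (big-endian when no BOM is present). */
    static String decodeAsUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] ascii = "foobar".getBytes(StandardCharsets.US_ASCII);
        String misread = decodeAsUtf16(ascii);
        // Six ASCII bytes pair up into three 16-bit code units.
        for (int i = 0; i < misread.length(); i++) {
            System.out.printf("U+%04X%n", (int) misread.charAt(i));
        }
    }
}
```

The byte pairs 66 6f, 6f 62 and 61 72 become U+666F, U+6F62 and U+6172, and the decoder reports no error, so a "try UTF-16 and see if it works" pass cannot detect the mistake.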

Also, some non-UTF-8 superset of ASCII could accidentally look like 
valid UTF-8. For example, the bytes 0xc2 0xa0 could represent the 
two-character string U+00C2 U+00A0 in ISO 8859-1. But the same bytes 
could also be interpreted as the single UTF-8 encoded character U+00A0.
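
To make the ambiguity concrete, here is a small Java sketch (the class name AmbiguousBytes is made up for illustration). It decodes the same two bytes strictly under both encodings, and both succeeding is exactly the problem.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AmbiguousBytes {
    /** Strict decode: throws instead of substituting replacement characters. */
    static String decode(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = { (byte) 0xC2, (byte) 0xA0 };
        String latin1 = decode(bytes, StandardCharsets.ISO_8859_1);
        String utf8 = decode(bytes, StandardCharsets.UTF_8);
        // ISO 8859-1 yields two characters, UTF-8 yields one.
        System.out.println(latin1.length() + " vs " + utf8.length());
    }
}
```

Neither decode throws, so validity alone cannot tell the two encodings apart for this input.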

I think that if multiple input formats are supported (which would be 
against the Unix philosophy of keeping programs simple), the selection 
must be explicit, by some command line switch that chooses to use the 
default code page instead of UTF-8.

In my opinion, the current code is good as it is. Because mkgmap already 
deals with mostly UTF-8 input (the OSM data), I think it is consistent 
to assume that all text files are encoded in UTF-8.

Best regards,

        Marko

