I had a go at this, but just getting a reproducible test was not easy. I 
think the current behavior is OK. The issues I noticed can be attributed to 
JVM misconfiguration. JSON data on disk should arguably be stored in UTF-8 
(since it’s JSON); it’s just that the JVM on Windows assumes everything is 
windows-1252 (unless told otherwise), hence our problem.
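
For anyone hitting the same thing, here is a minimal sketch of reading a JSON 
file as UTF-8 no matter what the JVM default charset is. It is not taken from 
GeoTools; the class name ReadJsonAsUtf8 and its helper methods are 
hypothetical and only illustrate the idea.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only (not part of GeoTools).
final class ReadJsonAsUtf8 {

    // Decodes the file content as UTF-8, ignoring the JVM default charset.
    static String readUtf8(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the text to a temporary file as UTF-8 and checks that it reads back unchanged.
    static boolean roundTrips(String text) {
        try {
            Path tmp = Files.createTempFile("sample", ".geojson");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return text.equals(readUtf8(tmp));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // On Windows this typically reports windows-1252 unless file.encoding is set.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        // \u00f6 is the letter o with diaeresis, escaped so the source file stays ASCII.
        System.out.println(roundTrips("{\"name\":\"G\u00f6teborg\"}"));
    }
}
```

Because the charset is passed explicitly, the result does not depend on 
-Dfile.encoding or on the platform the JVM runs on.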

-Tobias

From: Ian Turton <ijtur...@gmail.com>
Sent: 10 May 2022 19:58
To: Tobias Gerdin <tobias.ger...@havochvatten.se>
Cc: Geotools-Devel list <geotools-devel@lists.sourceforge.net>
Subject: Re: [Geotools-devel] gt-geojsondatastore GeoJSONReader should specify 
encoding as UTF-8?

I just assumed that everything was going to be UTF-8. Happy to review a pull 
request with a test.

Ian

On Tue, 10 May 2022, 13:41 Tobias Gerdin, 
<tobias.ger...@havochvatten.se> wrote:
Hello,


I was puzzled by the behaviour of org.geotools.data.geojson.GeoJSONReader when 
I was using it to read a feature collection containing non-ASCII strings. It 
complains that the JSON string contains invalid UTF-8 data.


Due to a client mandate I need to develop on a Windows 11 machine. The default 
platform encoding is windows-1252 (for archeological reasons, I guess), not 
UTF-8. I noticed that GeoJSONReader uses plain String.getBytes() to read the 
JSON data 
(https://github.com/geotools/geotools/blob/f416fcc3763b2db020c54a9323601fbdd49388e7/modules/unsupported/geojson-core/src/main/java/org/geotools/data/geojson/GeoJSONReader.java#L179).


When I change the JVM charset encoding (which needs to be done at startup) 
using -Dfile.encoding="UTF-8" my code works, but I would rather not have to do 
this. I am not an expert on JSON, but I recall that the spec mandates that JSON 
data is encoded in UTF-8. So I believe that the line linked above should call 
jsonString.getBytes(StandardCharsets.UTF_8) (and likewise in all other 
locations where JSON data is read).
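
To illustrate the difference, here is a minimal sketch (not the GeoTools code 
itself). It assumes ISO-8859-1 as a stand-in for the windows-1252 platform 
default, since both encode "ö" as the single byte 0xF6 and ISO-8859-1 is 
guaranteed to exist on every JVM. The class name EncodingSketch and the 
helper isValidUtf8 are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only (not part of GeoTools).
final class EncodingSketch {

    // Returns true only if the bytes decode as well-formed UTF-8.
    // The decoder is set to REPORT so bad input fails instead of being replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // \u00f6 is the letter o with diaeresis, escaped so the source file stays ASCII.
        String json = "{\"name\":\"G\u00f6teborg\"}";

        // Assumption: ISO-8859-1 mimics what getBytes() yields with a legacy default charset.
        byte[] legacy = json.getBytes(StandardCharsets.ISO_8859_1);
        // Explicit charset: the result no longer depends on the platform default.
        byte[] utf8 = json.getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(legacy)); // prints false: a lone 0xF6 is not valid UTF-8
        System.out.println(isValidUtf8(utf8));   // prints true
    }
}
```

A parser that expects UTF-8 input would reject the first byte array, which 
matches the "invalid UTF-8" complaint described above.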

Apparently Java is slated to go UTF-8 by default in the future, but until then 
we need to deal with this mess I guess.

Tobias Gerdin
Systems Developer, Consultant
Systems Development Unit

Gullbergs Strandgata 15, 411 04 Göteborg
Box 11930, SE-404 39 Göteborg
tobias.ger...@havochvatten.se
www.havochvatten.se

SwAM processes your personal data in accordance with the General Data 
Protection Regulation (GDPR) and our Data Protection Policy, see 
www.havochvatten.se/sa-behandlar-hav-dina-personuppgifter

_______________________________________________
GeoTools-Devel mailing list
GeoTools-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geotools-devel
