Alexander,

I'm cc'ing Gaige Paulsen, as he proposed in http://trac.osgeo.org/gdal/ticket/3403 a patch with a similar approach to yours, that is to say providing a method at the OGRLayer level to return the encoding.
The more I think about this issue, the more I recognize that "UTF-8 everywhere internally" is probably not practical in all situations, or at least doesn't leave enough control to the user. UTF-8 as a pivot is - conceptually - OK for the read part, but it doesn't help for the write part when a driver doesn't support UTF-8 (or if, for compatibility with other software, we must write data in a certain encoding).

My main remark about your patch is that I don't believe the enum approach to listing the encodings is the best one. I'd rather be in favor of using a string, possibly sticking to the names returned by 'iconv -l', so that we can easily feed the return value of GetEncoding() into the converter through CPLRecode(). I experimented with this some time ago and have some changes ready in cpl_recode_stub.cpp & configure to plug iconv support into it, in order to extend its scope beyond the current hardcoded support for UTF8 and ISO-8859-1. We could imagine -s_encoding, -t_encoding and -a_encoding switches for ogr2ogr to let the user define the transcoding or encoding assignment.

One of the difficulties raised by Gaige in #3403 is the meaning of the width attribute of an OGRFieldDefn object (number of bytes, or number of characters in a given encoding), and how/if it will be affected by an encoding change. The other issues raised by Gaige in his last comment are still worth considering.

For the read part, what do we want?

1) that the driver returns the data in its "raw" encoding and mentions the encoding --> matches the approach of your proposal
2) that we ask it to return the data converted to UTF-8 when we don't care about the data in its source encoding
3) that we can override its encoding when the source encoding is believed to be incorrect, so that 2) can work properly

Approaches 1) and 2) clearly follow two different tracks. One way to reconcile them would be to provide some configuration/opening option to choose which behaviour is preferred.
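To make the three read options above a bit more concrete, here is a rough sketch in Python. It does not use any real OGR API: read_field(), its "mode" and "override_encoding" arguments and the sample bytes are all made up for illustration, and Python's codec machinery merely stands in for what iconv / CPLRecode() would do. The encoding names are plain strings of the kind 'iconv -l' lists.

```python
# Hypothetical sketch, not OGR code: read_field() and its arguments are
# invented for illustration; Python codecs stand in for iconv / CPLRecode().

def read_field(raw, declared_encoding, mode="utf8", override_encoding=None):
    """Return (bytes, encoding_name) for one string field value.

    mode="raw"        -> option 1: bytes untouched, plus the encoding name
    mode="utf8"       -> option 2: bytes recoded to UTF-8
    override_encoding -> option 3: replaces the declared encoding first
    """
    source = override_encoding or declared_encoding
    if mode == "raw":
        return raw, source
    if mode == "utf8":
        return raw.decode(source).encode("utf-8"), "UTF-8"
    raise ValueError("unknown mode: %r" % (mode,))


latin1_value = b"caf\xe9"  # "café" stored as ISO-8859-1, 4 bytes

# Option 1: raw bytes, with the encoding reported alongside.
raw_result = read_field(latin1_value, "ISO-8859-1", mode="raw")

# Option 2: the caller only wants UTF-8. The value grows to 5 bytes,
# which is the field-width question raised in #3403.
utf8_result = read_field(latin1_value, "ISO-8859-1")

# Option 3: the declared encoding is wrong (the byte 0x80 is really the
# euro sign in CP1252), so it is overridden before converting to UTF-8.
fixed_result = read_field(b"\x80", "ISO-8859-1", override_encoding="CP1252")
```

In this sketch, option 3 is just option 2 with a corrected source encoding, which is why an override switch may cover most of what option 1 is wanted for.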
RFC23 currently chooses 2), as it mandates that "Any driver which knows it's encoding should convert to UTF-8." Well, probably not a big deal, since any change related to how we deal with encoding is likely to cause RFC23 to be amended anyway.

Personally, I'm not sure which one is best. I'm wondering what the use cases for 1) are: when do we really want the data to be returned in its source encoding --> won't it be converted to UTF-8 later at the application level, after the user has potentially selected/overridden the source encoding? In which case 3) would solve the problem. Just thinking out loud...

For the write part, an OGRSFDriver::GetSupportedEncodings() and an OGRLayer::SetEncoding() could make sense (for the latter, whether it must be exposed at the datasource or layer level is an open point, and a slight difference between your approach and Gaige's).

Best regards,

Even

_______________________________________________
gdal-dev mailing list
http://lists.osgeo.org/mailman/listinfo/gdal-dev
