On 09.01.2014 22:45, Antoine Pitrou wrote:
> On Thu, 9 Jan 2014 13:36:05 -0800
> Chris Barker <chris.bar...@noaa.gov> wrote:
>>
>> Some folks have suggested using latin-1 (or other 8-bit encoding) -- is
>> that guaranteed to work with any binary data, and round-trip accurately?
> 
> Yes, it is.

Just a word of caution:

Using 'latin-1' to mean "unknown encoding" can easily result
in mojibake (unreadable text) entering your application, with
dangerous effects on your other text data.

E.g. "Marc-André", encoded as UTF-8 and then read using
'latin-1', will give you "Marc-AndrÃ©" in your
application. (Yes, I see that a lot in applications
and websites I use ;-))
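
A minimal sketch of both effects in Python 3 (the variable names are
only illustrative): latin-1 round-trips the bytes exactly, but the
decoded text is mojibake if the bytes were really UTF-8.

```python
# UTF-8 encoded input: the e-acute (U+00E9) becomes two bytes, C3 A9.
raw = "Marc-Andr\u00e9".encode("utf-8")

# Decoding as latin-1 never fails, because every byte maps to a code point ...
mojibake = raw.decode("latin-1")

# ... and encoding back as latin-1 returns the original bytes unchanged.
assert mojibake.encode("latin-1") == raw

# But the text now holds two wrong characters (U+00C3, U+00A9)
# where the single e-acute should be.
print(ascii(mojibake))

# The damage can be undone only if you know the real encoding.
repaired = mojibake.encode("latin-1").decode("utf-8")
```

The round trip is safe as long as the text is only passed through; the
trouble starts once it is displayed, compared or combined with
correctly decoded text.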

Also note that indexing based on code points will likely
break that way as well, i.e. if you pass an index to an
application based on what you see in your editor or
shell, that index can be wrong when used on the
encoded data. UTF-8 is a popular variable-length
encoding for Unicode, so you'll hit this problem
whenever dealing with non-ASCII UTF-8 data.
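
To illustrate the indexing problem, here is a small sketch (Python 3,
made-up sample string): an index computed on the properly decoded text
is off by one on the UTF-8 bytes, and therefore also on a latin-1
decoding of those bytes.

```python
name = "Marc-Andr\u00e9 Lemburg"      # what you see in your editor
raw = name.encode("utf-8")            # what is actually stored
as_latin1 = raw.decode("latin-1")     # the "unknown encoding" view

# Index of the surname, computed on the real text (code points).
i = name.index("Lemburg")
print(name[i:])

# The same index applied to the encoded data lands one position early,
# because the e-acute takes two bytes in UTF-8.
print(raw[i:])
print(as_latin1[i:])
```

Each non-ASCII character before the index shifts the offset further,
so the error grows with the amount of non-ASCII data.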

>> and will surrogateescape work for arbitrary binary data?
> 
> Yes, it will.

The surrogateescape trick only works if you encode
your output using the same encoding that you used for decoding
the input. Otherwise, you'll get a mix of the input encoding and the
output encoding as output.
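
A sketch of that failure mode, assuming Python 3 and an application
that appends some properly decoded text (the sample strings are made
up for illustration):

```python
# Input really is UTF-8, but we decode it as ASCII with surrogateescape,
# so the two bytes C3 A9 become the lone surrogates U+DCC3 and U+DCA9.
raw = b"Marc-Andr\xc3\xa9"
text = raw.decode("ascii", "surrogateescape")

# The application adds real (correctly decoded) non-ASCII text.
text += " caf\u00e9"

# Same codec family on output: escaped bytes pass through, new text is
# encoded as UTF-8, and the result is consistent UTF-8.
same = text.encode("utf-8", "surrogateescape")

# Different codec on output: escaped bytes are still UTF-8, but the new
# text is now latin-1, so the result is a mix of both encodings.
mixed = text.encode("latin-1", "surrogateescape")
print(same)
print(mixed)

# The mixed output is no longer valid UTF-8.
try:
    mixed.decode("utf-8")
    mixed_is_utf8 = True
except UnicodeDecodeError:
    mixed_is_utf8 = False
```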

Note that the error handler trick has an advantage over the
latin-1 trick: if you try to encode a Unicode string
containing escaped surrogates without using the error handler,
it will fail, so you at least know that there are "funny"
code points in your output string that need some extra
care.
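
A short sketch of the difference, assuming Python 3;
has_escaped_bytes() is a hypothetical helper written for this example,
not a stdlib function:

```python
raw = b"Andr\xc3\xa9"   # UTF-8 bytes of unknown origin

# latin-1 trick: re-encoding as UTF-8 succeeds silently and
# produces double-encoded output.
silent = raw.decode("latin-1").encode("utf-8")

# surrogateescape trick: a strict encode refuses the lone surrogates.
escaped = raw.decode("ascii", "surrogateescape")
try:
    escaped.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

def has_escaped_bytes(text):
    """Return True if text contains surrogateescape code points.

    Hypothetical helper: surrogateescape maps the undecodable bytes
    0x80..0xFF to U+DC80..U+DCFF.
    """
    return any("\udc80" <= ch <= "\udcff" for ch in text)

print(silent, strict_failed, has_escaped_bytes(escaped))
```

With the error handler passed explicitly on output, the original bytes
come back unchanged.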

BTW: Perhaps it would be a good idea to backport the
surrogateescape error handler to Python 2.7 to simplify
writing code which works in both Python 2 and 3.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 10 2014)