Re: [Python-3000] BOM handling

Antoine Pitrou Wed, 13 Sep 2006 13:33:29 -0700

On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
> And is generally ignored, as per unicode spec; it's a "zero width
> non-breaking space" - an invisible character with no effect on wrapping
> or otherwise.


Well, it would be better if Py3K (with all strings being unicode) made
things easy for the programmer and abstracted away those "invisible
characters with no textual meaning". Currently that's not the case:

>>> import codecs
>>> a = "hello".decode("utf-8")
>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False

>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False
>>> a
u'hello'
>>> b
u'\ufeffhello'
>>> print a
hello
>>> print b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
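
To illustrate the kind of abstraction meant here, below is a minimal
sketch of a decoding step that drops a leading U+FEFF. The helper name
decode_without_bom is hypothetical (it is not an existing API), and the
code is written with bytes literals so that it runs on a current
interpreter rather than on 2.4:

```python
import codecs

# Hypothetical helper, not part of the standard library: decode the bytes
# with the given encoding, then drop one leading U+FEFF if present.
def decode_without_bom(data, encoding):
    text = data.decode(encoding)
    if text.startswith(u"\ufeff"):
        text = text[1:]
    return text

# UTF-8 case: the same payload with and without a BOM prefix.
plain_utf8 = decode_without_bom(b"hello", "utf-8")
marked_utf8 = decode_without_bom(codecs.BOM_UTF8 + b"hello", "utf-8")

# UTF-16-LE case: the same payload with and without a BOM prefix.
payload = u"hello".encode("utf-16le")
plain_utf16 = decode_without_bom(payload, "utf-16le")
marked_utf16 = decode_without_bom(codecs.BOM_UTF16_LE + payload, "utf-16le")

print(len(marked_utf8), plain_utf8 == marked_utf8)
print(len(marked_utf16), plain_utf16 == marked_utf16)
```

With such a step the two strings in each pair compare equal and both
have length 5. Note that this sketch only removes a mark at the very
start of the text; a U+FEFF appearing elsewhere is left untouched.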


Regards

Antoine.

