On Thu, Jan 23, 2014 at 12:13 PM, <josef.p...@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 11:58 AM, <josef.p...@gmail.com> wrote:
>> On Thu, Jan 23, 2014 at 11:43 AM, Oscar Benjamin
>> <oscar.j.benja...@gmail.com> wrote:
>>> On Thu, Jan 23, 2014 at 11:23:09AM -0500, josef.p...@gmail.com wrote:
>>>>
>>>> another curious example, encode utf-8 to latin-1 bytes
>>>>
>>>> >>> b
>>>> array(['Õsc', 'zxc'],
>>>>       dtype='<U3')
>>>> >>> b[0].encode('utf8')
>>>> b'\xc3\x95sc'
>>>> >>> b[0].encode('latin1')
>>>> b'\xd5sc'
>>>> >>> b.astype('S')
>>>> Traceback (most recent call last):
>>>>   File "<pyshell#40>", line 1, in <module>
>>>>     b.astype('S')
>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
>>>> position 0: ordinal not in range(128)
>>>> >>> c = b.view('S4').astype('S1').view('S3')
>>>> >>> c
>>>> array([b'\xd5sc', b'zxc'],
>>>>       dtype='|S3')
>>>> >>> c[0].decode('latin1')
>>>> 'Õsc'
>>>
>>> Okay, so it seems that .view() implicitly uses latin-1 whereas .astype()
>>> uses ascii:
>>>
>>> >>> np.array(['Õsc']).astype('S4')
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position
>>> 0: ordinal not in range(128)
>>> >>> np.array(['Õsc']).view('S4')
>>> array([b'\xd5', b's', b'c'],
>>>       dtype='|S4')
>>
>>
>> No, a view doesn't change the memory, it just changes the
>> interpretation and there shouldn't be any conversion involved.
>> astype does type conversion, but it goes through ascii encoding which fails.
>>
>> >>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>> >>> b.tostring()
>> b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>> >>> b.view('S12')
>> array([b'\xd5\x00\x00\x00s\x00\x00\x00c', b'z\x00\x00\x00x\x00\x00\x00c'],
>>       dtype='|S12')
>>
>> The conversion happens somewhere in the array creation, but I have no
>> idea about the memory encoding for uc2 and the low level layouts.
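To make the view/astype distinction in the quoted exchange concrete, here is a minimal sketch. It assumes a reasonably recent NumPy on Python 3 (`tobytes`, `np.shares_memory` and `np.char.encode` are existing NumPy names; the variable names are only illustrative). It is not taken from the thread, and the astype behaviour is recorded in a flag rather than assumed, since it may differ between versions.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# A view reinterprets the existing buffer: nothing is copied or re-encoded.
raw = b.view('S12')
shares = np.shares_memory(b, raw)

# Each character occupies 4 bytes, so the buffer equals the UTF-32-LE
# encoding of the concatenated strings (both are exactly 3 characters long).
same_layout = b.tobytes() == ''.join(b.tolist()).encode('utf-32-le')

# astype('S') is a real conversion; in the session quoted above it went
# through the ASCII codec and failed on the non-ASCII character.
try:
    b.astype('S')
    astype_failed = False
except UnicodeEncodeError:
    astype_failed = True

# Naming the codec explicitly avoids relying on any implicit default.
latin1 = np.char.encode(b, 'latin1')   # one byte per character here
utf8 = np.char.encode(b, 'utf8')       # 'Õ' becomes two bytes
```

With an explicit codec the result no longer depends on what a view or a cast happens to do with the underlying bytes.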
>>> b = np.array(['Õsc', 'zxc'], dtype='<U3')
>>> b[0].tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'
>>> 'Õsc'.encode('utf-32LE')
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00'

Is that the encoding for 'U' ?

---
another side effect of null truncation: cannot decode truncated data

>>> b.view('S4').tostring()
b'\xd5\x00\x00\x00s\x00\x00\x00c\x00\x00\x00z\x00\x00\x00x\x00\x00\x00c\x00\x00\x00'
>>> b.view('S4')[0]
b'\xd5'
>>> b.view('S4')[0].tostring()
b'\xd5'
>>> b.view('S4')[:1].tostring()
b'\xd5\x00\x00\x00'

>>> b.view('S4')[0].decode('utf-32LE')
Traceback (most recent call last):
  File "<pyshell#101>", line 1, in <module>
    b.view('S4')[0].decode('utf-32LE')
  File "C:\Programs\Python33\lib\encodings\utf_32_le.py", line 11, in decode
    return codecs.utf_32_le_decode(input, errors, True)
UnicodeDecodeError: 'utf32' codec can't decode byte 0xd5 in position 0: truncated data

>>> b.view('S4')[:1].tostring().decode('utf-32LE')
'Õ'

numpy arrays need a decode and encode method

Josef

>
> utf8 encoded bytes
>
> >>> a = np.array(['Õsc'.encode('utf8'), 'zxc'], dtype='S')
> >>> a
> array([b'\xc3\x95sc', b'zxc'],
>       dtype='|S4')
> >>> a.tostring()
> b'\xc3\x95sczxc\x00'
> >>> a.view('S8')
> array([b'\xc3\x95sczxc'],
>       dtype='|S8')
>
> >>> a[0].decode('latin1')
> 'Ã\x95sc'
> >>> a[0].decode('utf8')
> 'Õsc'
>
> Josef
>
>>
>> Josef
>>
>>>
>>>> --------
>>>> The original numpy py3 conversion used latin-1 as default
>>>> (It's still used in statsmodels, and I haven't looked at the structure
>>>> under the common py2-3 codebase)
>>>>
>>>> if sys.version_info[0] >= 3:
>>>>     import io
>>>>     bytes = bytes
>>>>     unicode = str
>>>>     asunicode = str
>>>
>>> These two functions are an abomination:
>>>
>>>>     def asbytes(s):
>>>>         if isinstance(s, bytes):
>>>>             return s
>>>>         return s.encode('latin1')
>>>>     def asstr(s):
>>>>         if isinstance(s, str):
>>>>             return s
>>>>         return s.decode('latin1')
>>>
>>>
>>> Oscar
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
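As a rough illustration of the null-truncation point and the wish for array-level encode/decode in the message above, the sketch below assumes a recent NumPy on Python 3. `np.char.decode` is an existing NumPy function; the variable names are only illustrative, and the second array uses `b'zxc'` rather than mixing `str` and `bytes` as the quoted session did.

```python
import numpy as np

b = np.array(['Õsc', 'zxc'], dtype='<U3')

# Viewing as 'S4' yields one 4-byte item per code point (six items here).
s4 = b.view('S4')

# Extracting a single 'S' item strips trailing NUL bytes, so the scalar
# is no longer a complete UTF-32 code unit and cannot be decoded as one.
first_item = s4[0]

# Slicing keeps an array, and its buffer still holds all four bytes.
first_raw = s4[:1].tobytes()
first_char = first_raw.decode('utf-32-le')

# Viewing back as '<U3' recovers the original strings: the truncation
# only affects how individual items are extracted, not the buffer itself.
roundtrip = s4.view('<U3')

# For byte-string arrays, np.char.decode applies a named codec elementwise.
a = np.array(['Õsc'.encode('utf8'), b'zxc'], dtype='S')
decoded = np.char.decode(a, 'utf8')
```

The design point is that decoding should start from the full buffer (or go through `np.char.decode` with a named codec) rather than from individually extracted `S` scalars, which may already have lost their trailing NUL bytes.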