On Thu, Jan 23, 2014 at 10:41 AM, <josef.p...@gmail.com> wrote:
> On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
>> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
>>> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin <oscar.j.benja...@gmail.com>
>>> wrote:
>>>
>>> >
>>> > It's not safe to stop removing the null bytes. This is how numpy
>>> > determines the length of the strings in a dtype='S' array. The strings
>>> > are not "fixed-width" but rather have a maximum width.
>>>
>>> Exactly--but folks have told us on this list that they want (and are)
>>> using the 'S' style for arbitrary bytes, NOT for text. In which case
>>> you wouldn't want to remove null bytes. This is more evidence that 'S'
>>> was designed to handle c-style one-byte-per-char strings, and NOT
>>> arbitrary bytes, and thus not to map directly to the py2 string type
>>> (you can store null bytes in a py2 string)
>>
>> You can store null bytes in a Py2 string but you normally wouldn't if it was
>> supposed to be text.
>>
>>>
>>> Which brings me back to my original proposal: properly map the 'S'
>>> type to the py3 data model, and maybe add some kind of fixed width
>>> bytes style if there is a use case for that. I still have no idea what
>>> the use case might be.
>>>
>>
>> There would definitely be a use case for a fixed-byte-width
>> bytes-representing-text dtype in record arrays to read from a binary file:
>>
>> dt = np.dtype([
>>     ('name', '|b8:utf-8'),
>>     ('param1', '<i4'),
>>     ('param2', '<i4')
>>     ...
>>     ])
>>
>> with open('binaryfile', 'rb') as fin:
>>     a = np.fromfile(fin, dtype=dt)
>>
>> You could also use this for ASCII if desired. I don't think it really matters
>> that utf-8 uses variable width as long as a too long byte string throws an
>> error (and does not truncate).
>>
>> For non 8-bit encodings there would have to be some way to handle endianness
>> without a BOM, but otherwise I think that it's always possible to pad with
>> zero *bytes* (to a sufficiently large multiple of 4 bytes) when encoding and
>> strip null *characters* after decoding. i.e.:
>>
>> $ cat tmp.py
>> import encodings
>>
>> def test_encoding(s1, enc):
>>     b = s1.encode(enc).ljust(32, b'\0')
>>     s2 = b.decode(enc)
>>     index = s2.find('\0')
>>     if index != -1:
>>         s2 = s2[:index]
>>     assert s1 == s2, enc
>>
>> encodings_set = set(encodings.aliases.aliases.values())
>>
>> for N, enc in enumerate(encodings_set):
>>     try:
>>         test_encoding('qwe', enc)
>>     except LookupError:
>>         pass
>>
>> print('Tested %d encodings without error' % N)
>> $ python3 tmp.py
>> Tested 88 encodings without error
>>
>>> > If the trailing nulls are not removed then you would get:
>>> >
>>> > >>> a[0]
>>> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> > >>> len(a[0])
>>> > 9
>>> >
>>> > And I'm sure that someone would get upset about that.
>>>
>>> Only if they are using it for text-which you "should not" do with py3.
>>
>> But people definitely are using it for text on Python 3. It should be
>> deprecated in favour of something new but breaking it is just gratuitous.
>> Numpy doesn't have the option to make a clean break with Python 3 precisely
>> because it needs to straddle 2.x and 3.x while numpy-based applications are
>> ported to 3.x.
>>
>>> > Some more oddities:
>>> >
>>> > >>> a[0] = 1
>>> > >>> a
>>> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
>>> >       dtype='|S9')
>>> > >>> a[0] = None
>>> > >>> a
>>> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
>>> >       dtype='|S9')
>>>
>>> More evidence that this is a text type.....
>>
>> And the big one:
>>
>> $ python3
>> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
>> [GCC 4.6.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import numpy as np
>> >>> a = np.array(['asd', 'zxc'], dtype='S') # Note unicode strings
>> >>> a
>> array([b'asd', b'zxc'],
>>       dtype='|S3')
>> >>> a[0] = 'qwer' # Unicode string again
>> >>> a
>> array([b'qwe', b'zxc'],
>>       dtype='|S3')
>> >>> a[0] = 'Õscar'
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position
>> 0: ordinal not in range(128)
>
> looks mostly like casting rules to me, which looks like ASCII based
> instead of an arbitrary encoding.
>
> >>> a = np.array(['asd', 'zxc'], dtype='S')
> >>> b = a.astype('U')
> >>> b[0] = 'Õscar'
> >>> a[0] = 'Õscar'
> Traceback (most recent call last):
>   File "<pyshell#17>", line 1, in <module>
>     a[0] = 'Õscar'
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
> >>> b
> array(['Õsc', 'zxc'],
>       dtype='<U3')
> >>> b.astype('S')
> Traceback (most recent call last):
>   File "<pyshell#19>", line 1, in <module>
>     b.astype('S')
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
> >>> b.view('S4')
> array([b'\xd5', b's', b'c', b'z', b'x', b'c'],
>       dtype='|S4')
>
> >>> a.astype('U').astype('S')
> array([b'asd', b'zxc'],
>       dtype='|S3')
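
The ASCII-based casting in the quoted sessions above can be reproduced as a standalone script. This is only a rough sketch: it contrasts the implicit str-to-'S' cast with choosing the encoding explicitly through np.char.encode and np.char.decode. The variable names are made up for illustration, and '\xd5' is written as an escape for 'Õ' so the script does not depend on the source file encoding.

```python
import numpy as np

# Implicit cast: assigning a str into an 'S' array encodes it as ASCII
# and silently truncates to the itemsize (3 bytes here).
a = np.array(['asd', 'zxc'], dtype='S')
a[0] = 'qwer'
truncated = bytes(a[0])
print(truncated)

# A non-ASCII character cannot go through the implicit cast.
try:
    a[0] = '\xd5scar'  # '\xd5' is 'Õ'
    ascii_cast_failed = False
except UnicodeEncodeError:
    ascii_cast_failed = True
print(ascii_cast_failed)

# Explicit alternative: pick the encoding yourself, element by element.
u = np.array(['\xd5sc', 'zxc'])
latin1 = np.char.encode(u, 'latin1')   # one byte per character
utf8 = np.char.encode(u, 'utf8')       # 'Õ' takes two bytes
roundtrip = np.char.decode(latin1, 'latin1')
print(latin1.dtype, utf8.dtype)
print(roundtrip)
```

The explicit route makes the byte width visible in the resulting dtype, which the implicit cast hides behind truncation.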
Another curious example: turning the unicode array into latin-1 bytes via views

>>> b
array(['Õsc', 'zxc'],
      dtype='<U3')
>>> b[0].encode('utf8')
b'\xc3\x95sc'
>>> b[0].encode('latin1')
b'\xd5sc'
>>> b.astype('S')
Traceback (most recent call last):
  File "<pyshell#40>", line 1, in <module>
    b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
position 0: ordinal not in range(128)
>>> c = b.view('S4').astype('S1').view('S3')
>>> c
array([b'\xd5sc', b'zxc'],
      dtype='|S3')
>>> c[0].decode('latin1')
'Õsc'

--------
The original numpy py3 conversion used latin-1 as the default.
(It's still used in statsmodels, and I haven't looked at the structure
under the common py2-3 codebase.)

import sys

if sys.version_info[0] >= 3:
    import io
    bytes = bytes
    unicode = str
    asunicode = str
    def asbytes(s):
        if isinstance(s, bytes):
            return s
        return s.encode('latin1')
    def asstr(s):
        if isinstance(s, str):
            return s
        return s.decode('latin1')

--------------

Josef

>
> Josef
>
>>
>> The analogous behaviour was very deliberately removed from Python 3:
>>
>> >>> a[0] == 'qwe'
>> False
>> >>> a[0] == b'qwe'
>> True
>>
>>
>> Oscar
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
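
For anyone who wants to poke at the latin-1 helpers and the view trick from the message above, here is a small self-contained sketch. asbytes and asstr are copied from the snippet in the message; everything else (all_bytes, euro_failed, the explicit '<U3' dtype) is an assumption added for illustration. The view trick relies on the 'U' dtype storing 4 bytes per character, so on a little-endian layout the first byte of each character equals its latin-1 byte for code points below 256.

```python
import numpy as np

def asbytes(s):
    if isinstance(s, bytes):
        return s
    return s.encode('latin1')

def asstr(s):
    if isinstance(s, str):
        return s
    return s.decode('latin1')

# latin-1 maps byte value n to code point n, so any byte string
# survives a decode/encode round trip unchanged.
all_bytes = bytes(range(256))
print(asbytes(asstr(all_bytes)) == all_bytes)

# Text outside the latin-1 range cannot be encoded this way.
try:
    asbytes('\u20ac')  # euro sign, code point above 255
    euro_failed = False
except UnicodeEncodeError:
    euro_failed = True
print(euro_failed)

# The view trick: reinterpret each 4-byte character as 'S4', keep only
# the first byte, then regroup into 3-byte strings.
u = np.array(['\xd5sc', 'zxc'], dtype='<U3')  # '\xd5' is 'Õ'
c = u.view('S4').astype('S1').view('S3')
print(c)
print(c[0].decode('latin1') == u[0])
```

Note that the view trick silently drops the upper bytes of any character above U+00FF, whereas the helpers raise an error for such input.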