On Thu, Jan 23, 2014 at 5:45 AM, Oscar Benjamin
<oscar.j.benja...@gmail.com> wrote:
> On Wed, Jan 22, 2014 at 05:53:26PM -0800, Chris Barker - NOAA Federal wrote:
>> On Jan 22, 2014, at 1:13 PM, Oscar Benjamin
>> <oscar.j.benja...@gmail.com> wrote:
>>
>> > It's not safe to stop removing the null bytes. This is how numpy
>> > determines the length of the strings in a dtype='S' array. The strings
>> > are not "fixed-width" but rather have a maximum width.
>>
>> Exactly -- but folks have told us on this list that they want (and are)
>> using the 'S' style for arbitrary bytes, NOT for text. In which case
>> you wouldn't want to remove null bytes. This is more evidence that 'S'
>> was designed to handle C-style one-byte-per-char strings, and NOT
>> arbitrary bytes, and thus not to map directly to the py2 string type
>> (you can store null bytes in a py2 string).
>
> You can store null bytes in a Py2 string, but you normally wouldn't if it
> was supposed to be text.
>
>> Which brings me back to my original proposal: properly map the 'S'
>> type to the py3 data model, and maybe add some kind of fixed-width
>> bytes style if there is a use case for that. I still have no idea what
>> the use case might be.
>
> There would definitely be a use case for a fixed-byte-width
> bytes-representing-text dtype in record arrays to read from a binary file:
>
>     dt = np.dtype([
>         ('name', '|b8:utf-8'),
>         ('param1', '<i4'),
>         ('param2', '<i4'),
>         ...
>     ])
>
>     with open('binaryfile', 'rb') as fin:
>         a = np.fromfile(fin, dtype=dt)
>
> You could also use this for ASCII if desired. I don't think it really
> matters that utf-8 uses variable width as long as a too-long byte string
> throws an error (and does not truncate).
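[Editor's note: the '|b8:utf-8' dtype quoted above is only a proposal and does not exist in numpy. A minimal sketch of the closest working equivalent today -- an 'S' (bytes) field decoded explicitly after reading -- follows; the field names, widths, and sample records are made up for illustration.]

```python
import numpy as np

# Closest existing equivalent of the proposed '|b8:utf-8' field:
# a fixed-width 'S' (bytes) field decoded explicitly after reading.
dt = np.dtype([
    ('name', 'S8'),     # 8 raw bytes on disk, zero-padded
    ('param1', '<i4'),
    ('param2', '<i4'),
])

# Build an in-memory "file" instead of opening 'binaryfile'.
raw = np.array([(b'alpha', 1, 2), (b'beta', 3, 4)], dtype=dt).tobytes()
a = np.frombuffer(raw, dtype=dt)

# numpy strips the trailing null bytes on access, so decoding yields text.
names = [n.decode('utf-8') for n in a['name']]
print(names)  # -> ['alpha', 'beta']
```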
>
> For non-8-bit encodings there would have to be some way to handle
> endianness without a BOM, but otherwise I think that it's always possible
> to pad with zero *bytes* (to a sufficiently large multiple of 4 bytes)
> when encoding and strip null *characters* after decoding, i.e.:
>
> $ cat tmp.py
> import encodings
>
> def test_encoding(s1, enc):
>     b = s1.encode(enc).ljust(32, b'\0')
>     s2 = b.decode(enc)
>     index = s2.find('\0')
>     if index != -1:
>         s2 = s2[:index]
>     assert s1 == s2, enc
>
> encodings_set = set(encodings.aliases.aliases.values())
>
> for N, enc in enumerate(encodings_set):
>     try:
>         test_encoding('qwe', enc)
>     except LookupError:
>         pass
>
> print('Tested %d encodings without error' % N)
> $ python3 tmp.py
> Tested 88 encodings without error
>
>> > If the trailing nulls are not removed then you would get:
>> >
>> > >>> a[0]
>> > b'a\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>> > >>> len(a[0])
>> > 9
>> >
>> > And I'm sure that someone would get upset about that.
>>
>> Only if they are using it for text -- which you "should not" do with py3.
>
> But people definitely are using it for text on Python 3. It should be
> deprecated in favour of something new, but breaking it is just gratuitous.
> Numpy doesn't have the option to make a clean break with Python 3,
> precisely because it needs to straddle 2.x and 3.x while numpy-based
> applications are ported to 3.x.
>
>> > Some more oddities:
>> >
>> > >>> a[0] = 1
>> > >>> a
>> > array([b'1', b'string', b'of', b'different', b'length', b'words'],
>> >       dtype='|S9')
>> > >>> a[0] = None
>> > >>> a
>> > array([b'None', b'string', b'of', b'different', b'length', b'words'],
>> >       dtype='|S9')
>>
>> More evidence that this is a text type...
>
> And the big one:
>
> $ python3
> Python 3.2.3 (default, Sep 25 2013, 18:22:43)
> [GCC 4.6.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
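[Editor's note: the pad-with-zero-bytes / strip-null-characters round trip described above can be shown concretely for one multi-byte encoding. A sketch follows, using utf-16-le (chosen arbitrarily; fixing the byte order in the codec name sidesteps the BOM issue) and a made-up field width of 32 bytes, a multiple of 4.]

```python
# Round trip: pad the encoded bytes with zero *bytes* up to the field
# width, then strip null *characters* after decoding.

def encode_padded(s, enc, width):
    b = s.encode(enc)
    if len(b) > width:
        raise ValueError('encoded string too long for field')
    return b.ljust(width, b'\0')   # pad with zero bytes

def decode_padded(b, enc):
    s = b.decode(enc)
    index = s.find('\0')           # strip trailing null characters
    return s[:index] if index != -1 else s

b = encode_padded('qwe', 'utf-16-le', 32)
print(len(b), decode_padded(b, 'utf-16-le'))  # -> 32 qwe
```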
> >>> import numpy as np
> >>> a = np.array(['asd', 'zxc'], dtype='S')  # Note unicode strings
> >>> a
> array([b'asd', b'zxc'],
>       dtype='|S3')
> >>> a[0] = 'qwer'  # Unicode string again
> >>> a
> array([b'qwe', b'zxc'],
>       dtype='|S3')
> >>> a[0] = 'Õscar'
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in
> position 0: ordinal not in range(128)
Looks mostly like casting rules to me, which look ASCII-based rather than
based on an arbitrary encoding:

>>> a = np.array(['asd', 'zxc'], dtype='S')
>>> b = a.astype('U')
>>> b[0] = 'Õscar'
>>> a[0] = 'Õscar'
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    a[0] = 'Õscar'
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position
0: ordinal not in range(128)
>>> b
array(['Õsc', 'zxc'],
      dtype='<U3')
>>> b.astype('S')
Traceback (most recent call last):
  File "<pyshell#19>", line 1, in <module>
    b.astype('S')
UnicodeEncodeError: 'ascii' codec can't encode character '\xd5' in position
0: ordinal not in range(128)
>>> b.view('S4')
array([b'\xd5', b's', b'c', b'z', b'x', b'c'],
      dtype='|S4')
>>> a.astype('U').astype('S')
array([b'asd', b'zxc'],
      dtype='|S3')

Josef

> The analogous behaviour was very deliberately removed from Python 3:
>
> >>> a[0] == 'qwe'
> False
> >>> a[0] == b'qwe'
> True
>
> Oscar

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
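[Editor's note: where an explicit encoding is wanted instead of the implicit ASCII cast shown above, `numpy.char.encode` and `numpy.char.decode` apply `str.encode`/`bytes.decode` element-wise. A sketch of that round trip follows; it is a workaround, not a change to the casting rules themselves.]

```python
import numpy as np

# Element-wise encode/decode with an explicit codec avoids the
# UnicodeEncodeError that astype('S') raises for non-ASCII text.
b = np.array(['Õsc', 'zxc'], dtype='<U3')
s = np.char.encode(b, 'utf-8')   # 'Õ' becomes two bytes, so the S width grows
u = np.char.decode(s, 'utf-8')   # back to unicode, no error
print(u.tolist())  # -> ['Õsc', 'zxc']
```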