On 05/06/17 19:40, Chris Barker wrote:
>
> > If you ask me, passing a unicode string to fromstring with sep=''
> > (i.e. to parse binary data) should ALWAYS raise an error: the
> > semantics only make sense for strings of bytes.
>
> exactly -- we really should have a "frombytes()" alias for
> fromstring() and it should only work for actual bytes objects (strings
> on py2, naturally).
>
> and overloading fromstring() to mean both "binary dump of data" and
> "parse the text" due to whether the sep argument is set was always a
> bad idea :-(
>
> .. and fromstring(s, sep=a_sep_char)

As it happens, this is pretty much what stdlib bytearray does since 3.2
(http://bugs.python.org/issue8990)

> has been semi broken (or at least not robust) forever anyway.
>
> > Currently, there appears to be some UTF-8 conversion going on, which
> > creates potentially unexpected results:
> >
> > >>> s = 'αβγδ'
> > >>> a = np.fromstring(s, 'u1')
> > >>> a
> > array([206, 177, 206, 178, 206, 179, 206, 180], dtype=uint8)
> > >>> assert len(a) * a.dtype.itemsize == len(s)
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > AssertionError
> > >>>
> >
> > This is, apparently (https://github.com/numpy/numpy/issues/2152),
> > due to how the internals of Python deal with unicode strings in C
> > code, and not due to anything numpy is doing.
>
> exactly -- py3 strings are pretty nifty implementation of unicode text
> -- they have nothing to do with storing binary data, and should not be
> used that way. There is essentially no reason you would ever want to
> pass the actual binary representation to any other code.
>
> fromstring should be re-named frombytes, and it should raise an
> exception if you pass something other than a bytes object (or maybe a
> memoryview or other binary container?)
>
> we might want to keep fromstring() for parsing strings, but only if it
> were fixed...
>
> > IMHO calling fromstring(..., sep='') with a unicode string should be
> > deprecated and perhaps eventually forbidden. (Or fixed, but that would
> > break backwards compatibility)
>
> agreed.
>
> > > Python3 assumes 4-byte strings but in reality most of the time
> > > we deal with 1-byte strings, so there is huge waste of resources
> > > when dealing with 4-bytes. For many serious projects it is just
> > > not needed.
> >
> > That's quite enough anglo-centrism, thank you. For when you need byte
> > strings, Python 3 has a type for that. For when your strings contain
> > text, bytes with no information on encoding are not enough.
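
To illustrate the point about bytes and encodings, here is a minimal
sketch. It assumes NumPy is imported as np and uses the existing
np.frombuffer, not the frombytes() alias proposed above. The variable
names are only for illustration. The same text yields different raw
data depending on the encoding you pick, so the encoding has to be
chosen explicitly before the bytes reach numpy:

```python
import numpy as np

s = 'αβγδ'

# Pick the encoding explicitly; the byte count depends on that choice.
utf8 = s.encode('utf-8')        # 2 bytes per character for this text
utf16 = s.encode('utf-16-le')   # also 2 bytes per character, other values

# np.frombuffer reads the raw bytes without copying or converting them.
a8 = np.frombuffer(utf8, dtype='u1')
a16 = np.frombuffer(utf16, dtype='<u2')

print(a8.tolist())   # same values as the fromstring example quoted above
print(a16.tolist())  # the code points themselves, since all are in the BMP

# Going back to text needs the same encoding that produced the bytes.
assert a8.tobytes().decode('utf-8') == s
assert a16.tobytes().decode('utf-16-le') == s
```

With bytes as input there is no hidden conversion step, and
len(a8) * a8.dtype.itemsize equals len(utf8) by construction.
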
>
> There was a big thread about this recently -- it seems to have not
> quite come to a conclusion. But anglo-centrism aside, there is
> substantial demand for a "smaller" way to store mostly-ascii text.
>
> I _think_ the conversation was steering toward an encoding-specified
> string dtype, so us anglo-centric folks could use latin-1 or utf-8.
>
> But someone would need to write the code.
>
> -CHB
>
> > > There can be some convenience methods for ascii operations,
> > > like eg char.toupper(), but currently they don't seem to work
> > > with integer arrays so why not make those potentially useful
> > > methods usable and make them work on normal integer arrays?
> >
> > I don't know what you're doing, but I don't think numpy is
> > normally the right tool for text manipulation...
>
> I agree here. But if one were to add such a thing (vectorized string
> operations) -- I'd think the thing to do would be to wrap (or port)
> the python string methods. But it should only work for actual string
> dtypes, of course.
>
> note that another part of the discussion previously suggested that we
> have a dtype that wraps a native python string object -- then you'd
> get all for free. This is essentially an object array with strings in
> it, which you can do now.
>
> -CHB
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

--
Thomas Jollans

m ☎ +31 6 42630259
e ✉ t...@tjol.eu