Hello, all!

I recently ran into what looks like a bug in NumPy's string arrays: the ASCII NUL character ('\x00') is not preserved when it appears at the end of a string.

For example:

Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> print numpy.version.version
1.3.0
>>> arr = numpy.empty(1, 'S2')
>>> arr[0] = 'ab'
>>> arr
array(['ab'],
      dtype='|S2')
>>> arr[0] = 'c\x00'
>>> arr
array(['c'],
      dtype='|S2')

It seems that the string array pads strings shorter than the maximum size with NUL characters, and therefore treats any NUL characters at the end of a string as padding. As long as every string I store fills the full width, no information is actually lost, but I don't want to have to re-add the trailing NULs every time I ask the array what it is holding.
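To check that the trailing '\x00' really is still in the array's memory and is only dropped on the way out, the same buffer can be viewed as raw unsigned bytes. This is only a minimal sketch: it assumes a Python version with bytes literals (2.6 or later), and the names `element`, `raw` and `first_bytes` are made up for illustration.

```python
import numpy

# Same setup as in the session above, written with a bytes literal.
arr = numpy.empty(1, 'S2')
arr[0] = b'c\x00'

# Reading the element back through the string dtype drops the trailing NUL.
element = arr[0]

# Viewing the same memory as unsigned bytes strips nothing:
# one row per element, one column per byte of the fixed width.
raw = arr.view(numpy.uint8).reshape(len(arr), arr.dtype.itemsize)
first_bytes = raw[0].tolist()

print(len(element))   # 1
print(first_bytes)    # [99, 0]
```

If that holds up, reading through a uint8 view (or keeping the data as a 2-D uint8 array in the first place) might be one way to sidestep the stripping, though I have not measured what it costs.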

Is this already a well-known bug? I couldn't find it on the NumPy bug tracker, but I could easily have missed it, or it may have been triaged and deemed acceptable because there is no better way to deal with arbitrary-length strings. Is there an easy way to avoid this problem? Nearly every performance-intensive part of my program deals with these arrays, so I'd rather not replace them with a slower dictionary.

I can't imagine this issue hasn't come up before; I encountered it by using NumPy arrays to store Python structs, something I can imagine is done fairly often. As such, I apologize for bringing it up again!
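For anyone wondering how a trailing '\x00' ends up in the data at all, here is a rough sketch using the standard struct module. The format '<H' and the value 99 are chosen purely for illustration, and the rstrip() call only models what the string array appears to do on read; it is not NumPy's actual code.

```python
import struct

FMT = '<H'  # little-endian unsigned short, 2 bytes (illustrative choice)

# Packing a small integer naturally yields a trailing NUL byte.
packed = struct.pack(FMT, 99)

# Model of what comes back from an 'S2' array: trailing NULs stripped.
stripped = packed.rstrip(b'\x00')

# The manual fix: pad back to the struct's size before unpacking.
restored = stripped.ljust(struct.calcsize(FMT), b'\x00')
value = struct.unpack(FMT, restored)[0]

print(len(packed))    # 2
print(len(stripped))  # 1
print(value)          # 99
```

Unpacking the stripped string directly would fail, because struct.unpack requires exactly struct.calcsize(FMT) bytes, which is why the padding step is needed on every read.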

Nathaniel
--
http://mail.python.org/mailman/listinfo/python-list
