On Aug 22, 2010, at 4:36 PM, Nathaniel Smith wrote:

> I'm experimenting with a user-defined "enumeration" dtype -- where the
> underlying array holds a set of integers, but they (mostly) appear to
> the user as strings. (This would be potentially useful for
> representing categorical data, modeling hdf5 enumerations, etc.) So
> for each set of enumerated values, I have one Python-level object that
> stores the mapping between strings and integers (an instance of the
> 'Enum' class), and a few Python-level objects that represent each of
> the enumerated values (instances of the 'EnumValue' class).
>
> To map this into numpy, I've defined and registered a single custom
> dtype with 'EnumValue' as its typeobj, and then when I want to create
> an array of enumerations I make a copy of this registered dtype (with
> PyArray_DescrNewFromType) and stash a reference to the appropriate
> Enum object in its 'metadata' dictionary.
>
> Question 1: Is this overall approach -- of only calling
> PyArray_RegisterDataType once, and then sharing the resulting typenum
> among all my different dtype instances -- correct? It doesn't seem
> reasonable to register a new dtype for every different set of
> enumerations, because AFAICT this would create a memory leak (since
> you can't "unregister" a dtype).

Yes, this is the approach to take.

>
> Anyway, that seems to be working okay, but then I wanted to teach "=="
> about my new dtype, so that I can compare to strings and such. So, I
> need to register my comparison function with the "np.equal" ufunc.
>
> Question 2: Am I missing something, or does the ufunc API make this
> impossible? The problem is that a "PyUFuncGenericFunction" doesn't
> have any way to find the dtypes of the arrays that it's working on.
> All of the PyArray_ArrFuncs functions take a pointer to the underlying
> ndarray as an argument, so that when working with a string or void
> array, you can find the actual dtype and figure out (e.g.) the size of
> the objects involved. But ufunc inner loops don't get that, so I guess
> it's just impossible to define a ufunc over variable-sized data, or
> data that you need to be able to see the dtype metadata to interpret?

Yes, currently that is correct. Variable data-types don't work in ufuncs for some subtle reasons. But, the changes that allow date-times to work also fix (or will fix) this problem as a side-effect. The necessary changes to ufuncs have not been made, yet, however. They are planned. And, yes, this would allow ufuncs to be used for string equality testing, etc.

Thanks,

-Travis
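
For readers following the thread, here is a rough pure-Python sketch of the limitation discussed in Question 2. It is not Nathaniel's code and it does not touch the NumPy C API. The `Enum` class below is only a hypothetical stand-in for the mapping object described in the question, and `equal_loop` and `enum_equal` are made-up names. The point it tries to show is that a loop which sees only the raw integers cannot compare them to a string; something outside the loop has to hold the mapping (the dtype 'metadata' in the original design) and translate the string first.

```python
class Enum:
    """Hypothetical stand-in for the 'Enum' mapping object in the thread."""

    def __init__(self, names):
        self.names = list(names)
        # Map each string label to its integer code (its position in the list).
        self.codes = {name: i for i, name in enumerate(self.names)}

    def encode(self, labels):
        """Convert string labels to the integers the array would store."""
        return [self.codes[label] for label in labels]

    def decode(self, codes):
        """Convert stored integers back to the strings the user sees."""
        return [self.names[code] for code in codes]


def equal_loop(codes, other_code):
    """Mimics an inner loop: it receives raw integers only, no mapping."""
    return [code == other_code for code in codes]


def enum_equal(enum, codes, label):
    """Compare stored codes to a string; this needs access to the Enum."""
    # Unknown labels become -1, which never matches a valid code.
    other_code = enum.codes.get(label, -1)
    return equal_loop(codes, other_code)


colors = Enum(["red", "green", "blue"])
data = colors.encode(["green", "red", "green"])
matches = enum_equal(colors, data, "green")
```

Here `equal_loop` alone could only compare `data` against another integer; it is `enum_equal`, which is handed the `Enum`, that can make sense of the string "green".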
