I'm experimenting with a user-defined "enumeration" dtype -- where the underlying array holds a set of integers, but they (mostly) appear to the user as strings. (This would be potentially useful for representing categorical data, modeling hdf5 enumerations, etc.) So for each set of enumerated values, I have one Python-level object that stores the mapping between strings and integers (an instance of the 'Enum' class), and a few Python-level objects that represent each of the enumerated values (instances of the 'EnumValue' class).
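For readers without the attachment, the two Python-level objects described above might look roughly like this. This is only a pure-Python sketch, not the code in npenum.pyx: the `encode` and `decode` methods and the `by_name` attribute are hypothetical names introduced for illustration, and a stdlib `array` of unsigned ints stands in for the numpy array.

```python
from array import array


class EnumValue:
    """One named value belonging to a particular Enum (illustrative only)."""

    def __init__(self, enum, name, code):
        self.enum = enum
        self.name = name
        self.code = code

    def __repr__(self):
        # Values display as their bare name, as in the sample output.
        return self.name

    def __eq__(self, other):
        # Compare equal to a plain string with the same name, or to an
        # EnumValue with the same code from the same Enum.
        if isinstance(other, str):
            return self.name == other
        if isinstance(other, EnumValue):
            return self.enum is other.enum and self.code == other.code
        return NotImplemented

    def __hash__(self):
        return hash((id(self.enum), self.code))


class Enum:
    """Holds the string <-> integer mapping for one set of values."""

    def __init__(self, names):
        self.values = [EnumValue(self, n, i) for i, n in enumerate(names)]
        self.by_name = {v.name: v for v in self.values}

    def encode(self, names):
        # Hypothetical helper: strings -> underlying integer storage.
        return array("I", (self.by_name[n].code for n in names))

    def decode(self, codes):
        # Hypothetical helper: stored integers -> EnumValue objects.
        return [self.values[c] for c in codes]


e = Enum(["low", "medium", "high"])
codes = e.encode(["low", "low", "high", "medium"])
decoded = e.decode(codes)
```

The stored integers carry no meaning on their own; everything the user sees goes through the one `Enum` instance, which is why that instance has to be reachable from the dtype.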
To map this into numpy, I've defined and registered a single custom dtype with 'EnumValue' as its typeobj, and then when I want to create an array of enumerations I make a copy of this registered dtype (with PyArray_DescrNewFromType) and stash a reference to the appropriate Enum object in its 'metadata' dictionary.

Question 1: Is this overall approach -- of only calling PyArray_RegisterDataType once, and then sharing the resulting typenum among all my different dtype instances -- correct? It doesn't seem reasonable to register a new dtype for every different set of enumerations, because AFAICT this would create a memory leak (since you can't "unregister" a dtype).

Anyway, that seems to be working okay, but then I wanted to teach "==" about my new dtype, so that I can compare to strings and such. So, I need to register my comparison function with the "np.equal" ufunc.

Question 2: Am I missing something, or does the ufunc API make this impossible? The problem is that a "PyUFuncGenericFunction" doesn't have any way to find the dtypes of the arrays that it's working on. All of the PyArray_ArrFuncs functions take a pointer to the underlying ndarray as an argument, so that when working with a string or void array, you can find the actual dtype and figure out (e.g.) the size of the objects involved. But ufunc inner loops don't get that, so I guess it's just impossible to define a ufunc over variable-sized data, or data that you need to be able to see the dtype metadata to interpret?

This seems easy enough to fix, and would probably allow the removal of a big pile of code in arrayobject.c that does special-case handling for "==" on strings and void arrays. (Another side-effect of the current special-case approach is that "==" and "np.equal" behave differently on arrays of strings.) But is there another option?

Anyway, thanks for your help! I'll attach the code in case it helps (and because I'm not too confident I'm getting the details right!).
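To make the concern in Question 2 concrete, here is a small pure-Python sketch (no numpy involved) of why the comparison cannot be computed from the raw integers alone. The function name `enum_equal` and the dict `name_to_code` are hypothetical; the dict stands in for the Enum object kept in the dtype's 'metadata' dictionary.

```python
def enum_equal(codes, strings, name_to_code):
    """Elementwise comparison of stored integer codes against strings.

    name_to_code is a stand-in for the Enum held in the dtype metadata;
    without it the integer codes cannot be interpreted.
    """
    # Strings that are not in the mapping look up as None and never match.
    return [code == name_to_code.get(s) for code, s in zip(codes, strings)]


raw = [0, 0, 2, 1]  # the integers actually stored in the array
probe = ["low", "low", "high", "medium"]

# Two different enumerations that happen to share the same typenum.
severity = {"low": 0, "medium": 1, "high": 2}
reversed_severity = {"high": 0, "medium": 1, "low": 2}

same = enum_equal(raw, probe, severity)
different = enum_equal(raw, probe, reversed_severity)
```

The same raw buffer gives different answers under the two mappings, so a loop that is handed only data pointers, dimensions and strides has no way to pick the right one.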
Sample usage:

    >>> from npenum import *; import numpy as np
    >>> e = Enum(["low", "medium", "high"])
    >>> a = np.array(["low", "low", "high", "medium"], dtype=e)
    >>> a
    array([low, low, high, medium], dtype=EnumValue)
    >>> a.view(Enum.get(a).inttype)
    array([0, 0, 2, 1], dtype=uint32)

Thanks,
-- Nathaniel
Attachment: npenum.pyx (binary data)
