On Fri, Jun 24, 2011 at 8:57 AM, Keith Goodman <[email protected]> wrote:
> On Thu, Jun 23, 2011 at 3:24 PM, Mark Wiebe <[email protected]> wrote:
> > On Thu, Jun 23, 2011 at 5:05 PM, Keith Goodman <[email protected]> wrote:
> >>
> >> On Thu, Jun 23, 2011 at 1:53 PM, Mark Wiebe <[email protected]> wrote:
> >> > Enthought has asked me to look into the "missing data" problem and
> >> > how NumPy could treat it better. I've considered the different ideas
> >> > of adding dtype variants with a special signal value and masked
> >> > arrays, and concluded that adding masks to the core ndarray appears
> >> > to be the best way to deal with the problem in general.
> >> > I've written a NEP that proposes a particular design, viewable here:
> >> >
> >> > https://github.com/m-paradox/numpy/blob/cmaskedarray/doc/neps/c-masked-array.rst
> >> > There are some questions at the bottom of the NEP which definitely
> >> > need discussion to find the best design choices. Please read, and let
> >> > me know of all the errors and gaps you find in the document.
> >>
> >> Wow, that is exciting.
> >>
> >> I wonder about the relative performance of the two possible
> >> implementations (mask and NA) in the PEP.
> >
> > I've given that some thought, and I don't think there's a clear way to
> > tell what the performance gap would be without implementations of both
> > to benchmark against each other. I favor the mask primarily because it
> > provides masking for all data types in one go with a single consistent
> > interface to program against. For adding NA signal values, each new data
> > type would need a lot of work to gain the same level of support.
> >>
> >> If you are, say, doing a calculation along the columns of a 2d array
> >> one element at a time, then you will need to grab an element from the
> >> array and grab the corresponding element from the mask. I assume the
> >> corresponding data and mask elements are not stored together. That
> >> would be slow since memory access is usually where time is spent.
> >> In this regard NA would be faster.
> >
> > Yes, the masks add more memory traffic and some extra calculation, while
> > the NA signal values just require some additional calculations.
>
> I guess a better example would have been summing along rows instead of
> columns of a large C order array. If one needs to look at both the
> data and the mask then wouldn't summing along rows in cython be about
> as slow as it is currently to sum along columns?
>

Not quite. Both the mask and the array data are being traversed coherently,
so it isn't jumping around in memory like in the columns case you're
describing.

> >> I currently use NaN as a missing data marker. That adds things like
> >> this to my cython code:
> >>
> >> if a[i] == a[i]:
> >>     asum += a[i]
> >>
> >> If NA also had the property NA == NA is False, then it would be easy
> >> to use.
> >
> > That's what I believe it should do, and I guess this is a strike against
> > the idea of returning None for a single missing value.
>
> If NA == NA is False then I wouldn't need to look at the mask in the
> example above. Or would ndarray have to look at the mask in order to
> return NA for a[i]? Which would mean __getitem__ would need to look at
> the mask?
>

What R does is return NA for NA == NA. Then, if you try to use it as a
boolean, it throws an exception. I like this approach.

> If the missing value is returned as a 0d array (so that NA == NA is
> False), would that break cython in a fundamental way since it could
> not always return a same-sized scalar when you index into an array?
>

I don't know enough about Cython internals to comment, sorry.

-Mark

> >> A mask, on the other hand, would be more difficult for third
> >> party packages to support. You have to check if the mask is present
> >> and if so do a mask-aware calculation; if it is not present then you
> >> have to do a non-mask based calculation.
> >
> > I actually see the mask as being easier for third party packages to
> > support, particularly from C. Having regular C-friendly values with a
> > boolean mask is a lot friendlier than values that require a lot of
> > special casing like the NA signal values would require.
> >
> >>
> >> So you have two code paths.
> >> You also need to check if any of the input arrays have masks and if so
> >> apply masks to the other inputs, etc.
> >
> > Most of the time, the masks will transparently propagate or not along
> > with the arrays, with no effort required. In Python, the code you write
> > would be virtually the same between the two approaches.
> > -Mark
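
For readers following the thread, the two approaches compared above can be
sketched in plain Python. This is only an illustration under stated
assumptions, not the API proposed in the NEP: the function names
nansum_self_compare and masked_sum are made up here, and the convention that
a True mask entry means "missing" is an assumption the thread does not
settle. The first function is the a[i] == a[i] idiom from the quoted Cython
snippet; the second shows the two code paths (mask absent or present) and
walks the data and the mask in lockstep.

```python
import math


def nansum_self_compare(a):
    """Sum a 1-D sequence of floats, skipping NaN via the a[i] == a[i] test."""
    asum = 0.0
    for i in range(len(a)):
        # NaN is the only float value that compares unequal to itself.
        if a[i] == a[i]:
            asum += a[i]
    return asum


def masked_sum(a, mask=None):
    """Sum a 1-D sequence, skipping elements flagged in an optional mask.

    Assumed (hypothetical) convention: mask[i] is True when a[i] is missing.
    """
    asum = 0.0
    if mask is None:
        # Code path 1: no mask present, so every element counts.
        for x in a:
            asum += x
        return asum
    # Code path 2: data and mask are read side by side, in the same order.
    for x, missing in zip(a, mask):
        if not missing:
            asum += x
    return asum


data = [1.0, math.nan, 2.5, 4.0]
mask = [False, True, False, False]

total_nan = nansum_self_compare(data)
total_mask = masked_sum(data, mask)
print(total_nan, total_mask)  # prints: 7.5 7.5
```

The loops index element by element on purpose, mirroring the Cython-style
code in the thread; they accept a list or a 1-D ndarray alike. Whether the
extra mask read costs much in practice would need benchmarking, as noted
above.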
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
