On Sat, Jun 25, 2011 at 9:44 AM, Wes McKinney <[email protected]> wrote:
> On Sat, Jun 25, 2011 at 10:25 AM, Charles R Harris
> <[email protected]> wrote:
> > On Sat, Jun 25, 2011 at 8:14 AM, Wes McKinney <[email protected]> wrote:
> >>
> >> On Sat, Jun 25, 2011 at 12:42 AM, Charles R Harris
> >> <[email protected]> wrote:
> >> >
> >> >
> >> > On Fri, Jun 24, 2011 at 10:06 PM, Wes McKinney <[email protected]>
> >> > wrote:
> >> >>
> >> >> On Fri, Jun 24, 2011 at 11:59 PM, Nathaniel Smith <[email protected]>
> >> >> wrote:
> >> >> > On Fri, Jun 24, 2011 at 6:57 PM, Benjamin Root <[email protected]>
> >> >> > wrote:
> >> >> >> On Fri, Jun 24, 2011 at 8:11 PM, Nathaniel Smith <[email protected]>
> >> >> >> wrote:
> >> >> >>> This is a situation where I would just... use an array and a mask,
> >> >> >>> rather than a masked array. Then lots of things -- changing fill
> >> >> >>> values, temporarily masking/unmasking things, etc. -- come for
> >> >> >>> free, just from knowing how arrays and boolean indexing work?
> >> >> >>
> >> >> >> With a masked array, it is "for free". Why re-invent the wheel? It
> >> >> >> has already been done for me.
> >> >> >
> >> >> > But it's not for free at all. It's an additional concept that has to
> >> >> > be maintained, documented, and learned (with the last cost, which is
> >> >> > multiplied by the number of users, being by far the greatest). It's
> >> >> > not reinventing the wheel, it's saying hey, I have wheels and axles,
> >> >> > but what I really need the library to provide is a wheel+axle
> >> >> > assembly!
> >> >>
> >> >> You're communicating my argument better than I am.
> >> >>
> >> >> >>> Do we really get much advantage by building all these complex
> >> >> >>> operations in? I worry that we're trying to anticipate and write
> >> >> >>> code for every situation that users find themselves in, instead of
> >> >> >>> just giving them some simple, orthogonal tools.
> >> >> >>>
> >> >> >>
> >> >> >> This is the danger, which is why I advocate retaining the
> >> >> >> MaskedArray type that would provide the high-level "intelligent"
> >> >> >> operations, meanwhile having in the core the basic data structures
> >> >> >> for pairing a mask with an array, and to recognize a special np.NA
> >> >> >> value that would act upon the mask rather than the underlying data.
> >> >> >> Users would get very basic functionality, while the MaskedArray
> >> >> >> would continue to provide the interface that we are used to.
> >> >> >
> >> >> > The interface as described is quite different... in particular, all
> >> >> > aggregate operations would change their behavior.
> >> >> >
> >> >> >>> As a corollary, I worry that learning and keeping track of how
> >> >> >>> masked arrays work is more hassle than just ignoring them and
> >> >> >>> writing the necessary code by hand as needed. Certainly I can
> >> >> >>> imagine that *if the mask is a property of the data* then it's
> >> >> >>> useful to have tools to keep it aligned with the data through
> >> >> >>> indexing and such. But some of these other things are quicker to
> >> >> >>> reimplement than to look up the docs for, and the reimplementation
> >> >> >>> is easier to read, at least for me...
> >> >> >>
> >> >> >> What you are advocating is similar to the "tried-n-true" coding
> >> >> >> practice of Matlab users of using NaNs. You will hear from Matlab
> >> >> >> programmers about how it is the greatest idea since sliced bread
> >> >> >> (and I was one of them). Then I was introduced to Numpy, and while
> >> >> >> I do sometimes still do the NaN approach, I realized that the
> >> >> >> masked array is a "better" way.
> >> >> >
> >> >> > Hey, no need to go around calling people Matlab programmers, you
> >> >> > might hurt someone's feelings.
> >> >> >
> >> >> > But seriously, my argument is that every abstraction and new concept
> >> >> > has a cost, and I'm dubious that the full masked array abstraction
> >> >> > carries its weight and justifies this cost, because it's highly
> >> >> > redundant with existing abstractions. That has nothing to do with how
> >> >> > tried-and-true anything is.
> >> >>
> >> >> +1. I think I will personally only be happy if "masked array" can be
> >> >> implemented while incurring near-zero cost from the end user
> >> >> perspective. If what we end up with is a faster implementation of
> >> >> numpy.ma in C I'm probably going to keep on using NaN... That's why
> >> >> I'm entirely insistent that whatever design be dogfooded on non-expert
> >> >> users. If it's very much harder / trickier / nuanced than R, you will
> >> >> have failed.
> >> >>
> >> >
> >> > This sounds unduly pessimistic to me. It's one thing to suggest
> >> > different approaches, another to cry doom and threaten to go eat worms.
> >> > And all before the code is written, benchmarks run, or trial made of
> >> > the usefulness of the approach. Let us see how things look as they get
> >> > worked out. Mark has a good track record for innovative tools and I'm
> >> > rather curious myself to see what the result is.
> >> >
> >> > Chuck
> >> >
> >> >
> >> > _______________________________________________
> >> > NumPy-Discussion mailing list
> >> > [email protected]
> >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >
> >> >
> >>
> >> I hope you're right. So far it seems that anyone who has spent real
> >> time with R (e.g. myself, Nathaniel) has expressed serious concerns
> >> about the masked approach.
> >> And we got into this discussion at the Data
> >> Array summit in Austin last month because we're trying to make Python
> >> more competitive with R vis-a-vis statistical and financial applications.
> >> I'm just trying to be (R)ealistic =P Remember that I very earnestly am
> >> doing everything I can these days to make scientific Python more
> >> successful in finance and statistics. One big difference with R's
> >> approach is that we care more about performance than the R community
> >> does. So maybe having special NA values will be prohibitive for that
> >> reason.
> >>
> >> Mark indeed has a fantastic track record and I've been extremely
> >> impressed with his NumPy work, so I've no doubt he'll do a good job. I
> >> just hope that you don't push aside my input-- my opinions are formed
> >> entirely based on my domain experience.
> >>
> >
> > I think what we really need to see are the use cases and work flow. The
> > ones that hadn't occurred to me before were memory mapped files and data
> > stored on disk in general. I think we may need some standard format for
> > masked data on disk if we don't go the NA value route.
> >
> > Chuck
> >
>
> Here are some things I can think of that would be affected by any changes
> here
>
> 1) Right now users of pandas can type pandas.isnull(series[5]) and
> that will yield True if the value is NA for any dtype. This might be
> hard to support in the masked regime
>

I think this would map to np.ismissing(series[5]). What you want probably
depends on whether series[5] represents a single value, a struct dtype
value, or is itself an array.

> 2) Functions like {Series, DataFrame}.fillna would hopefully look just
> like this:
>
>     # value is 0 or some other value to fill
>     new_series = self.copy()
>     new_series[isnull(new_series)] = value
>

That should work fine, yes.

> Keep in mind that people will write custom NA handling logic. So they
> might do:
>
>     series[isnull(other_series) & isnull(other_series2)] = val
>
> 3) Nulling / NA-ing out data is very common
>
>     # null out this data up to and including date1 in these three columns
>     frame.ix[:date1, [col1, col2, col3]] = NaN
>

With np.NA instead of NaN, I think it would give what you want.

>     # But this should work fine too
>     frame.ix[:date1, [col1, col2, col3]] = 0
>

Under the hood, this would be unmasking and setting the appropriate values.

> I'll try to think of some others. The main thing is that the NA value
> is very easy to think about and fits in naturally with how people (at
> least statistical / financial users) think about and work with data.
> If you have to say "I have to set these mask locations to True" it
> introduces additional mental effort compared with "I'll just set these
> values to NA"
>

This is exactly what I mean when I'm talking about implementation details
versus interface choices. With enough use cases like you've given here, I'm
hoping to get that interface right.

-Mark
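
To make the quoted use cases concrete, here is a minimal sketch in plain NumPy with NaN as the missing-value marker. It is an illustration under assumptions, not the interface proposed in this thread: `isnull` and `fillna` below are hypothetical helpers written only for this sketch (they are not the pandas functions), the proposed np.NA and np.ismissing are not used, and the final lines show the same data as an array paired with a boolean mask through the existing numpy.ma module.

```python
import numpy as np


def isnull(values):
    """Hypothetical helper: True where a float array holds NaN."""
    return np.isnan(np.asarray(values, dtype=float))


def fillna(values, value):
    """Hypothetical helper mirroring the fillna pattern quoted above."""
    new_values = np.array(values, dtype=float)  # copy, input is left alone
    new_values[isnull(new_values)] = value
    return new_values


series = np.array([1.0, np.nan, 3.0, np.nan])
other = np.array([np.nan, np.nan, 5.0, 6.0])

# Use case 2: fill missing entries with a chosen value.
filled = fillna(series, 0.0)

# Custom NA logic: assign only where both inputs are missing.
both_missing = isnull(series) & isnull(other)
patched = series.copy()
patched[both_missing] = -1.0

# Use case 3: null out a leading slice by assigning the NA marker.
nulled = series.copy()
nulled[:1] = np.nan

# The same data as an array plus a separate boolean mask (numpy.ma).
masked = np.ma.masked_array(series, mask=isnull(series))
masked_filled = masked.filled(0.0)
```

In the NaN version every step is ordinary boolean indexing and assignment; the numpy.ma version reaches the same filled result but keeps the mask as a separate piece of state alongside the data.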
