On Mon, Apr 23, 2012 at 11:35 PM, Fernando Perez <[email protected]> wrote:

> On Mon, Apr 23, 2012 at 8:49 PM, Stéfan van der Walt <[email protected]>
> wrote:
> > If you are referring to the traditional concept of a fork, and not to
> > the type we frequently make on GitHub, then I'm surprised that no one
> > has objected already. What would a fork solve? To paraphrase the
> > regexp saying: after forking, we'll simply have two problems.
>
> I concur with you here: github 'forks', yes, as many as possible!
> Hopefully every one of those will produce one or more PRs :) But a
> fork in the sense of a divergent parallel project? I think that would
> only be indicative of a complete failure to find a way to make
> progress here, and I doubt we're anywhere near that state.
>
> That forks are *possible* is indeed a valuable and important option in
> open source software, because it means that a truly dysfunctional
> original project team/direction can't hold a community hostage
> forever. But that doesn't mean that full-blown forks should be
> considered lightly, as they also carry enormous costs.
>
> I see absolutely nothing in the current scenario to even remotely
> consider that a full-blown fork would be a good idea, and I hope I'm
> right. It seems to me we're making progress on problems that led to
> real difficulties last year, but from multiple parties I see signs
> that give me reason to be optimistic that the project is getting
> better, not worse.
>

We certainly aren't there at the moment, but I can see us heading that way.

But let's back up a bit. Numpy 1.6.0 came out just about 1 year ago. Since then datetime, NA, polynomial work, and various other enhancements have gone in along with some 280 bug fixes. The major technical problem blocking a 1.7 release is getting datetime working reliably on windows. So I think that is where the short term effort needs to be.
Meanwhile, we are spending effort to get out a 1.6.2 just so people can work with a stable version with some of the bug fixes, and potentially we will spend more time and effort to pull out the NA code. In the future there may be a transition to C++ and eventually a break with the current ABI. Or not.

There are at least two motivations that get folks to write code for open source projects: scratching an itch, and money. Money hasn't been a big part of the Numpy picture so far, so that leaves scratching an itch. One of the attractions of Numpy is that it is a small project, BSD licensed, and not overburdened with governance and process. This makes scratching an itch not as difficult as it would be in a large project. If Numpy remains a small project but acquires the encumbrances of a big project, much of that attraction will be lost. Momentum and direction also attract people, but numpy is stalled at the moment as the whole NA thing circles around once again.

What would I suggest as a way forward with the NA option? Let's take the issues.

1) Adding slots to PyArrayObject_fields.

I don't think this is likely to be a problem unless someone's code passes the struct by value or uses assignment to initialize a statically allocated instance. I'm not saying no one does that; low level scientific code can contain all sorts of bizarre and astonishing constructs, and it is also possible that these sorts of things might turn up in an old FORTRAN program. The question here is whether to allow any changes at all, and I think we will have to in the future. Given that, consistent use of accessors will make later changes to the organization or implementation of the base structure transparent. Numpy itself now uses accessors for the heritage slots, but not for the new NA slots. So I suggest at a minimum adding accessors for maskna_dtype, maskna_data, and maskna_strides. Of course, later removing these slots will still remain a problem.

2) NA.

This breaks down into API and implementation issues. Personally, I think marking the NA stuff experimental leaves room to modify both, and I would prefer to go with what we have and change it into whatever looks best by modification through pull requests. This kicks the can down the road, but not so far that people sufficiently interested in working on the topic can't get modifications in. My own preferences for future API modifications are as follows.

a) All arrays should be implicitly masked, even if the mask isn't initially allocated. The maskna keyword can then be removed, taking with it the sense that there are two kinds of arrays.

b) There needs to be a distinction between missing and ignore. The mechanism for this is already in place in the payload type, although it isn't clear to me that it is uniformly used in all the NA code. There is also a place for missing *and* ignored. Which leads to

c) Sums, etc. should always skip ignored data. If missing data is present, but not ignored, then a sum should return NA. The main danger I see here is that the behavior of arrays becomes state dependent, something that can lead to subtle problems. Explicit request for a particular behavior, as is done now, might be preferable for its clarity.

d) I think views are a good way to add another mask layer to existing arrays.

And for implementation:

a) Ufunc loop support. This is most easily done with explicit masks.

b) Apropos a), I'm coming (again) to the opinion that byte masks are the simplest and most general implementation.

Chuck
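
To make API items b) and c) and the byte-mask suggestion concrete, here is a minimal pure-Python sketch of the proposed reduction semantics. Everything in it is an assumption made for illustration: the names NA, MISSING, IGNORED, and masked_sum are hypothetical, the bit-flag layout is not the actual NumPy payload type, and none of this is existing NumPy API.

```python
# Hypothetical sketch of the proposed semantics; not NumPy code.
NA = object()    # stand-in sentinel for an NA result

MISSING = 0x01   # assumed flag bit: the value is unknown
IGNORED = 0x02   # assumed flag bit: reductions should skip the value


def masked_sum(values, mask):
    """Sum `values` using one mask byte per element.

    Ignored elements are always skipped. An element that is missing
    but not ignored makes the whole result NA.
    """
    total = 0.0
    for value, flags in zip(values, mask):
        if flags & IGNORED:
            continue      # skipped, whether or not it is also missing
        if flags & MISSING:
            return NA     # missing and not ignored: propagate NA
        total += value
    return total


values = [1.0, 2.0, 3.0, 4.0]
plain = masked_sum(values, bytearray([0, 0, 0, 0]))
skip_one = masked_sum(values, bytearray([0, IGNORED, 0, 0]))
has_missing = masked_sum(values, bytearray([0, 0, MISSING, 0]))
missing_and_ignored = masked_sum(values, bytearray([0, 0, MISSING | IGNORED, 0]))
```

With one byte per element there is room for both flags at once, which is what lets a value be missing *and* ignored; in that case the element is skipped and the sum is taken over the rest rather than returning NA.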
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
