On Wed, Jun 4, 2014 at 7:18 AM, Travis Oliphant <[email protected]> wrote:

> Even relatively simple changes can have significant impact at this point.
> Nathaniel has laid out a fantastic list of great features. These are the
> kind of features I have been eager to see as well. This is why I have been
> working to fund and help explore these ideas in the Numba array object as
> well as in Blaze. Gnumpy, Theano, Pandas, and other projects also have
> useful tales to tell regarding a potential NumPy 2.0.
I think this is somewhat missing the main point of my message :-). I was specifically laying out a list of features that we could start working on *right now*, *without* waiting for the mythical "numpy 2.0".

> Ultimately, I do think it is time to talk seriously about NumPy 2.0, and
> what it might look like. I personally think it looks a lot more like a
> re-write, than a continuation of the modifications of Numeric that became
> NumPy 1.0. Right out of the gate, for example, I would make sure that
> NumPy 2.0 objects somehow used PyObject_VAR_HEAD so that they were
> variable-sized objects where the strides and dimension information was
> stored directly in the object structure itself instead of allocated
> separately (thus requiring additional loads and stores from memory). This
> would be a relatively simple change. But, it can't be done and preserve ABI
> compatibility. It may also, at this point, have impact on Cython code, or
> other code that is deeply-aware of the NumPy code-structure. Some of the
> changes that should be made will ultimately require a porting exercise for
> new code --- at which point why not just use a new project.

I'm not aware of any obstacles to packing strides/dimension/data into the ndarray object right now, tomorrow if you like -- we've even discussed doing this recently in the tracker.

PyObject_VAR_HEAD in particular seems... irrelevant? All it is is syntactic sugar for adding an integer field called "ob_size" to a Python object struct, plus a few macros for working with this field. We don't need or want such a field anyway (for shape/strides it would be redundant with ndim), and even if we did want such a field we could add it any time without breaking ABI.

And if someday we do discover some compelling advantage to breaking ABI by rearranging the ndarray struct, then we can do this with a bit of planning by using #ifdef's to make the rearrangement coincide with a new Python release. E.g., people building against python 3.5 get the new struct layout, people building against 3.4 get the old, and in a few years we drop support for the old. No compatibility breaks needed, never mind rewrites.
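Just to spell out what PyObject_VAR_HEAD actually amounts to, here's a toy struct using it (this is only paraphrasing CPython's own definition, modulo version details):

    #include <Python.h>   /* PyObject_VAR_HEAD, Py_ssize_t, Py_SIZE */

    /* PyObject_VAR_HEAD boils down to roughly
     *     PyObject_HEAD
     *     Py_ssize_t ob_size;
     * i.e. the standard object header plus one integer, which Py_SIZE()
     * reads back.  Nothing about it stores shape or strides for you. */
    typedef struct {
        PyObject_VAR_HEAD      /* refcount + type pointer + ob_size */
        double payload[1];     /* variable-length tail, sized at allocation */
    } ToyVarObject;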
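And the kind of version-gated rearrangement I mean would look roughly like the sketch below. To be clear, this is NOT numpy's actual PyArrayObject -- the struct and its fields are invented for illustration (real numpy uses npy_intp rather than Py_ssize_t); it just shows the #ifdef mechanism:

    #include <Python.h>   /* PyObject_HEAD, PY_VERSION_HEX, Py_ssize_t */

    typedef struct {
        PyObject_HEAD
        char *data;                  /* pointer to the array's buffer */
        int nd;                      /* number of dimensions */
    #if PY_VERSION_HEX >= 0x03050000 /* building against Python >= 3.5 */
        /* "new" layout: shape and strides stored inline, so the whole
         * object is one variable-sized allocation (2*nd entries follow) */
        Py_ssize_t dims_and_strides[1];
    #else
        /* "old" layout: shape and strides in a separately malloc'd block */
        Py_ssize_t *dimensions;
        Py_ssize_t *strides;
    #endif
    } ToyArrayObject;

Extensions would simply pick up whichever layout matches the Python headers they build against, so the switch rides along with a recompile boundary that already exists instead of needing its own flag day.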
More generally: I wouldn't rule out "numpy 2.0" entirely, but we need to remember the immense costs that a rewrite-and-replace strategy will incur. Writing a new library is very expensive, so that's one cost. But that cost is nothing compared to the costs of getting that new library to the same level of maturity that numpy has already reached. And those costs, in turn, are absolutely dwarfed by the transition costs of moving the whole ecosystem from one foundation to a different, incompatible one. And probably even these costs are small compared to the opportunity costs -- all the progress that *doesn't* get made in the meantime because fragmented ecosystems suck and make writing code hard, and the best hackers are busy porting code instead of writing awesome new stuff.

I'm sure dynd is great, but we have to be realistic: the hard truth is that even if it's production-ready today, that only brings us a fraction of a fraction of a percent closer to making it a real replacement for numpy.

Consider the python 2 to python 3 transition: Python 3 itself was an immense amount of work for a large number of people, with intense community scrutiny of the design. It came out in 2008. Six years and many, many improvements later, it's maybe sort-of starting to look like a plurality of users might start transitioning soonish? It'll be years yet before portable libraries can start taking advantage of python 3's new awesomeness. And in the meantime, the progress of the whole Python ecosystem has been seriously disrupted: think of how much awesome stuff we'd have if all the time that's been spent porting and testing different packages had been spent on moving them forward instead.

We also have experience closer to home -- did anyone enjoy the numeric/numarray->numpy transition so much that they want to do it again? And numpy will be much harder to replace than numeric -- numeric wasn't the most-imported package in the pythonverse ;-).

And my biggest worry is that if anyone even tries to convince everyone to make this kind of transition, then, if they're successful at all, they'll create a substantial period where the ecosystem is a big incompatible mess (and they might still eventually fail, providing no long-term benefit to make up for the immediate costs). This scenario is a nightmare for end-users all around.

By comparison, if we improve numpy incrementally, then we can in most cases preserve compatibility totally, and in the rare cases where it's necessary to break something we can do it mindfully, minimally, and with a managed transition. (Downstream packages are already used to handling a few limited API changes at a time, it's not that hard to support both APIs during the transition period, etc., so this way we bring the ecosystem with us.) Every incremental improvement to numpy immediately benefits its immense user base, and gets feedback and testing from that immense user base. And if we incrementally improve interoperability between numpy and other libraries like dynd, then instead of creating fragmentation, it will let downstream packages use both in a complementary way, switching back and forth depending on which provides more utility on a case-by-case basis. If this means that numpy eventually withers away because users vote with their feet, then great, that'd be compelling evidence that whatever they were migrating to really is better, which I trust a lot more than any guesses we make on a mailing list.

The gradual approach does require that we be grown-ups and hold our noses while refactoring out legacy spaghetti and writing unaesthetic compatibility hacks. But if you compare this to the alternative... the benefits of incrementalism are, IMO, overwhelming.

The only exception is when two specific criteria are met: (1) there are changes that are absolutely necessary for the ecosystem's long-term health (e.g., py3's unicode-for-mere-mortals and true division), AND (2) it's absolutely impossible to make these changes incrementally (unicode and true division first entered Python in 2000 and 2001, respectively, and immense effort went into finding the smoothest transition, so it's pretty clear that as painful as py3 has been, there isn't really anything better).

What features could meet these two criteria in numpy's case? If I were the numpy ecosystem and you tried to convince me to suffer through a big-bang transition for the sake of PyObject_VAR_HEAD, then I think I'd be kinda unconvinced. And it only took me a few minutes to rattle off a whole list of incremental changes that haven't even been tried yet.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
