I think that with your comments in mind, it may just be best to embrace duck typing, like Matthew suggested. I propose the following workflow:
- __array_concatenate__ and similar "protocol" functions return
  NotImplemented if they won't work.
- "Base functions" that can be called directly, like __getitem__,
  raise NotImplementedError if they won't work.
- Duck arrays declare themselves with __arrayish__ = True.

Then, something like np.concatenate would do the following:

- Call __array_concatenate__ following the same order as ufunc
  arguments.
- If everything fails, raise NotImplementedError (or convert
  everything to ndarray).

Overloaded functions would do something like this (perhaps a simple
decorator will do for the repetitive work?):

- Try with np.arrayish.
- Catch NotImplementedError.
- Try with np.array.

Then, we use abstract classes just to overload functionality or to
implement some things in terms of others. If something fails, we have
a decent fallback, and we don't need to do anything special in order
to "check" functionality.

Feel free to propose changes, but this is the best I could come up
with that requires the smallest incremental changes to NumPy while
also supporting everything right from the start.
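Concretely, here is a rough, untested sketch of the kind of thing I
mean. Note that asarrayish (standing in for np.arrayish above), the
decorator, and all the signatures are placeholder names for this
mail, not existing NumPy API:

    import functools
    import numpy as np

    def asarrayish(obj):
        # Pass through anything that declares __arrayish__ = True;
        # otherwise coerce with plain np.asarray.
        if getattr(obj, '__arrayish__', False):
            return obj
        return np.asarray(obj)

    def concatenate(arrays, axis=0):
        # Give each argument's __array_concatenate__ a chance, in
        # the same order that ufunc arguments are tried. A protocol
        # function returns NotImplemented if it can't handle the
        # given types.
        for arr in arrays:
            impl = getattr(type(arr), '__array_concatenate__', None)
            if impl is None:
                continue
            result = impl(arr, arrays, axis=axis)
            if result is not NotImplemented:
                return result
        # Everything failed: raise NotImplementedError (or,
        # alternatively, convert everything to ndarray).
        raise NotImplementedError(
            "no __array_concatenate__ implementation accepted "
            "these arguments")

    def with_ndarray_fallback(func):
        # The "simple decorator for the repetitive work": try the
        # arrayish path first, and retry with a real ndarray if a
        # base function raises NotImplementedError.
        @functools.wraps(func)
        def wrapper(arr, *args, **kwargs):
            try:
                return func(asarrayish(arr), *args, **kwargs)
            except NotImplementedError:
                return func(np.asarray(arr), *args, **kwargs)
        return wrapper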
On Thu, Mar 22, 2018 at 9:14 AM, Nathaniel Smith <n...@pobox.com> wrote:
> On Sat, Mar 10, 2018 at 4:27 AM, Matthew Rocklin <mrock...@gmail.com>
> wrote:
> > I'm very glad to see this discussion.
> >
> > I think that coming up with a single definition of array-like may
> > be difficult, and that we might end up wanting to embrace duck
> > typing instead.
> >
> > It seems to me that different array-like classes will implement
> > different mixtures of features. It may be difficult to pin down a
> > single definition that includes anything except for the most basic
> > attributes (shape and dtype?). Consider two extreme cases of
> > restrictive functionality:
> >
> > LinearOperators (support dot in a numpy-like way)
> > Storage objects like h5py (support getitem in a numpy-like way)
> >
> > I can imagine authors of both groups saying that they should
> > qualify as array-like because downstream projects that consume
> > them should not convert them to numpy arrays in important
> > contexts.
>
> I think this is an important point -- there are a lot of subtleties
> in the interfaces that different objects might want to provide. Some
> interesting ones that haven't been mentioned:
>
> - a "duck array" that has everything except fancy indexing
> - xarray's arrays, which are just like numpy arrays in most ways but
>   have incompatible broadcasting semantics
> - immutable vs. mutable arrays
>
> When faced with this kind of situation, it's always tempting to try
> to write down some classification system that captures every
> possible configuration of interesting behavior. In fact, this is one
> of the most classic nerd snipes; it's been catching people for
> literally thousands of years [1]. Most of these attempts fail,
> though :-).
>
> So let's back up -- I probably erred in not making this more clear
> in the NEP, but I actually have a fairly concrete use case in mind
> here. What happened is, I started working on a NEP for
> __array_concatenate__, and my thought pattern went as follows:
>
> 1) Cool, this should work for np.concatenate.
> 2) But what about all the other variants, like np.row_stack? We
>    don't want __array_row_stack__; we want to express row_stack in
>    terms of concatenate.
> 3) Ok, what's row_stack? It's:
>    np.concatenate([np.atleast_2d(arr) for arr in arrs], axis=0)
> 4) So I need to make atleast_2d work on duck arrays. What's
>    atleast_2d? It's: asarray + some shape checks and indexing with
>    newaxis.
> 5) Okay, so I need something atleast_2d can call instead of
>    asarray [2].
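(This step is exactly where the fallback machinery above would plug
in. For instance, a duck-friendly atleast_2d might look like the
following, reusing the hypothetical asarrayish helper from my sketch
above in place of asarray/asanyarray:

    def atleast_2d(*arrs):
        # Same shape logic as today's atleast_2d, but pass duck
        # arrays through instead of coercing them to ndarray.
        results = []
        for arr in arrs:
            arr = asarrayish(arr)
            if arr.ndim == 0:
                arr = arr.reshape(1, 1)
            elif arr.ndim == 1:
                arr = arr[np.newaxis, :]
            results.append(arr)
        return results[0] if len(results) == 1 else results

and then row_stack falls out for free as
concatenate([atleast_2d(a) for a in arrs], axis=0).)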
> And this kind of pattern shows up everywhere inside numpy: e.g.,
> it's the first thing inside lots of functions in np.linalg, because
> they do some futzing with dtypes and shape before delegating to
> ufuncs; it's the first thing the mean() function does, because it
> needs to check arr.dtype before proceeding; etc.
>
> So, we need something we can use in these functions as a first step
> towards unlocking the use of duck arrays in general. But we can't
> realistically go through each of these functions, make an exact list
> of all the operations/attributes it cares about, and then come up
> with exactly the right type constraint for it to impose at the top.
> And these functions aren't generally going to work on
> LinearOperators or h5py datasets anyway.
>
> We also don't want to go through every function in numpy and add new
> arguments to control this coercion behavior.
>
> What we can do, at least to start, is to have a mechanism that
> passes through objects that aspire to be "complete" duck arrays,
> like dask arrays or sparse arrays or astropy's unit arrays, and
> then, if it turns out that in practice people find uses for
> finer-grained distinctions, we can iteratively add those as a second
> pass. Notice that if a function starts out requiring a "complete"
> duck array, and then later relaxes that to accept "partial" duck
> arrays, that's actually increasing the domain of objects that it can
> act on, so it's a backwards-compatible change that we can make
> later.
>
> So I think we should start out with a concept of "duck array" that's
> fairly strong but a bit vague on the exact details (e.g.,
> dask.array.Array is currently missing some weird things like
> arr.ptp() and arr.tolist(), I guess because no one has ever noticed
> or cared?).
>
> ------------
>
> Thinking things through like this, I also realized that this
> proposal jumps through hoops to avoid changing np.asarray itself,
> because I was nervous about changing the rule that its output is
> always an ndarray... but actually, this is currently the rule for
> most functions in numpy, and the whole point of this proposal is to
> relax that rule for most functions, in cases where the user is
> explicitly passing in a duck-array object. So maybe I'm being
> over-paranoid? I'm genuinely unsure here.
>
> Instead of messing about with ABCs, an alternative mechanism would
> be to add a new method __arrayish__ (hat tip to Tom Caswell for the
> name :-)) that essentially acts as an override for Python-level
> calls to np.array / np.asarray, in much the same way that
> __array_ufunc__ overrides ufuncs, etc. (C-level calls to
> PyArray_FromAny and similar would of course continue to return
> ndarray objects, and I assume we'd add some argument like
> require_ndarray= that you could pass to explicitly indicate whether
> you needed C-level compatibility.)
>
> This would also allow objects like h5py datasets to *produce* an
> arrayish object on demand, even if they aren't one themselves.
> (E.g., imagine some hdf5-like storage that holds sparse arrays
> instead of regular arrays.)
>
> I'm thinking I may write this option up as a second NEP, to compete
> with my first one.
>
> -n
>
> [1] See: https://www.wiley.com/en-us/The+Search+for+the+Perfect+Language-p-9780631205104
> [2] Actually, atleast_2d calls asanyarray, not asarray, but that's
>     just a detail; the way to solve this problem for asanyarray is
>     to first solve it for asarray.
>
> --
> Nathaniel J. Smith -- https://vorpus.org
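For completeness, the __arrayish__-as-a-method variant Nathaniel
describes would slot into the same picture. Here is a hypothetical
sketch, with require_ndarray= spelled the way he suggests (again,
none of this is existing NumPy API):

    def asarray(obj, require_ndarray=False):
        # Python-level asarray honors the override; callers that
        # need C-level compatibility pass require_ndarray=True and
        # always get a real ndarray back.
        if not require_ndarray and hasattr(obj, '__arrayish__'):
            # E.g., an h5py-like dataset could *produce* an
            # arrayish object on demand without being one itself.
            return obj.__arrayish__()
        return np.asarray(obj)

The main difference from my flag-based asarrayish above is a method
call vs. an attribute check; presumably we'd want to settle on just
one spelling.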