[Numpy-discussion] Missing data again

Travis Oliphant Sat, 03 Mar 2012 12:30:55 -0800

Hi all, 

I've been thinking a lot about the masked array implementation lately.   I 
finally had the time to look hard at what has been done, and I now do not 
think that 1.7 can be released with the masked array implementation in its 
current state *unless* it is clearly marked as experimental and subject to 
change in 1.8.


I wish I had been able to be a bigger part of this conversation last year.   
But, that is why I took the steps I took to try and figure out another way to 
feed my family *and* stay involved in the NumPy community.   I would love to 
stay involved in what is happening in the SciPy community, but I am more 
satisfied with what Ralf, Warren, Robert, Pauli, Josef, Charles, Stefan, and 
others are doing there right now, and don't have time to keep up with 
everything.    Even though SciPy was the heart and soul of why I even got 
involved with Python for open source in the first place and took many years of 
my volunteer labor, I won't be able to spend significant time on SciPy code 
over the coming months.   At some point, I really hope to be able to make 
contributions again to that code-base.   Time will tell whether or not my 
aspirations will be realized.  It depends quite a bit on whether or not my kids 
have what they need from me (which right now is money and time). 
 
NumPy, on the other hand, is not in a position where I can feel comfortable 
leaving my "baby" to others.  I recognize and value the contributions from many 
people to make NumPy what it is today (e.g. code contributions, code 
rearrangement and standardization, build and install improvement, and most 
recently some architectural changes).    But, I feel a personal responsibility 
for the code base as I spent a great many months writing NumPy in the first 
place, and I've spent a great deal of time interacting with NumPy users and 
feel like I have at least some sense of their stories.    Of course, I built on 
the shoulders of giants, and much of what is there is *because of* where the 
code was adapted from (it was not created de novo).   Currently, there remains 
much that needs to be communicated, improved, and worked on, and I have 
specific opinions about what some changes and improvements should be, how they 
should be written, and how users should benefit from them.   It will take time 
to discuss all of this, and that's where I will spend my open-source time in 
the coming months. 

In that vein: 

Because it is slated to go into release 1.7, we need to revisit the masked 
array discussion.   The NEP process is the appropriate one and I'm glad 
we are taking that route for these discussions.   My goal is to get consensus 
in order for code to get into NumPy (regardless of who writes the code).    It 
may be that we don't come to a consensus (reasonable and intelligent people can 
disagree on things --- look at the coming election...).   We can represent 
different parts of what is fortunately a very large NumPy user base.   
 

First of all, I want to be clear that I think there is much great work that has 
been done in the current missing data code.  There are some nice features in 
the where clause of the ufunc and the machinery for the iterator that allows 
re-using ufunc loops that are not re-written to check for missing data.   I'm 
sure there are other things as well that I'm not quite aware of yet.    
However, I don't think the API currently presented to the NumPy user is the 
correct one for NumPy 1.X.   
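
To make the "where clause of the ufunc" concrete, here is a minimal sketch. 
It assumes a present-day NumPy release in which ufuncs and their reductions 
accept a `where=` argument; it is not the 1.7 development API discussed in 
this thread. The array names are made up for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
# Hypothetical validity mask: False marks an entry treated as missing.
valid = np.array([True, False, True, True])

# Reduction restricted to the valid elements only.
total = np.add.reduce(values, where=valid)

# Elementwise ufunc call: positions where `valid` is False are not computed
# and keep whatever was already stored in `out` (zeros here).
out = np.zeros_like(values)
np.multiply(values, 2.0, out=out, where=valid)

print(total)
print(out)
```

The inner loop itself knows nothing about missing data; the selection of 
elements is handled outside it, which is the re-use of existing ufunc loops 
mentioned above.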

A few particulars: 

        * the reduction operations need to default to "skipna"; this is the 
          most common use case, which was reinforced to me again today by a 
          new Python user who is currently using masked arrays 

        * the mask needs to be visible to the user if they use that approach 
          to missing data (people should be able to get hold of the mask and 
          work with it in Python)

        * bit-pattern approaches to missing data (at least for float64 and 
          int32) need to be implemented 

        * when using "masks" for missing data (even if the mask is hidden 
          from most users), there should be some way to separate the 
          low-level ufunc operation from the operation on the masks
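
The four points above can be illustrated with the existing `numpy.ma` module 
and NaN-aware functions. This is only a sketch under stated assumptions: it 
uses APIs available in current NumPy, not the missing-data implementation 
under discussion for 1.7, and it uses NaN as a stand-in for a float64 
bit-pattern sentinel. Variable names are hypothetical.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])
masked = np.ma.masked_array(data, mask=[False, True, False, False])

# 1. "skipna" by default: numpy.ma reductions ignore masked entries.
ma_sum = masked.sum()                      # sums 1, 3 and 4

# 2. Visible mask: the mask is an ordinary boolean array usable from Python.
mask = np.ma.getmaskarray(masked)
n_missing = int(mask.sum())

# 3. Bit-pattern style (assumption: NaN serves as the float64 sentinel).
with_nan = np.where(mask, np.nan, data)
propagating = np.sum(with_nan)             # NaN propagates through the sum
skipping = np.nansum(with_nan)             # NaN entries are skipped

# 4. Separating the data operation from the mask operation:
#    run the ufunc on the raw buffers, combine the masks independently.
other = np.ma.masked_array([10.0, 20.0, 30.0, 40.0],
                           mask=[False, False, True, False])
raw = np.add(masked.data, other.data)
combined_mask = np.logical_or(mask, np.ma.getmaskarray(other))
result = np.ma.masked_array(raw, mask=combined_mask)

print(ma_sum, n_missing, propagating, skipping)
print(result)
```

Note that an integer type such as int32 has no NaN, so a bit-pattern 
approach there would need a reserved value; that case is not shown here.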

I have heard from several users that they will *not use the missing data* in 
NumPy as currently implemented, and I can now see why.    For better or for 
worse, my approach to software is generally very user-driven and very 
pragmatic.  On the other hand, I'm also a mathematician and appreciate the 
cognitive compression that can come out of well-formed structure.    
Nonetheless, I'm an *applied* mathematician and am ultimately motivated by 
applications.

I will get a hold of the NEP and spend some time with it to discuss some of 
this in that document.   This will take several weeks (as PyCon is next week 
and I have a tutorial I'm giving there).    For now, I do not think 1.7 can be 
released unless the masked array is labeled *experimental*.  

Thanks,

-Travis












