Short answer to the subject: Oh yes. Basically, MaskedArray in its current implementation is more of a convenience class than anything else. Most of the functions manipulating masked arrays create a lot of temporaries. When performance matters, I'd advise you to work directly on the data and the mask.
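Something along these lines, as a minimal sketch (x and y are throwaway arrays I made up just for illustration):

import numpy as np
import numpy.ma as ma

x = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
y = ma.array([4.0, 5.0, 6.0], mask=[False, False, True])

# Letting MA do the arithmetic builds several temporaries behind the scenes:
prod_ma = x * y

# Doing the arithmetic on the raw ndarrays and combining the masks yourself
# skips most of that machinery:
prod_fast = ma.array(x.data * y.data, mask=(x.mask | y.mask))

Both give the same masked result here; the second just bypasses the bookkeeping.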
For example, let's examine the division of two MaskedArrays, a and b:

* We take the two ndarrays of data (da and db) and the two ndarrays of mask (ma and mb).
* We create a new array for db using np.where, putting 1 where db == 0 and keeping db otherwise (if we were not doing that, we would get some NaNs down the road).
* We create a new mask m by combining ma and mb.
* We create the result array using np.where, using da where m is True and da/db otherwise (if we were not doing that, we would be processing the masked data, and we may not want that).
* Then we attach the mask to the result array.

I suspect that the np.where calls are sub-optimal, and there might be a smarter way to achieve the same result while keeping all the functionality (no NaNs in the result, even masked ones; data kept where it should be). I agree that this functionality might be a bit overkill in simpler cases, such as yours. You may then want to use something like

>>> ma.masked_array(a.data/b.data, mask=(a.mask | b.mask | (b.data==0)))

Using Eric's example, I get 229 ms/loop when dividing 2 ndarrays, 2.83 s/loop when dividing 2 masked arrays, and down to 493 ms/loop when using the quick-and-dirty function above. So anyway, you'll still be slower using MA than ndarrays, but not as slow... (A sketch of both versions is appended at the end of this message.)

On May 9, 2009, at 5:22 PM, Eli Bressert wrote:

> Hi,
>
> I'm using masked arrays to compute large-scale standard deviation,
> multiplication, gaussian, and weighted averages. At first I thought
> using the masked arrays would be a great way to sidestep looping
> (which it is), but it's still slower than expected. Here's a snippet
> of the code that I'm using it for.
>
> # Computing nearest neighbor distances.
> # Output will be about 270,000 rows long for the index
> # and 270,000x50 for the dist array.
> tree = ann.kd_tree(np.column_stack([l,b]))
> index, dist = tree.search(np.column_stack([l,b]), k=nth)
>
> # Clipping bad values by replacing them with acceptable values
> av[np.where(av <= -10)] = -10
> av[np.where(av >= 50)] = 50
>
> # Distance clipping and creating the mask
> dist_arcsec = np.sqrt(dist)*3600
> mask = dist_arcsec <= d_thresh
>
> # Creating masked arrays
> av_good = ma.array(av[index], mask=mask)
> dist_good = ma.array(dist_arcsec, mask=mask)
>
> # Reason why I'm using masked arrays. If these were
> # ndarrays with nan's, then the output would be nan.
> Std = np.array(np.std(av_good, axis=1))
> Var = Std*Std
>
> Rho = np.zeros((len(av), nth))
> Rho2 = np.zeros((len(av), nth))
>
> dist_std = np.std(dist_good, axis=1)
>
> for j in range(nth):
>     Rho[:,j] = dist_std
>     Rho2[:,j] = Var
>
> # This part takes about 20 seconds to compute for a 270,000x50 masked array.
> # Using ndarrays of the same size takes about 2 seconds.
> spatial_weight = 1.0 / (Rho*np.sqrt(2*np.pi)) * np.exp(-dist_good / (2*Rho**2))
>
> # Like the spatial_weight section, this takes about 20 seconds
> W = spatial_weight / Rho2
>
> # Takes less than one second.
> Ave = np.average(av_good, axis=1, weights=W)
>
> Any ideas on why it would take such a long time for processing?
> Especially the spatial_weight and W variables? Would there be a faster
> way to do this? Or is there a way that numpy.std can ignore nan's
> when processing?
>
> Thanks,
>
> Eli Bressert
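P.S. In case it helps, here is a rough sketch of the two versions discussed above. The function names are mine, and the first one only mimics the sequence of temporaries I described; it is not the actual numpy.ma implementation.

import numpy as np
import numpy.ma as ma

def divide_careful(a, b):
    # Mimics the steps described above: sanitize the divisor, combine the
    # masks, and keep the original data under the mask.
    da, db = a.data, b.data
    mask_a, mask_b = ma.getmaskarray(a), ma.getmaskarray(b)
    safe_db = np.where(db == 0, 1, db)       # no NaNs/infs, even under the mask
    m = mask_a | mask_b | (db == 0)          # combined mask, zero-division spots included
    result = np.where(m, da, da / safe_db)   # keep da where masked, divide elsewhere
    return ma.array(result, mask=m)

def divide_quick(a, b):
    # The quick-and-dirty shortcut quoted above: fewer temporaries, but the
    # masked slots may end up holding infs/NaNs (they are masked, so it
    # usually doesn't matter).
    return ma.masked_array(a.data / b.data, mask=(a.mask | b.mask | (b.data == 0)))

divide_quick is essentially the one-liner quoted earlier and is what got the timing down from 2.83 s to roughly 0.5 s per loop on Eric's example; divide_careful is what buys you the "no NaNs anywhere, data preserved under the mask" guarantees, at the cost of the extra np.where passes.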