[ https://issues.apache.org/jira/browse/CLIMATE-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743443#comment-13743443 ]

Alex Goodman commented on CLIMATE-248:
--------------------------------------

Hi Cam,

Yes indeed. I have already tried to come up with an alternative approach to 
averaging a masked array in Python, but I'm afraid I can't think of a faster 
method without resorting to a C or Fortran extension, which, as I said before, 
is probably overkill for this issue alone.

However, I have discovered a way to fix another source of the bottleneck: the 
use of fancy indexing in this line:

{code}
datamask_at_this_timeunit[:] = process.create_mask_using_threshold(data[timeunits==myunit,:,:], threshold=0.75)
{code}
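
Just to illustrate the overhead difference, here is a rough, made-up comparison; 
the array shape and time-unit layout below are only illustrative and are not 
taken from the test itself:

{code}
import numpy as np
import timeit

# Illustrative daily 1x1-degree global array: 730 days x 180 lat x 360 lon,
# with an integer "time unit" label per day (roughly one label per month).
data = np.random.rand(730, 180, 360)
timeunits = np.repeat(np.arange(24), 730 // 24 + 1)[:730]

fancy = lambda: data[timeunits == 5, :, :]   # boolean (fancy) indexing: builds a mask and copies
sliced = lambda: data[150:181, :, :]         # plain slice: returns a view, no copy

print("fancy indexing:", timeit.timeit(fancy, number=100))
print("slice indexing:", timeit.timeit(sliced, number=100))
{code}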

My solution relies on standard slice indexing, which has much less overhead 
than fancy boolean indexing. The trick is to note that the first day of a new 
month always has day == 1 on its Python datetime object. We find the indices 
in the time array where this occurs, use them to split the data array into 
chunks corresponding to each month of data, and then taking the average is 
simply a matter of averaging each individual chunk.
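
A minimal sketch of the idea looks like this (the function and variable names 
here are just for illustration; the attached test_monthly_rebin.py is the 
version I actually timed):

{code}
import numpy.ma as ma

def monthly_average(data, dates):
    """Average a (time, lat, lon) masked array into monthly bins using
    plain slice indexing.  `dates` is a chronologically ordered sequence
    of daily datetime objects, one per time step."""
    # A new month starts wherever the datetime's day-of-month is 1.
    bounds = [i for i, d in enumerate(dates) if d.day == 1]
    if not bounds or bounds[0] != 0:
        bounds.insert(0, 0)          # handle a partial first month
    bounds.append(len(dates))

    # Average each contiguous monthly chunk; slices are views, so no
    # boolean-mask copies are made.
    nmonths = len(bounds) - 1
    means = ma.zeros((nmonths,) + data.shape[1:])
    for k, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:])):
        means[k] = ma.average(data[start:stop], axis=0)
    return means
{code}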

I attached the code I used to test this as test_monthly_rebin.py. In my own 
testing, the average speed-up from this method was about 3x, and it could be 
even faster (5x or 6x) if the arrays are not masked. Try incorporating 
something like this back into the original function and see if that helps.
                
> PERFORMANCE - Rebinning Daily to Monthly datasets takes a really long time
> --------------------------------------------------------------------------
>
>                 Key: CLIMATE-248
>                 URL: https://issues.apache.org/jira/browse/CLIMATE-248
>             Project: Apache Open Climate Workbench
>          Issue Type: Improvement
>          Components: regridding
>    Affects Versions: 0.1-incubating, 0.2-incubating
>         Environment: *nix
>            Reporter: Cameron Goodale
>            Assignee: Cameron Goodale
>              Labels: performance
>             Fix For: 0.3-incubating
>
>         Attachments: inital_profile.txt, test_monthly_rebin.py, test.py
>
>
> When I was testing the dataset_processor module I noticed that most tests 
> would complete in less than 1 second.  Then I came across the 
> "test_daily_to_monthly_rebin" test and it would take over 2 minutes to 
> complete.
> The test initially used a 1x1 degree grid covering the globe and daily time 
> step for 2 years (730 days).
> I ran some initial checks and the lag appears to be in the code where the 
> data is rebinned, in '_rcmes_calc_average_on_new_time_unit_K'.
> {code}
>                 mask = np.zeros_like(data)
>                 mask[timeunits!=myunit,:,:] = 1.0
>                 # Calculate missing data mask within each time unit...
>                 datamask_at_this_timeunit = np.zeros_like(data)
>                 datamask_at_this_timeunit[:] = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)
>                 # Store results for masking later
>                 datamask_store.append(datamask_at_this_timeunit[0])
>                 # Calculate means for each pixel in this time unit, ignoring missing data (using masked array).
>                 datam = ma.masked_array(data,np.logical_or(mask,datamask_at_this_timeunit))
>                 meanstore[i,:,:] = ma.average(datam,axis=0)
> {code}
> That block is suspect since the rest of the code is doing simple string 
> parsing and appending to lists.  I don't have the time to do a deep dive into 
> this right now, and it technically isn't broken, just really slow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
