[ https://issues.apache.org/jira/browse/CLIMATE-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736280#comment-13736280 ]

Alex Goodman commented on CLIMATE-248:
--------------------------------------

Some bad news. I went ahead and tested a few of these changes, in the order 
I mentioned them:

The first change made almost no difference in the running time of the tests, 
maybe about 1 second or so. The second change saved about 5 seconds but 
unfortunately doesn't work correctly. I misunderstood how 
create_mask_using_threshold works, i.e., the shape of the mask it returns 
doesn't match the shape of the data array. That's why pre-allocating the 
array was necessary.
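
A minimal sketch of the shape issue, with hypothetical array sizes (assuming 
create_mask_using_threshold returns a single 2D lat/lon mask per time unit, 
which is consistent with what I saw):
{code}
import numpy as np

# Hypothetical sizes: 730 daily steps on a 1x1 degree global grid.
data = np.zeros((730, 180, 360))

# Stand-in for what process.create_mask_using_threshold returns for one
# time unit: a 2D (lat, lon) mask, one value per pixel.
mask2d = np.zeros((180, 360))

# Binding the 2D mask directly gives an array whose shape no longer matches
# data, which breaks the masking and storing steps further down:
datamask_at_this_timeunit = mask2d                # shape (180, 360)

# That is why the original code pre-allocates the full-size array and
# broadcast-assigns into it instead:
datamask_at_this_timeunit = np.zeros_like(data)   # shape (730, 180, 360)
datamask_at_this_timeunit[:] = mask2d             # broadcast across time axis
{code}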

(Keep in mind that test_dataset_processor.py takes about 30 seconds to run on 
my laptop without any changes.)

So we can now say that the worst-case scenario I mentioned is most likely the 
source of the bottleneck.
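
To make the cost concrete, here is a rough sketch (hypothetical sizes 
matching the test grid) of how the allocations in the suspect block add up: 
every pass through the loop allocates and zeroes two arrays the size of the 
entire dataset, even though only a handful of time steps belong to the 
current unit.
{code}
import numpy as np

nt, nlat, nlon = 730, 180, 360            # 2 years daily on a 1x1 degree grid
data = np.zeros((nt, nlat, nlon))         # ~380 MB at float64
timeunits = np.repeat(np.arange(24), nt // 24 + 1)[:nt]  # stand-in unit labels

for myunit in np.unique(timeunits):
    # Two full-size allocations per iteration, although only ~30 of the
    # 730 time steps actually match myunit:
    mask = np.zeros_like(data)
    mask[timeunits != myunit, :, :] = 1.0
    datamask_at_this_timeunit = np.zeros_like(data)
    # ... the real code then masks and averages over all nt time steps
{code}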
                
> PERFORMANCE - Rebinning Daily to Monthly datasets takes a really long time
> --------------------------------------------------------------------------
>
>                 Key: CLIMATE-248
>                 URL: https://issues.apache.org/jira/browse/CLIMATE-248
>             Project: Apache Open Climate Workbench
>          Issue Type: Improvement
>          Components: regridding
>    Affects Versions: 0.1-incubating, 0.2-incubating
>         Environment: *nix
>            Reporter: Cameron Goodale
>            Assignee: Cameron Goodale
>              Labels: performance
>             Fix For: 0.3-incubating
>
>
> When I was testing the dataset_processor module I noticed that most tests 
> completed in less than 1 second.  Then I came across the 
> "test_daily_to_monthly_rebin" test, which took over 2 minutes to complete.
> The test initially used a 1x1 degree grid covering the globe and a daily 
> time step for 2 years (730 days).
> I ran some initial checks, and the lag appears to be in the code where the 
> data is rebinned, in '_rcmes_calc_average_on_new_time_unit_K'.
> {code}
>                 mask = np.zeros_like(data)
>                 mask[timeunits!=myunit,:,:] = 1.0
>                 # Calculate missing data mask within each time unit...
>                 datamask_at_this_timeunit = np.zeros_like(data)
>                 datamask_at_this_timeunit[:] = process.create_mask_using_threshold(data[timeunits==myunit,:,:], threshold=0.75)
>                 # Store results for masking later
>                 datamask_store.append(datamask_at_this_timeunit[0])
>                 # Calculate means for each pixel in this time unit, ignoring missing data (using masked array).
>                 datam = ma.masked_array(data, np.logical_or(mask, datamask_at_this_timeunit))
>                 meanstore[i,:,:] = ma.average(datam, axis=0)
> {code}
> That block is the prime suspect, since the rest of the code just does simple 
> string parsing and appends to lists.  I don't have time to do a deep dive 
> into this now, and technically it isn't broken; it is just really slow.
