[
https://issues.apache.org/jira/browse/CLIMATE-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736280#comment-13736280
]
Alex Goodman commented on CLIMATE-248:
--------------------------------------
Some bad news. I actually went and tested a few of these changes, in the order that
I mentioned them:
The first change made almost no difference in the running time of the
tests, maybe about 1 second or so. The second change saved about 5 seconds but
unfortunately doesn't work correctly. I misunderstood how
create_mask_using_threshold works, i.e. the shape of its output doesn't match up
with the full data array. That's why pre-allocating the array was necessary.
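To spell out what the pre-allocation buys, here is a minimal standalone sketch.
The shapes are made up to mirror the test, and I am assuming
create_mask_using_threshold returns a single 2-D (lat, lon) mask; neither detail
is taken from the module itself.
{code}
import numpy as np

# Made-up shapes mirroring the test: 730 daily steps on a 1x1 degree grid.
data = np.random.rand(730, 180, 360)

# Stand-in for the create_mask_using_threshold output, assumed 2-D here.
mask2d = np.random.rand(180, 360) > 0.75

# Assigning into a pre-allocated full-size array with [:] broadcasts the
# 2-D mask across every time step, so the result lines up with data:
datamask = np.zeros_like(data)
datamask[:] = mask2d
assert datamask.shape == data.shape  # (730, 180, 360)
{code}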
(Keep in mind that test_dataset_processor.py takes about 30 seconds to run on my
laptop without any changes.)
So we can now say that the worst-case scenario I mentioned is most likely the
source of the bottleneck.
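To make that worst case concrete, here is a rough timing sketch of the same loop
structure. The fake timeunits labels are mine, and the threshold-mask step is
stubbed out with zeros, so this only approximates the real routine:
{code}
import time
import numpy as np
import numpy.ma as ma

# Same array sizes as the test: 730 daily steps on a 1x1 degree grid.
data = np.random.rand(730, 180, 360)
# Fake month labels (24 months over 2 years), standing in for timeunits.
timeunits = np.repeat(np.arange(24), 31)[:730]

units = np.unique(timeunits)
meanstore = np.zeros((len(units),) + data.shape[1:])

start = time.time()
for i, myunit in enumerate(units):
    # Two full-size temporaries are allocated on every pass.
    mask = np.zeros_like(data)
    mask[timeunits != myunit, :, :] = 1.0
    datamask = np.zeros_like(data)  # threshold mask stubbed out
    # The masked average then reduces over all 730 steps, even though
    # only ~30 of them belong to this time unit.
    datam = ma.masked_array(data, np.logical_or(mask, datamask))
    meanstore[i, :, :] = ma.average(datam, axis=0)
print('masked-loop structure: %.1f s' % (time.time() - start))
{code}
If this is the hot spot, averaging just the slice data[timeunits == myunit] would
avoid both temporaries and the full-length reduction, but I haven't verified that
against the actual function.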
> PERFORMANCE - Rebinning Daily to Monthly datasets takes a really long time
> --------------------------------------------------------------------------
>
> Key: CLIMATE-248
> URL: https://issues.apache.org/jira/browse/CLIMATE-248
> Project: Apache Open Climate Workbench
> Issue Type: Improvement
> Components: regridding
> Affects Versions: 0.1-incubating, 0.2-incubating
> Environment: *nix
> Reporter: Cameron Goodale
> Assignee: Cameron Goodale
> Labels: performance
> Fix For: 0.3-incubating
>
>
> When I was testing the dataset_processor module I noticed that most tests
> would complete in less than 1 second. Then I came across the
> "test_daily_to_monthly_rebin" test and it would take over 2 minutes to
> complete.
> The test initially used a 1x1 degree grid covering the globe and daily time
> step for 2 years (730 days).
> I ran some initial checks and the lag appears to be in the code where the
> data is rebinned, inside '_rcmes_calc_average_on_new_time_unit_K'.
> {code}
> mask = np.zeros_like(data)
> mask[timeunits!=myunit,:,:] = 1.0
> # Calculate missing data mask within each time unit...
> datamask_at_this_timeunit = np.zeros_like(data)
> datamask_at_this_timeunit[:] = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)
> # Store results for masking later
> datamask_store.append(datamask_at_this_timeunit[0])
> # Calculate means for each pixel in this time unit, ignoring missing data (using masked array).
> datam = ma.masked_array(data,np.logical_or(mask,datamask_at_this_timeunit))
> meanstore[i,:,:] = ma.average(datam,axis=0)
> {code}
> That block is suspect, since the rest of the code is doing simple string
> parsing and appending to lists. I don't have time to do a deep dive into
> this now, and it technically isn't broken, just really slow.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira