[
https://issues.apache.org/jira/browse/CLIMATE-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733654#comment-13733654
]
Alex Goodman edited comment on CLIMATE-248 at 8/8/13 4:33 PM:
--------------------------------------------------------------
Cam,
Here are a few quick improvements I can think of:
First, there are many places where the code calls:
np.unique(timeunits)
You should make this call only once, store the result in a variable, and then reuse that variable wherever it is needed.
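Something along these lines (unique_timeunits is just an illustrative name, not from the existing code):
{code}
# Compute the unique time units once and keep the result around.
unique_timeunits = np.unique(timeunits)

# Reuse the stored result everywhere np.unique(timeunits) used to be called, e.g.:
meanstore = np.zeros([len(unique_timeunits), data.shape[1], data.shape[2]])
{code}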
Also, remove this line:
datamask_at_this_timeunit = np.zeros_like(data)
And change the following line to:
datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)
A key thing to know about numpy is that constructing new arrays is very expensive, so we should avoid it whenever it is not necessary. Since process.create_mask_using_threshold() builds a new array anyway, there is no need to allocate an array of the same shape in advance. Likewise, np.unique() allocates a new array each time it is called, so calling it only once should also save you some time.
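In other words, the change amounts to this (a sketch using the same variable names as the existing function):
{code}
# Before: an extra full-size array is allocated, then filled in place
datamask_at_this_timeunit = np.zeros_like(data)
datamask_at_this_timeunit[:] = process.create_mask_using_threshold(data[timeunits==myunit,:,:], threshold=0.75)

# After: just bind the array that create_mask_using_threshold() already returns
datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:], threshold=0.75)
{code}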
I don't expect these changes to give you much of a performance improvement, maybe about 30% at most. I suspect the real bottleneck is that the rebinning is done in a loop rather than in a single vectorized numpy operation. Python loops are inherently slow, but there are other consequences at play too. One is that the output array has to be pre-allocated (line 316):
meanstore = np.zeros([len(np.unique(timeunits)),data.shape[1],data.shape[2]])
I have already explained why that allocation slows things down. There is also this line inside the loop:
datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)
and specifically the part where the boolean-indexed array is created: data[timeunits==myunit,:,:]. This is an example of what numpy calls fancy indexing; the distinction is that such a slice is a copy of the data, whereas regular slice indexing (e.g. data[3:4]) only returns a view of the original array.
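A quick standalone illustration of the copy-versus-view difference (not taken from the workbench code):
{code}
import numpy as np

a = np.arange(10)

view = a[3:5]       # basic slicing: a view onto the same memory
view[0] = 99
print(a[3])         # 99 -- the original array changed

copy = a[a > 5]     # boolean (fancy) indexing: a brand new array
copy[0] = -1
print(a[6])         # 6  -- the original array is untouched
{code}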
Unfortunately, fixing this is not trivial because of the inherent nature of daily data: the months of the year do not all have the same number of days. As a result it is not possible to simply reshape the original data and then apply one vectorized numpy operation, as I did with monthly data. The best way around this in numpy that I can think of is to use a masked array with shape (number_of_months, 31, ...), where months with fewer than 31 days have masked values in the excess days. You can then get the monthly means very easily with something like new_array.mean(axis=1). This is far from an ideal solution, though, because 1) inserting the masked values into the array in the right spots can be tricky, and 2) it cannot be done on the original array in place, so yet another new array has to be allocated.
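A rough sketch of that idea (hypothetical variable names; it assumes the leading axis of data is daily time and that month_lengths lists the number of days in each month covered by the data):
{code}
import numpy as np
import numpy.ma as ma

n_months = len(month_lengths)          # e.g. [31, 28, 31, ...]
ny, nx = data.shape[1], data.shape[2]

# Fully masked (n_months, 31, ny, nx) array; months shorter than 31 days
# keep their excess days masked.
padded = ma.masked_all((n_months, 31, ny, nx), dtype=data.dtype)
start = 0
for m, ndays in enumerate(month_lengths):
    padded[m, :ndays] = data[start:start + ndays]
    start += ndays

# Monthly means in one vectorized operation; masked days are ignored.
monthly_means = padded.mean(axis=1)
{code}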
The absolute fastest way to do this would be to write your own Python extension in C, C++, or Fortran, where loops are much quicker, and then call a wrapper to the compiled code from Python. As we all know, though, that opens another large can of worms and would probably be a lot of work in and of itself. I'd say too much for just this particular issue, but in the long term it might be necessary if we want to support daily data.
Alex
> PERFORMANCE - Rebinning Daily to Monthly datasets takes a really long time
> --------------------------------------------------------------------------
>
> Key: CLIMATE-248
> URL: https://issues.apache.org/jira/browse/CLIMATE-248
> Project: Apache Open Climate Workbench
> Issue Type: Improvement
> Components: regridding
> Affects Versions: 0.1-incubating, 0.2-incubating
> Environment: *nix
> Reporter: Cameron Goodale
> Assignee: Cameron Goodale
> Labels: performance
> Fix For: 0.3-incubating
>
>
> When I was testing the dataset_processor module I noticed that most tests
> would complete in less than 1 second. Then I came across the
> "test_daily_to_monthly_rebin" test and it would take over 2 minutes to
> complete.
> The test initially used a 1x1 degree grid covering the globe and daily time
> step for 2 years (730 days).
> I ran some initial checks and the lag appears to be in the code where the
> data is rebinned, in '_rcmes_calc_average_on_new_time_unit_K'.
> {code}
> mask = np.zeros_like(data)
> mask[timeunits!=myunit,:,:] = 1.0
> # Calculate missing data mask within each time unit...
> datamask_at_this_timeunit = np.zeros_like(data)
> datamask_at_this_timeunit[:]= process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)
> # Store results for masking later
> datamask_store.append(datamask_at_this_timeunit[0])
> # Calculate means for each pixel in this time unit, ignoring missing data (using masked array).
> datam = ma.masked_array(data,np.logical_or(mask,datamask_at_this_timeunit))
> meanstore[i,:,:] = ma.average(datam,axis=0)
> {code}
> That block is suspect, since the rest of the code is doing simple string
> parsing and appending to lists. I don't have the time to do a deep dive into
> this now, and it technically isn't broken, just really slow.