[ https://issues.apache.org/jira/browse/CLIMATE-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733654#comment-13733654 ]

Alex Goodman edited comment on CLIMATE-248 at 8/8/13 4:32 PM:
--------------------------------------------------------------

Cam,

Here are a few quick improvements I can think of:

First, there are several places where the code calls:
np.unique(timeunits)

You should make this call only once, store the result in a variable, and then 
reuse that variable wherever it is needed, as in the sketch below.
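
For example, here is a minimal sketch of hoisting that call out of the loop (the 
variable name unique_units is just illustrative):

{code}
unique_units = np.unique(timeunits)   # computed once, up front

meanstore = np.zeros([len(unique_units), data.shape[1], data.shape[2]])
for i, myunit in enumerate(unique_units):
    pass  # reuse unique_units here instead of calling np.unique() again
{code}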

Also, remove this line:
datamask_at_this_timeunit = np.zeros_like(data)

And change the following line to:
datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:], threshold=0.75)

A key thing you should know about numpy: constructing new arrays is expensive, 
so if it is not necessary we should not do it. Since 
process.create_mask_using_threshold() builds a new array anyway, there is no 
need to allocate another array of the same shape in advance. Likewise with 
np.unique(): a new array is allocated each time it gets called, so calling it 
only once should also save you some time. A before/after sketch follows.
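
For reference, here is a hedged before/after sketch of that change, in the 
context of the existing loop in _rcmes_calc_average_on_new_time_unit_K:

{code}
# Before: allocate an array the size of data, then immediately overwrite it.
datamask_at_this_timeunit = np.zeros_like(data)
datamask_at_this_timeunit[:] = process.create_mask_using_threshold(
    data[timeunits == myunit, :, :], threshold=0.75)

# After: bind the result directly; create_mask_using_threshold() already
# returns a new array, so no pre-allocation is needed.
datamask_at_this_timeunit = process.create_mask_using_threshold(
    data[timeunits == myunit, :, :], threshold=0.75)
{code}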

I don't expect these changes to give you much of a performance improvement, 
maybe about 30% at most. I suspect that the bottleneck comes from the rebinning 
being done in a loop rather than as a single vectorized numpy array operation. 
While it is well known that python loops are inherently slow, there are other 
costs at play too. One is that the output array has to be pre-allocated (line 
316):

meanstore = np.zeros([len(np.unique(timeunits)),data.shape[1],data.shape[2]])

I have already explained above why that allocation slows things down. 
Additionally, there is this line inside the loop:
datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:], threshold=0.75)

Specifically, the part where the boolean-indexed array is being created: 
data[timeunits==myunit,:,:]. This is an example of what numpy refers to as 
fancy indexing; the distinction is that such an index produces a copy of the 
data, while basic slice indexing (e.g. data[3:4]) only returns a view of the 
original array. 
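A small self-contained example of that copy-versus-view distinction (not OCW 
code, just an illustration):

{code}
import numpy as np

a = np.arange(6)

v = a[1:4]                  # basic slicing: a view onto a's memory
c = a[np.array([1, 2, 3])]  # fancy indexing: a freshly allocated copy

a[1] = 99
print(v[0])   # 99 -> the view reflects the change to a
print(c[0])   # 1  -> the copy does not
{code}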

Unfortunately, fixing this is not trivial because of the inherent nature of 
daily data: the months of the year do not all have the same number of days. As 
a result it is not possible to simply reshape the original data and then use 
one vectorized numpy operation, as I did with monthly data. The best way I 
could think of to get around this in numpy would be to use a masked array with 
shape (number_of_months, 31, ...), where months with fewer than 31 days have 
masked values in the excess days. You can then get the monthly means very 
easily with something like new_array.mean(axis=1). This is far from an ideal 
solution though, because 1) inserting the masked values into the right spots 
can be tricky, and 2) it cannot be done on the original array in place, so 
allocating yet another new array is necessary. A rough sketch of the idea 
follows. 
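
Here is a rough, hypothetical sketch of that idea (the names days_per_month, 
nlat and nlon are illustrative, not from the existing code):

{code}
import numpy as np
import numpy.ma as ma

days_per_month = [31, 28, 31]                  # example month lengths
nlat, nlon = 2, 3                              # tiny grid for illustration
data = np.random.rand(sum(days_per_month), nlat, nlon)

# Pad every month out to 31 days; the excess day slots stay masked.
n_months = len(days_per_month)
padded = ma.masked_all((n_months, 31, nlat, nlon), dtype=data.dtype)

start = 0
for m, ndays in enumerate(days_per_month):
    padded[m, :ndays] = data[start:start + ndays]
    start += ndays

# One vectorized reduction over the day axis; masked slots are ignored.
monthly_means = padded.mean(axis=1)            # shape (n_months, nlat, nlon)
{code}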

The absolute fastest way to do this, then, is to write your own python 
extension in C, C++, or Fortran, where loops are much quicker, and call a 
wrapper to the compiled code from python. As we all know, though, this opens 
another large can of worms and would probably be a lot of work in and of 
itself. I'd say too much for just this particular issue, but in the long term 
it might be necessary if we want to support daily data.

Alex
                
> PERFORMANCE - Rebinning Daily to Monthly datasets takes a really long time
> --------------------------------------------------------------------------
>
>                 Key: CLIMATE-248
>                 URL: https://issues.apache.org/jira/browse/CLIMATE-248
>             Project: Apache Open Climate Workbench
>          Issue Type: Improvement
>          Components: regridding
>    Affects Versions: 0.1-incubating, 0.2-incubating
>         Environment: *nix
>            Reporter: Cameron Goodale
>            Assignee: Cameron Goodale
>              Labels: performance
>             Fix For: 0.3-incubating
>
>
> When I was testing the dataset_processor module I noticed that most tests 
> would complete in less than 1 second.  Then I came across the 
> "test_daily_to_monthly_rebin" test and it would take over 2 minutes to 
> complete.
> The test initially used a 1x1 degree grid covering the globe and daily time 
> step for 2 years (730 days).
> I ran some initial checks and the lag appears to be down in the code where 
> the data is rebinned down in '_rcmes_calc_average_on_new_time_unit_K'.
> {code}
>                 mask = np.zeros_like(data)
>                 mask[timeunits!=myunit,:,:] = 1.0
>                 # Calculate missing data mask within each time unit...
>                 datamask_at_this_timeunit = np.zeros_like(data)
>                 datamask_at_this_timeunit[:] = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)
>                 # Store results for masking later
>                 datamask_store.append(datamask_at_this_timeunit[0])
>                 # Calculate means for each pixel in this time unit, ignoring missing data (using masked array).
>                 datam = ma.masked_array(data,np.logical_or(mask,datamask_at_this_timeunit))
>                 meanstore[i,:,:] = ma.average(datam,axis=0)
> {code}
> That block is suspect since the rest of the code is doing simple string 
> parsing and appending to lists.  I don't have the time to do a deep dive into 
> this now, and it technically isn't broken, just really slow.
