Speed in many such loops depends a lot on how the loops are ordered so that
cache and registers can be re-used. I have no idea what will make your
windowing functions fast, but I can say some things about what makes matrix
math fast.
The key with matrix multiplication is that there are on the order of n^3
multiply-add operations to do on only about 3n^2 elements (two inputs and
one output). The minimum number of memory operations is thus O(n^2), which
sounds good because modern CPUs can often do hundreds of arithmetic
operations per main memory access. The flip side is that a naive
implementation re-reads operands from memory over and over, which drives
the number of memory operations up to roughly k n^3 and leaves the code
memory bound.
To avoid that, the loops involved can be restructured so that larger and
larger blocks of data are reused. At the lowest level, small blocks of,
say, 2 x 4 values are used to code the multiplication since all of those
values can be kept in registers. One step up, the computation is
structured to operate only on tiles that fit in the fastest level of
cache, which is typically tens of kB in size.
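For illustration only (this is not Drill code, and the tile size is an
assumption that would need tuning per CPU), the blocking idea looks
roughly like this in Java:

```java
// Sketch of cache-blocked matrix multiplication (illustrative only).
// BLOCK is an assumed tile size chosen so that three BLOCK x BLOCK
// tiles fit comfortably in cache.
public class BlockedMatMul {
    static final int BLOCK = 64;

    // Accumulates a * b into c, for n x n matrices stored row-major.
    // c is expected to start zeroed.
    static void multiply(double[] a, double[] b, double[] c, int n) {
        for (int i0 = 0; i0 < n; i0 += BLOCK) {
            for (int k0 = 0; k0 < n; k0 += BLOCK) {
                for (int j0 = 0; j0 < n; j0 += BLOCK) {
                    int iMax = Math.min(i0 + BLOCK, n);
                    int kMax = Math.min(k0 + BLOCK, n);
                    int jMax = Math.min(j0 + BLOCK, n);
                    // Inner kernel touches only tile-sized pieces of
                    // a, b and c, so they stay resident in cache.
                    for (int i = i0; i < iMax; i++) {
                        for (int k = k0; k < kMax; k++) {
                            double aik = a[i * n + k]; // held in a register
                            for (int j = j0; j < jMax; j++) {
                                c[i * n + j] += aik * b[k * n + j];
                            }
                        }
                    }
                }
            }
        }
    }
}
```

The same answer comes out for any BLOCK value; only the memory traffic
changes.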
Your loop looks like this:

    for (start = 0 ... end - n) {
        initialize()
        for (offset = 0 ... n - 1) {
            aggregate(start + offset)
        }
        finalize()
    }
This arrangement is pretty cache friendly if n is small enough, but it
seems that it could be even friendlier if you kept all of the
aggregators at the ready and handed each sample to all of the open
aggregators before moving to the next position.
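A rough sketch of that restructuring in Java, with a hypothetical
SumAggregator standing in for whatever windowing function is actually
being computed (none of these names come from Drill):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: one pass over the data, many live aggregators.
public class SlidingWindows {
    // Hypothetical aggregator; a real window function would differ.
    static class SumAggregator {
        double sum = 0;
        void aggregate(double sample) { sum += sample; }
        double finish() { return sum; }
    }

    // Computes one aggregate per length-n window, touching each
    // sample exactly once and handing it to every open aggregator.
    static double[] slide(double[] data, int n) {
        double[] out = new double[data.length - n + 1];
        Deque<SumAggregator> open = new ArrayDeque<>();
        int finished = 0;
        for (int i = 0; i < data.length; i++) {
            // A new window starts at i as long as it can still
            // fit n samples.
            if (i <= data.length - n) {
                open.addLast(new SumAggregator());
            }
            // Hand the current sample to every open aggregator.
            for (SumAggregator agg : open) {
                agg.aggregate(data[i]);
            }
            // The oldest window started at i - n + 1 and is now full.
            if (i >= n - 1) {
                out[finished++] = open.removeFirst().finish();
            }
        }
        return out;
    }
}
```

Each sample is read from memory once instead of n times, at the cost of
keeping up to n small aggregator states live at a time.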
On Thu, Jun 11, 2015 at 3:55 PM, Abdel Hakim Deneche <[email protected]>
wrote:
> Hi all,
>
> The purpose of this email is to describe how window functions are computed
> and to try to come up with "better" ways to do it.
>
> DRILL-3200 <https://issues.apache.org/jira/browse/DRILL-3200> added
> support
> for RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK and CUME_DIST but also made
> some significant improvements to the way Drill computes window functions.
>
> The general idea was to update the code to only support the default frame
> which makes it run faster and use less memory.
>
> WindowFrameRecordBatch works similarly to StreamingAggregate: it requires
> the data to be sorted on the partition and order-by columns and only
> computes one frame at a time. With the default frame we only need to
> aggregate every row once.
> Memory consumption depends on the data, but in general each record batch is
> kept in memory until we are ready to process all its rows (which is
> possible when we find the last peer row of the batch's last row). Drill's
> external sort can spill to disk if the data is too big, and we only need to
> keep at most one partition's worth of data in memory for the window
> functions to be computed (when the over clause doesn't contain an order by).
>
> Each time a batch is ready to be processed we do the following:
>
> 1- we start with its first row (current row),
> 2- we compute the length of the current row's frame (in this case we find
> the number of peer rows for the current row),
> 3- we aggregate (this includes computing the window function values) all
> rows of the current frame
> 4- we write the aggregated value in each row of the current frame.
> 5- we then move to the 1st non-peer row, which becomes the current row,
> 6- if we didn't reach the end of the current batch, go back to 2.
>
> With all this in mind, how can we improve the performance of window
> functions ?
>
> Thanks!
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
> <http://www.mapr.com/>
>