[ 
https://issues.apache.org/jira/browse/CALCITE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846227#comment-16846227
 ] 

Julian Hyde commented on CALCITE-2979:
--------------------------------------

By 'assigning keys to batches' I meant deciding when you have enough keys in a 
batch that it is worthwhile making a round-trip.

[~zabetak] described a strategy using BLK_SIZE. Let's BLK_SIZE is 1000. Then 
the first batch will consist of key 0 through key 999. The second batch will 
consist of key 1000 through key 1999.

But it's not the only possible strategy for assigning keys to batches. Let's 
suppose that the keys are date values, and you are probing into a table that is 
partitioned by month. You could construct batches that contain all keys that 
fall within a particular month, and therefore you would make at most one 
round-trip to each partition.

Other strategies are possible, if you generalize away from using fixed size 
batches with BLK_SIZE entires.

I will note that what you are doing is a form of work scheduling. (Fetching a 
row that corresponds to a particular key is an item of work, and of course 
grouping work items into batches is a well-known way to increase efficiency by 
trading off latency.) There are many strategies for work scheduling, such as 
[round-robin|https://en.wikipedia.org/wiki/Round-robin_scheduling]. Some of 
those strategies could be considered here.

Schedulers often distribute work via queues, and I have a feeling that queues 
could be useful in this algorithm, too.

> Add a block-based nested loop join algorithm
> --------------------------------------------
>
>                 Key: CALCITE-2979
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2979
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.19.0
>            Reporter: Stamatis Zampetakis
>            Assignee: Khawla Mouhoubi
>            Priority: Major
>              Labels: performance
>
> Currently, Calcite provides a tuple-based nested loop join algorithm 
> implemented through EnumerableCorrelate and EnumerableDefaults.correlateJoin. 
> This means that for each tuple of the outer relation we probe (set variables) 
> in the inner relation.
> The goal of this issue is to add new algorithm (or extend the correlateJoin 
> method) which first gathers blocks (batches) of tuples from the outer 
> relation and then probes the inner relation once per block.
> There are cases (eg., indexes) where the inner relation can be accessed by 
> more than one value which can greatly improve the performance in particular 
> when the outer relation is big.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to