[
https://issues.apache.org/jira/browse/CALCITE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839358#comment-16839358
]
Ruben Quesada Lopez commented on CALCITE-2979:
----------------------------------------------
[~zabetak], I have a general question regarding the design of this new
block-based nested loop join algorithm. Let use say that the plan that you just
wrote in your last comment implements an inner join:
{code}
Join(A.id > B.id; type=INNER)
Scan(A)
Scan(B)
==>
NestedLoop(blockSize=3)
Scan(A)
ElasticScan(table=B, query="OR(>(cor0[0],B.id), >(cor0[1],B.id),
>(cor0[2],B.id)")
{code}
The result of the NestedLoop operator must contain a tupleA concatenated with a
tupleB that match. If we have a block of correlated variables, and we translate
the join condition as a filter with an OR conditions, we would get the
appropriate tupleB, but how does the NestedLoop know with which tuple(s) from A
it must be concatenated. We know for sure that the tupleB match with cor0[0] or
cor0[1] or cor0[2], but which one(s) exactly? Maybe I'm missing something, but
this information will not be available from NestedLoop perspective.
The only scenario that I can think of where this strategy could work is a
correlated SEMI (or ANTI) join, because only the left part is returned after
the first match (or no match at all) with the right part, and we do not care
which exact tuple from B matched. But in order to implement such strategy, we
would to inverse the correlated variable logic, and set it from right to left
(instead of left to right):
{code}
-- fetch the depts with at least one emp
Join (dept.id=emp.dptoId, type=SEMI)
Scan (dept)
Scan (emp)
==>
NestedLoop(blockSize=3, $cor0=emp)
Filter( OR( =(cor0[0].dptId, dept.id), =(cor0[1].dptId, dept.id),
=(cor0[2].dptId, dept.id) )
Scan(dept)
Scan(emp)
==>
NestedLoop(blockSize=3, $cor0=emp)
ElasticScan(table=dept, query="OR( =(cor0[0].dptId, dept.id),
=(cor0[1].dptId, dept.id), =(cor0[2].dptId, dept.id)" )
Scan(emp)
{code}
> Add a block-based nested loop join algorithm
> --------------------------------------------
>
> Key: CALCITE-2979
> URL: https://issues.apache.org/jira/browse/CALCITE-2979
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.19.0
> Reporter: Stamatis Zampetakis
> Assignee: Khawla Mouhoubi
> Priority: Major
> Labels: performance
>
> Currently, Calcite provides a tuple-based nested loop join algorithm
> implemented through EnumerableCorrelate and EnumerableDefaults.correlateJoin.
> This means that for each tuple of the outer relation we probe (set variables)
> in the inner relation.
> The goal of this issue is to add new algorithm (or extend the correlateJoin
> method) which first gathers blocks (batches) of tuples from the outer
> relation and then probes the inner relation once per block.
> There are cases (eg., indexes) where the inner relation can be accessed by
> more than one value which can greatly improve the performance in particular
> when the outer relation is big.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)