[
https://issues.apache.org/jira/browse/CALCITE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842061#comment-16842061
]
Khawla Mouhoubi commented on CALCITE-2979:
------------------------------------------
I don't see a common solution for all types of joins but rather different
strategies depending on the join type.
For SEMI and ANTI an idea would be reading blocks from the inner table and
applying a filter to the outer table, just like [~rubenql] mentionned. These
two steps are enough since we do not need to keep data from inner table.
Drawback would be duplicates. How to handle duplicates is yet to be studied.
For INNER and LEFT join I see three options:
# Read blocks from the outer table and filter the inner one. In order to know
which correlation variable matched, return every correlation variable with
every tuple from the filtered inner table. Then apply another filter to get
only tuples where the correlation variable matches. Something like that:
{code:java}
// select * from books join author on books.fk = author.id
Filter(books.fk = author.id)
SimpleBlockCorrelate() // might generate false results, to be filtered by
above filter
Scan(author)
Filter(OR(=(cor0,book.fk), =(cor1,book.fk), =(cor2,book.fk))
Scan(book)
{code}
# Read blocks from the outer table and filter the inner table. Then filter the
block with each tuple from the filtered inner relation. Something like that:
{code:java}
// select * from books join author on books.fk = author.id
BlockCorrelate(condition: books.fk = author.id)
Scan(author)
Filter(OR(=(cor0,book.fk), =(cor1,book.fk), =(cor2,book.fk))
Scan(book)
// pseudo-code for block based NL
for corrVars in getBlockFromOuterEnumerable()
innerEnumerable = inner.apply(corrVars)
for inner in innerEnumerable
// Filter correlation variables
for cV in corrVars
if cV matches inner
// cV + inner is a result
{code}
So basically the two last steps (SimpleBlockCorrelate & Filter) of the previous
plan are merged into the BlockCorrelate.
# Keep current NL.
Advantages & drawbacks of each strategy:
|| ||Method 1||Method 2||Method 3||
||Description|Filter inner table with a block of correlation variables,
projecting each corr var with each tuple from filtered
table, re-filtering|Filter inner table with a block of correlation variables,
filter the block with each tuple from filtered inner table|Keep current NL|
||Advantages| Less scans than NL
New algorithm rather straightforward|Less scans than NL
BlockCorrelate only generate correct results| Keeps order|
||Drawbacks| SimpleBlockCorrelate generates false results
Doesn't keep order| Doesn't keep order
New algorithm not trivial| No improvement |
> Add a block-based nested loop join algorithm
> --------------------------------------------
>
> Key: CALCITE-2979
> URL: https://issues.apache.org/jira/browse/CALCITE-2979
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.19.0
> Reporter: Stamatis Zampetakis
> Assignee: Khawla Mouhoubi
> Priority: Major
> Labels: performance
>
> Currently, Calcite provides a tuple-based nested loop join algorithm
> implemented through EnumerableCorrelate and EnumerableDefaults.correlateJoin.
> This means that for each tuple of the outer relation we probe (set variables)
> in the inner relation.
> The goal of this issue is to add new algorithm (or extend the correlateJoin
> method) which first gathers blocks (batches) of tuples from the outer
> relation and then probes the inner relation once per block.
> There are cases (eg., indexes) where the inner relation can be accessed by
> more than one value which can greatly improve the performance in particular
> when the outer relation is big.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)