[ 
https://issues.apache.org/jira/browse/CALCITE-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842061#comment-16842061
 ] 

Khawla Mouhoubi commented on CALCITE-2979:
------------------------------------------

I don't see a common solution for all types of joins but rather different 
strategies depending on the join type.

For SEMI and ANTI an idea would be reading blocks from the inner table and 
applying a filter to the outer table, just like [~rubenql] mentionned. These 
two steps are enough since we do not need to keep data from inner table. 
Drawback would be duplicates. How to handle duplicates is yet to be studied.

For INNER and LEFT join I see three options:
 # Read blocks from the outer table and filter the inner one. In order to know 
which correlation variable matched, return every correlation variable with 
every tuple from the filtered inner table. Then apply another filter to get 
only tuples where the correlation variable matches. Something like that:
{code:java}
// select * from books join author on books.fk = author.id
Filter(books.fk = author.id)
  SimpleBlockCorrelate() // might generate false results, to be filtered by 
above filter
     Scan(author)
     Filter(OR(=(cor0,book.fk), =(cor1,book.fk), =(cor2,book.fk))
       Scan(book)
{code}

 # Read blocks from the outer table and filter the inner table. Then filter the 
block with each tuple from the filtered inner relation. Something like that:
{code:java}
// select * from books join author on books.fk = author.id
  BlockCorrelate(condition: books.fk = author.id)
     Scan(author)
     Filter(OR(=(cor0,book.fk), =(cor1,book.fk), =(cor2,book.fk))
       Scan(book)

// pseudo-code for block based NL

for corrVars in getBlockFromOuterEnumerable()
   innerEnumerable = inner.apply(corrVars)
   for inner in innerEnumerable
      // Filter correlation variables 
      for cV in corrVars
          if cV matches inner
             // cV + inner is a result
{code}
So basically the two last steps (SimpleBlockCorrelate & Filter) of the previous 
plan are merged into the BlockCorrelate.

 # Keep current NL.

Advantages & drawbacks of each strategy:
|| ||Method 1||Method 2||Method 3||
||Description|Filter inner table with a block of correlation variables, 
 projecting each corr var with each tuple from filtered
 table, re-filtering|Filter inner table with a block of correlation variables,
 filter the block with each tuple from filtered inner table|Keep current NL|
||Advantages| Less scans than NL
 New algorithm rather straightforward|Less scans than NL
 BlockCorrelate only generate correct results| Keeps order|
||Drawbacks| SimpleBlockCorrelate generates false results
 Doesn't keep order| Doesn't keep order
 New algorithm not trivial| No improvement |

> Add a block-based nested loop join algorithm
> --------------------------------------------
>
>                 Key: CALCITE-2979
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2979
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.19.0
>            Reporter: Stamatis Zampetakis
>            Assignee: Khawla Mouhoubi
>            Priority: Major
>              Labels: performance
>
> Currently, Calcite provides a tuple-based nested loop join algorithm 
> implemented through EnumerableCorrelate and EnumerableDefaults.correlateJoin. 
> This means that for each tuple of the outer relation we probe (set variables) 
> in the inner relation.
> The goal of this issue is to add new algorithm (or extend the correlateJoin 
> method) which first gathers blocks (batches) of tuples from the outer 
> relation and then probes the inner relation once per block.
> There are cases (eg., indexes) where the inner relation can be accessed by 
> more than one value which can greatly improve the performance in particular 
> when the outer relation is big.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to