my-ship-it commented on issue #1234:
URL: https://github.com/apache/cloudberry/issues/1234#issuecomment-3082372282

   > Yes!
   > 
   > I want to believe that one day we could reduce the number of rows in SubqueryScan by using forward and backward filter passes :)
   > 
   > That's what inspires me: [Debunking the Myth of Join Ordering: Toward Robust SQL Analytics](https://arxiv.org/pdf/2502.15181)
   > 
   > Here is the DuckDB implementation: [duckdb/duckdb#17326](https://github.com/duckdb/duckdb/pull/17326)
   > 
   > What I cannot understand right now is how to use a Bloom filter in an MPP environment. Is it enough to create local Bloom filters?
   
   Hi @leborchuk, thanks for the suggestions and for pointing to other reference implementations in the industry; they are helpful.
   
   In Cloudberry, for queries with a Motion, a Bloom filter can be applied after the Motion to reduce the number of rows entering the Hash Join and improve performance. For example:
   
   ```
   
-----------------------------------------------------------------------------------------------------
    Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1582.50 rows=4203 width=8)
      ->  Hash Join  (cost=0.00..1582.37 rows=1401 width=8)
            Hash Cond: ((r.v + 1) = l.v)
            ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..1148.63 rows=14933 width=4)
                  Hash Key: (r.v + 1)
                    Rows Removed by Pushdown Runtime Filter: 127
                  ->  Seq Scan on tbl2 r  (cost=0.00..1148.44 rows=14933 width=4)
            ->  Hash  (cost=431.01..431.01 rows=334 width=4)
                  ->  Seq Scan on tbl1 l  (cost=0.00..431.01 rows=334 width=4)
    Optimizer: GPORCA
   
   ```
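
A minimal sketch of what filtering after the Motion amounts to, written in Python purely for illustration. It is not Cloudberry's executor code: the `BloomFilter` class, its sizes, and the `probe_after_motion` helper are all hypothetical. Each segment builds a filter from the keys in its own hash table and checks the rows it has already received.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; bit count and hash scheme are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def probe_after_motion(build_keys, received_probe_keys):
    """Filter probe rows on the receiving segment, after they crossed the Motion."""
    bloom = BloomFilter()
    for key in build_keys:
        bloom.add(key)
    kept = [k for k in received_probe_keys if bloom.might_contain(k)]
    removed = len(received_probe_keys) - len(kept)
    return kept, removed


if __name__ == "__main__":
    kept, removed = probe_after_motion([1, 2, 3], [1, 2, 3, 100, 200, 300])
    print("kept:", kept, "removed:", removed)
```

Every row in `received_probe_keys` has already been transferred by the time it is tested, so the filter saves hash-table probes but no network traffic.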
   
   However, the volume of data moved remains relatively large, because rows are filtered only after they have passed through the Motion. A better approach is to filter before the data is sent, like:
   ```
   
-----------------------------------------------------------------------------------------------------
    Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1582.50 rows=4203 width=8)
      ->  Hash Join  (cost=0.00..1582.37 rows=1401 width=8)
            Hash Cond: ((r.v + 1) = l.v)
            ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..1148.63 rows=14933 width=4)
                  Hash Key: (r.v + 1)
                  ->  Seq Scan on tbl2 r  (cost=0.00..1148.44 rows=14933 width=4)
                            Rows Removed by Pushdown Runtime Filter: 127
            ->  Hash  (cost=431.01..431.01 rows=334 width=4)
                  ->  Seq Scan on tbl1 l  (cost=0.00..431.01 rows=334 width=4)
    Optimizer: GPORCA
   
   ```
   
   The obstacle is that Motion in Cloudberry only supports one-way transmission of data from the sender to the receiver; the receiver cannot send data back to the sender. Could we modify Motion to support sending the Bloom filter from the receiver to the sender?
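
On the question of whether local Bloom filters are enough: a rough Python sketch, assuming a hypothetical back channel that does not exist in Motion today. A sending segment's rows can be routed to any receiving segment, so the sender would need the union of all receivers' filters, which for equally sized filters with the same hash functions is a bitwise OR. All names and sizes below are illustrative, not Cloudberry APIs.

```python
import hashlib

NUM_BITS = 1024   # assumed filter size, identical on every segment
NUM_HASHES = 3    # assumed number of hash functions, identical on every segment


def bit_positions(key):
    """Bit positions for a key, derived from salted SHA-256 digests."""
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def build_filter(keys):
    """Build one segment's local filter from its hash-table keys."""
    bits = 0
    for key in keys:
        for pos in bit_positions(key):
            bits |= 1 << pos
    return bits


def might_contain(bits, key):
    return all((bits >> pos) & 1 for pos in bit_positions(key))


def merge_filters(local_filters):
    """Union of same-shaped Bloom filters is the bitwise OR of their bits."""
    merged = 0
    for bits in local_filters:
        merged |= bits
    return merged


def filter_before_motion(probe_keys, merged_bits):
    """Sender-side check: drop rows that cannot match on any receiver."""
    return [k for k in probe_keys if might_contain(merged_bits, k)]


if __name__ == "__main__":
    # Build-side keys as they might be spread over three segments.
    build_by_segment = {0: [3, 6], 1: [1, 4], 2: [2, 5]}
    local_filters = [build_filter(keys) for keys in build_by_segment.values()]
    merged = merge_filters(local_filters)
    to_send = filter_before_motion(list(range(1, 20)), merged)
    print("rows sent through Motion:", to_send)
```

Because a Bloom filter has no false negatives, the merged filter never drops a row that matches on some segment. The cost is a higher false-positive rate than each local filter, plus the step this sketch leaves out entirely: getting the filter bits from the receivers back to the senders.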


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

