my-ship-it commented on issue #1234: URL: https://github.com/apache/cloudberry/issues/1234#issuecomment-3116302842
> > > Yes!
> > > I want to believe we could one day reduce the number of rows in SubqueryScan, and use forward and backward filter passes )
> > > That's what inspires me - [Debunking the Myth of Join Ordering: Toward Robust SQL Analytics](https://arxiv.org/pdf/2502.15181)
> > > Here is the DuckDB implementation - [duckdb/duckdb#17326](https://github.com/duckdb/duckdb/pull/17326)
> > > What I cannot understand right now - how to use a bloom filter in an MPP environment? Is it enough to create local bloom filters?
> >
> > Hi [@leborchuk](https://github.com/leborchuk), thanks for the suggestions and for pointing to other reference implementations in the industry; they are helpful.
> > In Cloudberry, for queries with Motion, a bloom filter can be enabled after Motion to reduce the amount of data entering the Hash Join and improve performance, for example:
> > ```
> > -----------------------------------------------------------------------------------------------------
> >  Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..1582.50 rows=4203 width=8)
> >    -> Hash Join (cost=0.00..1582.37 rows=1401 width=8)
> >          Hash Cond: ((r.v + 1) = l.v)
> >          -> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..1148.63 rows=14933 width=4)
> >                Hash Key: (r.v + 1)
> >                Rows Removed by Pushdown Runtime Filter: 127
> >                -> Seq Scan on tbl2 r (cost=0.00..1148.44 rows=14933 width=4)
> >          -> Hash (cost=431.01..431.01 rows=334 width=4)
> >                -> Seq Scan on tbl1 l (cost=0.00..431.01 rows=334 width=4)
> >  Optimizer: GPORCA
> > ```
> >
> > However, the data motion volume remains relatively large.
> > A better approach is to perform filtering before data is sent, like:
> > ```
> > -----------------------------------------------------------------------------------------------------
> >  Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..1582.50 rows=4203 width=8)
> >    -> Hash Join (cost=0.00..1582.37 rows=1401 width=8)
> >          Hash Cond: ((r.v + 1) = l.v)
> >          -> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..1148.63 rows=14933 width=4)
> >                Hash Key: (r.v + 1)
> >                -> Seq Scan on tbl2 r (cost=0.00..1148.44 rows=14933 width=4)
> >                      Rows Removed by Pushdown Runtime Filter: 127
> >          -> Hash (cost=431.01..431.01 rows=334 width=4)
> >                -> Seq Scan on tbl1 l (cost=0.00..431.01 rows=334 width=4)
> >  Optimizer: GPORCA
> > ```
> >
> > Motion in Cloudberry only supports unidirectional transmission of data from the sender to the receiver; it cannot send data from the receiver to the sender. Unless we can modify Motion to support sending the Bloom filter from the receiver to the sender?
> >
> > For this issue, I believe we have a more elegant and broadly applicable solution that is suitable not only for sharing bloom filter information across nodes, but also for globally sharing all runtime state information.
> >
> > There is a paper we can learn from: [Anser: Adaptive Information Sharing Framework of AnalyticDB](https://dl.acm.org/doi/10.14778/3611540.3611553). Thanks to Lirong for the valuable insights. The paper is worth further study; it offers inspiration on how to share information in distributed systems.
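
To make the question raised in the thread concrete (is a local bloom filter enough in an MPP environment?), here is a minimal Python sketch of one possible answer: each segment builds a local filter over its build-side keys, the local filters are OR-ed into one global filter, and the sending side applies that filter before rows enter the Redistribute Motion. This is only an illustration, not Cloudberry's implementation. The names `BloomFilter`, `build_global_filter`, and `filter_before_motion` are hypothetical, the sample data is made up, and the sketch simply assumes the merged filter can be delivered to the senders, which is exactly the part the thread says Motion does not support today.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter that uses a Python int as its bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def union(self, other):
        # OR-ing the bit arrays yields the filter of the combined key sets.
        # Both filters must share the same size and hash count.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("incompatible Bloom filters")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def build_global_filter(build_keys_per_segment):
    """Build one local filter per segment, then merge them into one."""
    merged = BloomFilter()
    for keys in build_keys_per_segment:
        local = BloomFilter()
        for key in keys:
            local.add(key)
        merged = merged.union(local)
    return merged


def filter_before_motion(probe_rows, global_filter):
    """Keep only probe rows whose join key (v + 1) may match a build key."""
    return [v for v in probe_rows if global_filter.might_contain(v + 1)]


# Hypothetical data: tbl1.v values held by three segments (the build side).
build_side = [[1, 4], [2, 5], [3, 6]]
global_filter = build_global_filter(build_side)

# tbl2.v values on one sending segment; the join condition is r.v + 1 = l.v.
probe_rows = [0, 1, 2, 100, 200, 300]
to_send = filter_before_motion(probe_rows, global_filter)
print(to_send)
```

Rows 0, 1, and 2 are always kept, because a Bloom filter has no false negatives. The other rows are dropped unless they hit a false positive, so the join result stays correct either way. A purely local filter would not be enough on the sending side in this sketch, since a row scanned on one segment may join with a key stored on another.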
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
