[PR] Pax/support bloom filter pushdown [cloudberry]

via GitHub Wed, 27 Aug 2025 02:28:09 -0700


gongxun0928 opened a new pull request, #1333:
URL: https://github.com/apache/cloudberry/pull/1333


   <!-- Thank you for your contribution to Apache Cloudberry (Incubating)! -->
   this pr is base on https://github.com/apache/cloudberry/pull/1324
   
   feat: pushdown bloom filter to pax table am
   
   This optimization pushes down Bloom Filter conditions for runtime filters
   to the Pax Table AM layer.
   
   By applying the filter earlier than the SeqNext() function, it eliminates
   the overhead of converting data from columnar format to TableTupleSlot,
   resulting in faster query execution
   
   ```
   CREATE TABLE t2(c1 int, c2 int, c3 int, c4 int, c5 int) with 
(appendonly=true, orientation=column) distributed REPLICATED;
   INSERT INTO t2 VALUES (1,1,1,1,1), (2,2,2,2,2), (3,3,3,3,3), (4,4,4,4,4);
   INSERT INTO t2 select * FROM t2;
   INSERT INTO t2 select * FROM t2;
   INSERT INTO t2 select * FROM t2;
   CREATE TABLE t3(c1 int, c2 int, c3 int, c4 int, c5 int) using pax;
   
   
   gpadmin=# insert into t3 select b,i,i,i,i from generate_series(1,5) 
b,generate_series(1,2000000) i;
   
   INSERT 0 10000000
   gpadmin=#
   gpadmin=# analyze t3;
   ANALYZE
   gpadmin=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF)
   SELECT t3.c3 FROM t3, t2 WHERE t3.c2 = t2.c4;
                                           QUERY PLAN
   
-------------------------------------------------------------------------------------------
    Gather Motion 3:1  (slice1; segments: 3) (actual time=355.000..1482.000 
rows=160 loops=1)
      ->  Hash Join (actual time=0.000..1482.000 rows=96 loops=1)
            Hash Cond: (t3.c2 = t2.c4)
            Extra Text: (seg0)   Hash chain length 8.0 avg, 8 max, using 4 of 
524288 buckets.
            ->  Seq Scan on t3 (actual time=0.000..691.000 rows=6000000 loops=1)
            ->  Hash (actual time=0.000..0.000 rows=32 loops=1)
                  Buckets: 524288  Batches: 1  Memory Usage: 4098kB
                  ->  Seq Scan on t2 (actual time=0.000..0.000 rows=32 loops=1)
    Optimizer: GPORCA
   (9 rows)
   
   gpadmin=# set gp_enable_runtime_filter_pushdown to on;
   SET
   gpadmin=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF)
   SELECT t3.c3 FROM t3, t2 WHERE t3.c2 = t2.c4;
                                           QUERY PLAN
   
-------------------------------------------------------------------------------------------
    Gather Motion 3:1  (slice1; segments: 3) (actual time=99.000..303.000 
rows=160 loops=1)
      ->  Hash Join (actual time=1.000..303.000 rows=96 loops=1)
            Hash Cond: (t3.c2 = t2.c4)
            Extra Text: (seg0)   Hash chain length 8.0 avg, 8 max, using 4 of 
524288 buckets.
            ->  Seq Scan on t3 (actual time=1.000..301.000 rows=12 loops=1)
                  Rows Removed by Pushdown Runtime Filter: 5999988
            ->  Hash (actual time=0.000..0.000 rows=32 loops=1)
                  Buckets: 524288  Batches: 1  Memory Usage: 4098kB
                  ->  Seq Scan on t2 (actual time=0.000..0.000 rows=32 loops=1)
    Optimizer: GPORCA
   (10 rows)
   
   gpadmin=# set pax_enable_row_filter to on;
   SET
   gpadmin=# EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF)
   SELECT t3.c3 FROM t3, t2 WHERE t3.c2 = t2.c4;
                                           QUERY PLAN
   
-------------------------------------------------------------------------------------------
    Gather Motion 3:1  (slice1; segments: 3) (actual time=48.000..138.000 
rows=160 loops=1)
      ->  Hash Join (actual time=0.000..138.000 rows=96 loops=1)
            Hash Cond: (t3.c2 = t2.c4)
            Extra Text: (seg0)   Hash chain length 8.0 avg, 8 max, using 4 of 
524288 buckets.
            ->  Seq Scan on t3 (actual time=0.000..137.000 rows=12 loops=1)
                  Rows Removed by Pushdown Runtime Filter: 5999988
            ->  Hash (actual time=0.000..0.000 rows=32 loops=1)
                  Buckets: 524288  Batches: 1  Memory Usage: 4098kB
                  ->  Seq Scan on t2 (actual time=0.000..0.000 rows=32 loops=1)
    Optimizer: GPORCA
   (10 rows)
   ```
   
   
   ### What does this PR do?
   <!-- Brief overview of the changes, including any major features or fixes -->
   
   ### Type of Change
   - [ ] Bug fix (non-breaking change)
   - [ ] New feature (non-breaking change)
   - [ ] Breaking change (fix or feature with breaking changes)
   - [ ] Documentation update
   
   ### Breaking Changes
   <!-- Remove if not applicable. If yes, explain impact and migration path -->
   
   ### Test Plan
   <!-- How did you test these changes? -->
   - [ ] Unit tests added/updated
   - [ ] Integration tests added/updated
   - [ ] Passed `make installcheck`
   - [ ] Passed `make -C src/test installcheck-cbdb-parallel`
   
   ### Impact
   <!-- Remove sections that don't apply -->
   **Performance:**
   <!-- Any performance implications? -->
   
   **User-facing changes:**
   <!-- Any changes visible to users? -->
   
   **Dependencies:**
   <!-- New dependencies or version changes? -->
   
   ### Checklist
   - [ ] Followed [contribution 
guide](https://cloudberry.apache.org/contribute/code)
   - [ ] Added/updated documentation
   - [ ] Reviewed code for security implications
   - [ ] Requested review from [cloudberry 
committers](https://github.com/orgs/apache/teams/cloudberry-committers)
   
   ### Additional Context
   <!-- Any other information that would help reviewers? Remove if none -->
   
   ### CI Skip Instructions
   <!--
   To skip CI builds, add the appropriate CI skip identifier to your PR title.
   The identifier must:
   - Be in square brackets []
   - Include the word "ci" and either "skip" or "no"
   - Only use for documentation-only changes or when absolutely necessary
   -->
   
   ---
   <!-- Join our community:
   - Mailing list: 
[d...@cloudberry.apache.org](https://lists.apache.org/list.html?d...@cloudberry.apache.org)
 (subscribe: dev-subscr...@cloudberry.apache.org)
   - Discussions: https://github.com/apache/cloudberry/discussions -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cloudberry.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cloudberry.apache.org
For additional commands, e-mail: commits-h...@cloudberry.apache.org

[PR] Pax/support bloom filter pushdown [cloudberry]

Reply via email to