avamingli opened a new pull request, #990:
URL: https://github.com/apache/cloudberry/pull/990

   Previously, write operations on partitioned tables would alter the 
materialized view (MV) status of both ancestor and descendant tables, leading 
to unnecessary invalidations of MVs that depended on unaffected partitions. 
   
   This commit introduces a more efficient approach to updating MV statuses for 
partitioned tables by focusing on leaf partitions that have actually undergone 
data changes.
   
   ## example:
   
   ```sql
   create table par(a int, b int, c int) partition by range(b)
       subpartition by range(c) subpartition template (start (1) end (3) every 
(1))
       (start(1) end(3) every(1));
   insert into par values(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2);
   create materialized view mv_par as select * from par;
   create materialized view mv_par1 as select * from  par_1_prt_1;
   create materialized view mv_par1_1 as select * from par_1_prt_1_2_prt_1;
   create materialized view mv_par1_2 as select * from par_1_prt_1_2_prt_2;
   create materialized view mv_par2 as select * from  par_1_prt_2;
   create materialized view mv_par2_2 as select * from  par_1_prt_2_2_prt2;
   ```
   
   ![mv_par 
drawio](https://github.com/user-attachments/assets/9688221a-1822-4010-bf0a-67d5f9a0c8cc)
   
   ### INSERT INTO Partitioned Table
   ```sql
   INSERT INTO P1 VALUES (1, 1, 1);
   ```
   #### before this commit:
   
   ![old_insert 
drawio](https://github.com/user-attachments/assets/1f77a1c7-964a-4892-a74b-3edf9ddc503f)
   
   It is known that data was successfully inserted into P1, which is a 
descendant node of P0. Due to the increased data, the status needs to be 
propagated upwards to invalidate the materialized view MV_0 based on P0.
   
   At the same time, P1 is a partitioned table, and it also needs to propagate 
the status downwards to its child tables. Since it is unclear which child 
tables have undergone data changes, all materialized views of the child tables 
must be invalidated.
   
   Consequently, the materialized views MV_0, MV_1, MV_1_1, and MV_1_2 are all 
invalidated.
   
   #### after this commit:
   
   ![new_insert 
drawio](https://github.com/user-attachments/assets/ce28995b-eaaf-4266-9e97-e400997590dd)
   
   When data is inserted into a partitioned table, it is ultimately routed to 
the child tables at the leaf nodes. The system monitors these leaf nodes for 
any data changes and propagates the status upward, while ignoring the changes 
directly made to the partition table P0 in the SQL.
   
   The tuple (1, 1, 1) will be routed to the partition P1_1, invalidating the 
materialized views of P1_1 and its ancestor partition tables. In contrast, the 
other sub-partition P1_2 experiences no data changes, so the materialized view 
MV_1_2 remains valid.
   
   As a result, the materialized views MV_0, MV_1, and MV_1_1 are invalidated, 
while MV_1_2 remains unaffected.
   
   The new approach allows for status propagation only from the leaf nodes 
upward, rather than sending status updates in both directions through the 
intermediate nodes of the partition tree. This safeguards unrelated 
materialized views from unnecessary refreshes.
   
   ### (Split) Across Partition Update on root table
   ```sql
   UPDATE P0 SET c = 2 WHERE b = 2 AND c = 1;
   ```
   #### before this commit:
   
   ![old_update 
drawio](https://github.com/user-attachments/assets/80caa835-e5b5-4cc6-a095-1519720218dd)
   
   When the data in the root node is updated, and since it is unclear which 
specific child table was updated, the status is propagated downwards to all 
descendant child tables.
   
   As a result, all materialized views based on the partition tree are 
invalidated: MV_0, MV_1, MV_1_1, MV_1_2, and MV_2_2.
   
   #### after this commit:
   
   ![new_update 
drawio](https://github.com/user-attachments/assets/32c51952-a78f-4d23-9c47-97090a808f11)
   
   Since `c` is the partition key, updating the partition key will cause data 
movement **across partitions** in the child tables. The **UPDATE** operation on 
the parent table will be translated into an **INSERT** for one child table and 
a **DELETE** for another.
   
   The system monitors the changed leaf node child tables: P2_1 and P2_2, and 
only propagates their respective statuses upward, invalidating the materialized 
views based on their ancestor tables. The materialized views related to the 
left subtree P1 remain unaffected.
   
   As a result, MV_0 and MV_2_2 are invalidated, while MV_1, MV_1_1, and MV_1_2 
remain valid.
   
   ## Extend protocol of libpq
   
   The executor (QE) records the modified partition child tables during the 
data modification process. A mechanism is needed to transmit these results to 
the QD for updating the metadata. The libpq protocol has been extended to 
support this data transfer.
   
   ### Processing on the QE and QD sides:
   
   ![qe_qd 
drawio](https://github.com/user-attachments/assets/419bb47e-83b0-43ab-8f5a-2299dcab8d5d)
   
   Each QE records the IDs of the child tables it has modified, deduplicates 
them in a local bitmap, and sends this information to the QD.
   
   The data changes on each segment node may not be the same. For example, in 
the diagram, seg0 has changes in two tables, while seg1 has no changes (as the 
data to be updated is not located on the seg1 node due to distribution).
   
   The QD collects multiple bitmaps returned from the segment nodes and 
consolidates them into a single bitmap, removing duplicate relids. This 
deduplicated bitmap is then used to update the status of the ancestor nodes.
   
   ### Extend Protocol Design
   
   ![libpq 
drawio](https://github.com/user-attachments/assets/301e7de8-dd21-4b7e-a4af-f875f9d527ea)
   
   Based on libpq, the protocol has been extended with an **'e'** protocol, 
which stands for "extend protocol," to unify the handling of information 
returns.
   
   Each transmission begins with the **'e'** protocol, followed immediately by 
data blocks. A single transmission can carry multiple data blocks.
   
   Each data block corresponds to a specific type of data and is distinguished 
by a **subtag**. Following the subtag is a 32-bit integer, **num**, which 
indicates the total number of relids in the subsequent array.
   
   The relid array follows the num value. After all data blocks have been 
written, the transmission concludes with a **subtag_max**.
   
   The **subtag_max**, which belongs to the same enum category as the 
**subtag**, is specifically used to indicate the termination of an extended 
protocol transmission.
   
   ## Key changes:
   
   - Leaf Partition Tracking: Instead of updating the MV status based on the 
parent table, we now detect and track changes at the leaf partition level. This 
ensures that only partitions with actual data modifications trigger MV status 
updates, significantly reducing unnecessary refreshes.
   
   - Cross-Partition Updates: For operations like UPDATEs that span multiple 
partitions (e.g., decomposed into an INSERT on one leaf and a DELETE on 
another), the MV status is updated based on the real data state of the affected 
leaf partitions, rather than propagating changes through the parent table. This 
avoids invalidating MVs that depend on unrelated partitions.
   
   - Executor Enhancements: The executor now detects dynamic partition 
expansion during query execution and records data modification bitmaps for each 
partition on the QE. These bitmaps are aggregated on the QD to update MV 
statuses accurately.
   
   - Protocol Extension: The libpq protocol has been extended to handle QE 
feedback uniformly, ensuring that partition-level modification information is 
properly collected and consumed by the QD.
   
   This optimization minimizes the impact on MVs by ensuring that only relevant 
partitions trigger status updates, improving performance and reducing 
unnecessary invalidations.
   
   Authored-by: Zhang Mingli [email protected]
   
   <!-- Thank you for your contribution to Apache Cloudberry (Incubating)! -->
   
   Fixes #ISSUE_Number
   
   ### What does this PR do?
   <!-- Brief overview of the changes, including any major features or fixes -->
   
   ### Type of Change
   - [ ] Bug fix (non-breaking change)
   - [ ] New feature (non-breaking change)
   - [ ] Breaking change (fix or feature with breaking changes)
   - [ ] Documentation update
   
   ### Breaking Changes
   <!-- Remove if not applicable. If yes, explain impact and migration path -->
   
   ### Test Plan
   <!-- How did you test these changes? -->
   - [ ] Unit tests added/updated
   - [ ] Integration tests added/updated
   - [ ] Passed `make installcheck`
   - [ ] Passed `make -C src/test installcheck-cbdb-parallel`
   
   ### Impact
   <!-- Remove sections that don't apply -->
   **Performance:**
   <!-- Any performance implications? -->
   
   **User-facing changes:**
   <!-- Any changes visible to users? -->
   
   **Dependencies:**
   <!-- New dependencies or version changes? -->
   
   ### Checklist
   - [ ] Followed [contribution 
guide](https://cloudberry.apache.org/contribute/code)
   - [ ] Added/updated documentation
   - [ ] Reviewed code for security implications
   - [ ] Requested review from [cloudberry 
committers](https://github.com/orgs/apache/teams/cloudberry-committers)
   
   ### Additional Context
   <!-- Any other information that would help reviewers? Remove if none -->
   
   ### CI Skip Instructions
   <!--
   To skip CI builds, add the appropriate CI skip identifier to your PR title.
   The identifier must:
   - Be in square brackets []
   - Include the word "ci" and either "skip" or "no"
   - Only use for documentation-only changes or when absolutely necessary
   -->
   
   ---
   <!-- Join our community:
   - Mailing list: 
[[email protected]](https://lists.apache.org/[email protected])
 (subscribe: [email protected])
   - Discussions: https://github.com/apache/cloudberry/discussions -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to