Hi everyone,

I'd like to start a discussion about PIP-36: Introduce Incremental Clustering 
for Paimon Append Table [1].


Paimon currently supports ordering append tables using SFC (Space-Filling 
Curve) [2]. The resulting data layout typically delivers better performance for 
queries that target the clustering keys. However, with the current SortCompact, 
each run rewrites the entire dataset even when neither the data nor the 
clustering keys have changed, which is extremely costly. To address this, we 
plan to introduce a more flexible mechanism: Incremental Clustering. On each 
run, it selects only a specific subset of files to cluster, avoiding a full 
rewrite. This enables low-cost, sort-based optimization of the data layout and 
improves query performance. In addition, with Incremental Clustering you can 
adjust clustering keys without rewriting existing data. The layout evolves 
dynamically as clustering runs and gradually converges to an optimal state, 
which significantly reduces the decision-making complexity around data layout.


Incremental Clustering supports:

  *   Incremental clustering, minimizing write amplification as much as 
possible.
  *   Small-file compaction; rewrites respect target-file-size.
  *   Changing clustering keys; newly ingested data is clustered according to 
the latest clustering keys.
  *   A full mode; when selected, the entire dataset is reclustered.
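
To make the difference between an incremental run and a full run concrete, 
here is a toy sketch. It is not the PIP-36 design or Paimon code: the DataFile 
class, its clustered flag, and the pick_files and rewritten_bytes helpers are 
all hypothetical names made up for illustration, and the actual file-selection 
strategy is the one described in PIP-36 [1].

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file's metadata (not a Paimon class)."""
    name: str
    size_bytes: int
    clustered: bool  # assumption: True if an earlier clustering run wrote this file


def pick_files(files: List[DataFile], full: bool = False) -> List[DataFile]:
    """Return the files a single run would rewrite in this toy model."""
    if full:
        # Full mode: the entire dataset is reclustered.
        return list(files)
    # Incremental mode: only files that have not been clustered yet.
    return [f for f in files if not f.clustered]


def rewritten_bytes(picked: List[DataFile]) -> int:
    """Bytes rewritten by a run; a rough proxy for write amplification."""
    return sum(f.size_bytes for f in picked)


files = [
    DataFile("part-0", 128, clustered=True),   # written by an earlier run
    DataFile("part-1", 4, clustered=False),    # newly ingested
    DataFile("part-2", 8, clustered=False),    # newly ingested
]

incremental = pick_files(files)
full = pick_files(files, full=True)
```

In this sketch the incremental run touches only the two newly ingested files, 
while the full run rewrites all three. Because the selection does not look at 
the clustering keys, changing the keys leaves already clustered files in place 
and only affects how newly picked files would be sorted.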


The detailed design and PoC results can be found in PIP-36 [1].


Looking forward to your feedback, thanks!


[1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-36%3A+Introduce+Incremental+Clustering+for+Paimon+Append+Table
[2] https://paimon.apache.org/docs/master/maintenance/dedicated-compaction/#sort-compact

Best,

Lei Li
