Hi Jinsong,
This topic requires discussion, hence it wasn't directly addressed in the PIP. I believe the type of sorting algorithm to use depends on the number of fields specified by the user for comparison. When only one comparison field is specified, it's best to use basic data types for direct comparison for the most accurate results. For multiple comparison fields, both the Z-order curve and Hilbert curve algorithms are suitable. In such cases, data maintains a certain level of order in any comparison field. Generally, the computation cost of the Z-order curve algorithm is lower than that of the Hilbert curve algorithm. However, in high-dimensional scenarios, the Hilbert curve has an advantage in maintaining data clustering. Therefore, I propose an automatic selection based on the number of comparison columns: 1 column: Basic type comparison algorithm. Less than 5 columns: Z-order curve algorithm. 5 or more columns: Hilbert curve algorithm. The threshold of 5 columns is based on Ververica's practice with Paimon Append Scalable tables, which was also discussed offline with Junhao Ye. In addition to automatic configuration, users can fine-tune for specific scenarios by explicitly specifying the desired comparison strategy. WDYT? Best, Wencong At 2024-04-22 15:08:09, "Jingsong Li" <[email protected]> wrote: >Hi Wencong, > >Mostly looks good to me. > >"it will automatically determine the algorithm based on the number of >columns in 'sink.clustering.by-columns'. " > >Please describe this clearly in the `Description`. > >Best, >Jingsong > >On Mon, Apr 22, 2024 at 2:36 PM Wencong Liu <[email protected]> wrote: >> >> Hi devs, >> >> >> >> >> I'm proposing a new feature to introduce range partitioning and sorting in >> append scalable table >> >> writing for Flink. The goal is to optimize query performance by reducing >> data scans on large datasets. >> >> >> >> >> The proposal includes: >> >> >> >> >> 1. Configurable range partitioning and sorting during data writing which >> allows for >> >> a more efficient data distribution strategy. >> >> >> >> >> 2. Introduction of new configurations that will enable users to specify >> columns for >> >> comparison, choose a comparison algorithm for range partitioning, and >> further sort each >> >> partition if required. >> >> >> >> >> 3. Detailed explanation of the division of processing steps when range >> partitioning >> >> is enabled and the conditional inclusion of the sorting phase. >> >> >> >> >> Looking forward to discussing this in the upcoming PIP [1]. >> >> >> >> >> Best regards, >> >> Wencong Liu >> >> >> >> >> [1] >> https://cwiki.apache.org/confluence/display/PAIMON/PIP-21%3A+Introduce+Range+Partition+And+Sort+in+Append+Scalable+Table+Batch+Writing+for+Flink
