Hi Jinsong,



This topic requires discussion, hence it wasn't directly addressed in the PIP.



I believe the type of sorting algorithm to use depends on the number of 
fields specified by the user for comparison. When only one comparison field is 
specified, it's best to use basic data types for direct comparison for the most 
accurate 
results. For multiple comparison fields, both the Z-order curve and Hilbert 
curve algorithms 
are suitable. In such cases, data maintains a certain level of order in any 
comparison 
field. Generally, the computation cost of the Z-order curve algorithm is lower 
than that of the Hilbert curve algorithm. However, in high-dimensional 
scenarios, the Hilbert curve has an advantage in maintaining data clustering.


Therefore, I propose an automatic selection based on the number of 
comparison columns:




1 column: Basic type comparison algorithm.

Less than 5 columns: Z-order curve algorithm.

5 or more columns: Hilbert curve algorithm.




The threshold of 5 columns is based on Ververica's practice with Paimon 
Append Scalable tables, which was also discussed offline with Junhao Ye. 
In addition to automatic configuration, users can fine-tune for specific 
scenarios by explicitly specifying the desired comparison strategy.


WDYT?



Best,

Wencong

















At 2024-04-22 15:08:09, "Jingsong Li" <[email protected]> wrote:
>Hi Wencong,
>
>Mostly looks good to me.
>
>"it will automatically determine the algorithm based on the number of
>columns in 'sink.clustering.by-columns'. "
>
>Please describe this clearly in the `Description`.
>
>Best,
>Jingsong
>
>On Mon, Apr 22, 2024 at 2:36 PM Wencong Liu <[email protected]> wrote:
>>
>> Hi devs,
>>
>>
>>
>>
>> I'm proposing a new feature to introduce range partitioning and sorting in 
>> append scalable table
>>
>> writing for Flink. The goal is to optimize query performance by reducing 
>> data scans on large datasets.
>>
>>
>>
>>
>> The proposal includes:
>>
>>
>>
>>
>> 1. Configurable range partitioning and sorting during data writing which 
>> allows for
>>
>> a more efficient data distribution strategy.
>>
>>
>>
>>
>> 2. Introduction of new configurations that will enable users to specify 
>> columns for
>>
>> comparison, choose a comparison algorithm for range partitioning, and 
>> further sort each
>>
>> partition if required.
>>
>>
>>
>>
>> 3. Detailed explanation of the division of processing steps when range 
>> partitioning
>>
>> is enabled and the conditional inclusion of the sorting phase.
>>
>>
>>
>>
>> Looking forward to discussing this in the upcoming PIP [1].
>>
>>
>>
>>
>> Best regards,
>>
>> Wencong Liu
>>
>>
>>
>>
>> [1] 
>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-21%3A+Introduce+Range+Partition+And+Sort+in+Append+Scalable+Table+Batch+Writing+for+Flink

Reply via email to