+1 for your proposal. You can add to the description.
Best,
Jingsong

On Mon, Apr 22, 2024 at 6:15 PM Wencong Liu <[email protected]> wrote:
>
> Hi Jingsong,
>
> > This topic requires discussion, hence it wasn't directly addressed in the PIP.
>
> I believe the choice of sorting algorithm depends on the number of fields
> the user specifies for comparison. When only one comparison field is
> specified, it's best to compare the basic data type directly for the most
> accurate results. For multiple comparison fields, both the Z-order curve
> and Hilbert curve algorithms are suitable: data then maintains a certain
> level of order in every comparison field. In general, the computation cost
> of the Z-order curve algorithm is lower than that of the Hilbert curve
> algorithm; however, in high-dimensional scenarios, the Hilbert curve has
> an advantage in preserving data clustering.
>
> Therefore, I propose an automatic selection based on the number of
> comparison columns:
>
> 1 column: basic type comparison algorithm.
> Fewer than 5 columns: Z-order curve algorithm.
> 5 or more columns: Hilbert curve algorithm.
>
> The threshold of 5 columns is based on Ververica's practice with Paimon
> Append Scalable tables, which was also discussed offline with Junhao Ye.
> In addition to the automatic configuration, users can fine-tune specific
> scenarios by explicitly specifying the desired comparison strategy.
>
> WDYT?
>
> Best,
> Wencong
>
> At 2024-04-22 15:08:09, "Jingsong Li" <[email protected]> wrote:
> >Hi Wencong,
> >
> >Mostly looks good to me.
> >
> >"it will automatically determine the algorithm based on the number of
> >columns in 'sink.clustering.by-columns'."
> >
> >Please describe this clearly in the `Description`.
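[Editor's note: the selection heuristic discussed above could be sketched as follows. This is a minimal illustration of the proposed thresholds, not code from the Paimon codebase; the class, enum, and method names are hypothetical.]

```java
// Hypothetical sketch of the automatic algorithm selection proposed in the
// thread: 1 column -> direct comparison, 2-4 columns -> Z-order curve,
// 5 or more columns -> Hilbert curve. Names are illustrative only.
public class ClusterStrategySelector {

    enum Algorithm {
        ORDER,   // direct comparison on the single column's basic type
        ZORDER,  // cheaper to compute for a small number of dimensions
        HILBERT  // better preserves clustering in high dimensions
    }

    static Algorithm select(int numClusteringColumns) {
        if (numClusteringColumns <= 0) {
            throw new IllegalArgumentException(
                    "At least one clustering column is required.");
        }
        if (numClusteringColumns == 1) {
            return Algorithm.ORDER;
        }
        if (numClusteringColumns < 5) {
            return Algorithm.ZORDER;
        }
        return Algorithm.HILBERT;
    }
}
```

As discussed in the thread, an explicit user-specified strategy would take precedence over this automatic choice.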
> > > >Best, > >Jingsong > > > >On Mon, Apr 22, 2024 at 2:36 PM Wencong Liu <[email protected]> wrote: > >> > >> Hi devs, > >> > >> > >> > >> > >> I'm proposing a new feature to introduce range partitioning and sorting in > >> append scalable table > >> > >> writing for Flink. The goal is to optimize query performance by reducing > >> data scans on large datasets. > >> > >> > >> > >> > >> The proposal includes: > >> > >> > >> > >> > >> 1. Configurable range partitioning and sorting during data writing which > >> allows for > >> > >> a more efficient data distribution strategy. > >> > >> > >> > >> > >> 2. Introduction of new configurations that will enable users to specify > >> columns for > >> > >> comparison, choose a comparison algorithm for range partitioning, and > >> further sort each > >> > >> partition if required. > >> > >> > >> > >> > >> 3. Detailed explanation of the division of processing steps when range > >> partitioning > >> > >> is enabled and the conditional inclusion of the sorting phase. > >> > >> > >> > >> > >> Looking forward to discussing this in the upcoming PIP [1]. > >> > >> > >> > >> > >> Best regards, > >> > >> Wencong Liu > >> > >> > >> > >> > >> [1] > >> https://cwiki.apache.org/confluence/display/PAIMON/PIP-21%3A+Introduce+Range+Partition+And+Sort+in+Append+Scalable+Table+Batch+Writing+for+Flink
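[Editor's note: for readers following the thread, usage could look roughly like the Flink SQL fragment below. Only `sink.clustering.by-columns` is named in the thread; the table and column names are hypothetical placeholders, and any additional option keys from the PIP are not shown here.]

```sql
-- Hypothetical sketch: cluster an append scalable table's batch write by
-- three columns; the automatic selection discussed above would then pick
-- the Z-order curve (fewer than 5 comparison columns).
INSERT INTO target_table /*+ OPTIONS(
    'sink.clustering.by-columns' = 'col_a,col_b,col_c'
) */
SELECT * FROM source_table;
```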
