Hi,Wencong I have two small questions. 1. Add record will be emitted from `Upstream Operator` to `Local Sample`? If not, what is the sample rule? 2. From pip, I infer that the record in `Assign Range Index` should wait for the broadcast result from `Global Sample`,So How long do they wait? Until all records have been consumed by `Local Sample` or not?
Best, wangwj On Mon, Apr 22, 2024 at 6:20 PM Jingsong Li <[email protected]> wrote: > > +1 for your proposal. > > You can add to the description. > > Best, > Jingsong > > On Mon, Apr 22, 2024 at 6:15 PM Wencong Liu <[email protected]> wrote: > > > > Hi Jinsong, > > > > > > > > > > This topic requires discussion, hence it wasn't directly addressed in the > > PIP. > > > > > > > > I believe the type of sorting algorithm to use depends on the number of > > fields specified by the user for comparison. When only one comparison field > > is > > specified, it's best to use basic data types for direct comparison for the > > most accurate > > results. For multiple comparison fields, both the Z-order curve and Hilbert > > curve algorithms > > are suitable. In such cases, data maintains a certain level of order in any > > comparison > > field. Generally, the computation cost of the Z-order curve algorithm is > > lower > > than that of the Hilbert curve algorithm. However, in high-dimensional > > scenarios, the Hilbert curve has an advantage in maintaining data > > clustering. > > > > > > Therefore, I propose an automatic selection based on the number of > > comparison columns: > > > > > > > > > > 1 column: Basic type comparison algorithm. > > > > Less than 5 columns: Z-order curve algorithm. > > > > 5 or more columns: Hilbert curve algorithm. > > > > > > > > > > The threshold of 5 columns is based on Ververica's practice with Paimon > > Append Scalable tables, which was also discussed offline with Junhao Ye. > > In addition to automatic configuration, users can fine-tune for specific > > scenarios by explicitly specifying the desired comparison strategy. > > > > > > WDYT? > > > > > > > > Best, > > > > Wencong > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > At 2024-04-22 15:08:09, "Jingsong Li" <[email protected]> wrote: > > >Hi Wencong, > > > > > >Mostly looks good to me. > > > > > >"it will automatically determine the algorithm based on the number of > > >columns in 'sink.clustering.by-columns'. " > > > > > >Please describe this clearly in the `Description`. > > > > > >Best, > > >Jingsong > > > > > >On Mon, Apr 22, 2024 at 2:36 PM Wencong Liu <[email protected]> wrote: > > >> > > >> Hi devs, > > >> > > >> > > >> > > >> > > >> I'm proposing a new feature to introduce range partitioning and sorting > > >> in append scalable table > > >> > > >> writing for Flink. The goal is to optimize query performance by reducing > > >> data scans on large datasets. > > >> > > >> > > >> > > >> > > >> The proposal includes: > > >> > > >> > > >> > > >> > > >> 1. Configurable range partitioning and sorting during data writing which > > >> allows for > > >> > > >> a more efficient data distribution strategy. > > >> > > >> > > >> > > >> > > >> 2. Introduction of new configurations that will enable users to specify > > >> columns for > > >> > > >> comparison, choose a comparison algorithm for range partitioning, and > > >> further sort each > > >> > > >> partition if required. > > >> > > >> > > >> > > >> > > >> 3. Detailed explanation of the division of processing steps when range > > >> partitioning > > >> > > >> is enabled and the conditional inclusion of the sorting phase. > > >> > > >> > > >> > > >> > > >> Looking forward to discussing this in the upcoming PIP [1]. > > >> > > >> > > >> > > >> > > >> Best regards, > > >> > > >> Wencong Liu > > >> > > >> > > >> > > >> > > >> [1] > > >> https://cwiki.apache.org/confluence/display/PAIMON/PIP-21%3A+Introduce+Range+Partition+And+Sort+in+Append+Scalable+Table+Batch+Writing+for+Flink
