+1 for your proposal. You can add to the description.
Best,
Jingsong

On Mon, Apr 22, 2024 at 6:15 PM Wencong Liu <[email protected]> wrote:
>
> Hi Jingsong,
>
> > This topic requires discussion, hence it wasn't directly addressed in the PIP.
>
> I believe the choice of sorting algorithm depends on the number of fields
> the user specifies for comparison. When only one comparison field is
> specified, it's best to compare the basic data type directly for the most
> accurate results. For multiple comparison fields, both the Z-order curve
> and Hilbert curve algorithms are suitable: data then maintains a certain
> level of order in every comparison field. In general, the computation cost
> of the Z-order curve algorithm is lower than that of the Hilbert curve
> algorithm; however, in high-dimensional scenarios, the Hilbert curve has
> an advantage in preserving data clustering.
>
> Therefore, I propose an automatic selection based on the number of
> comparison columns:
>
> 1 column: basic type comparison algorithm.
> Fewer than 5 columns: Z-order curve algorithm.
> 5 or more columns: Hilbert curve algorithm.
>
> The threshold of 5 columns is based on Ververica's practice with Paimon
> Append Scalable tables, which was also discussed offline with Junhao Ye.
> In addition to the automatic configuration, users can fine-tune specific
> scenarios by explicitly specifying the desired comparison strategy.
>
> WDYT?
>
> Best,
> Wencong
>
> At 2024-04-22 15:08:09, "Jingsong Li" <[email protected]> wrote:
> >Hi Wencong,
> >
> >Mostly looks good to me.
> >
> >"it will automatically determine the algorithm based on the number of
> >columns in 'sink.clustering.by-columns'."
> >
> >Please describe this clearly in the `Description`.
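[Editor's note: the selection heuristic discussed above could be sketched as follows. This is a minimal illustration of the proposed thresholds, not code from the Paimon codebase; the class, enum, and method names are hypothetical.]

```java
// Hypothetical sketch of the automatic algorithm selection proposed in the
// thread: 1 column -> direct comparison, 2-4 columns -> Z-order curve,
// 5 or more columns -> Hilbert curve. Names are illustrative only.
public class ClusterStrategySelector {

    enum Algorithm {
        ORDER,   // direct comparison on the single column's basic type
        ZORDER,  // cheaper to compute for a small number of dimensions
        HILBERT  // better preserves clustering in high dimensions
    }

    static Algorithm select(int numClusteringColumns) {
        if (numClusteringColumns <= 0) {
            throw new IllegalArgumentException(
                    "At least one clustering column is required.");
        }
        if (numClusteringColumns == 1) {
            return Algorithm.ORDER;
        }
        if (numClusteringColumns < 5) {
            return Algorithm.ZORDER;
        }
        return Algorithm.HILBERT;
    }
}
```

As discussed in the thread, an explicit user-specified strategy would take precedence over this automatic choice.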
> > > >Best, > >Jingsong > > > >On Mon, Apr 22, 2024 at 2:36 PM Wencong Liu <[email protected]> wrote: > >> > >> Hi devs, > >> > >> > >> > >> > >> I'm proposing a new feature to introduce range partitioning and sorting in > >> append scalable table > >> > >> writing for Flink. The goal is to optimize query performance by reducing > >> data scans on large datasets. > >> > >> > >> > >> > >> The proposal includes: > >> > >> > >> > >> > >> 1. Configurable range partitioning and sorting during data writing which > >> allows for > >> > >> a more efficient data distribution strategy. > >> > >> > >> > >> > >> 2. Introduction of new configurations that will enable users to specify > >> columns for > >> > >> comparison, choose a comparison algorithm for range partitioning, and > >> further sort each > >> > >> partition if required. > >> > >> > >> > >> > >> 3. Detailed explanation of the division of processing steps when range > >> partitioning > >> > >> is enabled and the conditional inclusion of the sorting phase. > >> > >> > >> > >> > >> Looking forward to discussing this in the upcoming PIP [1]. > >> > >> > >> > >> > >> Best regards, > >> > >> Wencong Liu > >> > >> > >> > >> > >> [1] > >> https://cwiki.apache.org/confluence/display/PAIMON/PIP-21%3A+Introduce+Range+Partition+And+Sort+in+Append+Scalable+Table+Batch+Writing+for+Flink
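[Editor's note: for readers following the thread, usage could look roughly like the Flink SQL fragment below. Only `sink.clustering.by-columns` is named in the thread; the table and column names are hypothetical placeholders, and any additional option keys from the PIP are not shown here.]

```sql
-- Hypothetical sketch: cluster an append scalable table's batch write by
-- three columns; the automatic selection discussed above would then pick
-- the Z-order curve (fewer than 5 comparison columns).
INSERT INTO target_table /*+ OPTIONS(
    'sink.clustering.by-columns' = 'col_a,col_b,col_c'
) */
SELECT * FROM source_table;
```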
