Re: Feasibility of per-partition instead of per-table bucket count

Jingsong Li Wed, 25 Feb 2026 22:53:14 -0800

Look forward to hearing from you.

Best,
Jingsong


On Thu, Feb 26, 2026 at 2:37 PM Mike Dias <[email protected]> wrote:
>
> Oh, that is perfect! Thank you!
>
> I think we should be good to add this feature then. We are currently testing 
> this patch internally and once we are happy with it, we can submit it as a PR 
> to the main repository, if that is okay with you.
>
> On Thu, Feb 26, 2026 at 5:33 PM Jingsong Li <[email protected]> wrote:
>>
>> Hi Mike,
>>
>> For the second scenario, here is an option:
>> 'commit.strict-mode.last-safe-snapshot'. If you are using
>> RescaleAction, it will set this option to check this scenario.
>>
>> Best,
>> Jingsong
>>
>> On Thu, Feb 26, 2026 at 2:27 PM Mike Dias <[email protected]> wrote:
>> >
>> > Thanks, Jingsong!
>> >
>> > It seems we already check the number of buckets being equal when 
>> > committing here -> 
>> > https://github.com/apache/paimon/blob/e1eeec56954c19ed78fd0bd4a46e0a332443397d/paimon-core/src/main/java/org/apache/paimon/operation/commit/ConflictDetection.java#L219.
>> >
>> > I think that should capture the first scenario where:
>> >
>> > writer starts
>> > rescale starts
>> > rescale commits
>> > writer commits -> fails because the number of buckets changed
>> >
>> > I don't think it would address the second scenario where:
>> >
>> > rescale starts
>> > writer starts
>> > writer commits
>> > rescale commits -> previous commit is overwritten
>> >
>> > Is my understanding correct? Not sure if it is possible to detect the 
>> > second scenario, though... users will need to ensure that no writer is 
>> > running/started during the rescaling process.
>> >
>> >
>> > On Thu, Feb 26, 2026 at 3:24 PM Jingsong Li <[email protected]> wrote:
>> >>
>> >> Hi Mike,
>> >>
>> >> This is a good question.
>> >>
>> >> As far as I know, Paimon does not strictly check that all partitions
>> >> must have the same number of buckets. It is possible to achieve
>> >> different buckets for different partitions, but it is more complex. We
>> >> may need to scan the manifests when writing to ensure that the number
>> >> of buckets written to the partitions is the same as before, otherwise
>> >> it will cause data correctness issues.
>> >>
>> >> Best,
>> >> Jingsong
>> >>
>> >> On Mon, Feb 16, 2026 at 1:19 PM Mike Dias via dev <[email protected]> wrote:
>> >> >
>> >> > Hi Paimon maintainers,
>> >> >
>> >> > I'm looking to implement a change that would allow different partitions
>> >> > within a PK fixed-bucket table to have different bucket counts, primarily
>> >> > to support highly skewed partitions with more/fewer buckets.
>> >> >
>> >> > We would use dynamic buckets to handle skew, but we really need multiple
>> >> > writers writing to the same active partitions in both streaming and batch,
>> >> > which doesn't seem to be something we could easily support with dynamic
>> >> > buckets without coordinating changes to the bucket index file...
>> >> >
>> >> > On the fixed-buckets side, though, it seems we are in a good spot to
>> >> > implement per-partition bucketing, and this rescale doc
>> >> > <https://paimon.apache.org/docs/1.3/maintenance/rescale-bucket/> suggests
>> >> > we can already do that for partitions that aren't receiving writes.
>> >> > Unfortunately, our partitions are not time-based, and most of them are
>> >> > always receiving writes...
>> >> >
>> >> > Hence, we would need to adapt the current code to allow writers to look up
>> >> > the bucket counts from the manifest partition rather than relying on the
>> >> > global table bucket count.
>> >> >
>> >> > That brings me to the following questions:
>> >> >
>> >> >    1. *Can we actually do this?:* Are there architectural reasons why
>> >> >    bucket counts must be uniform across all partitions? Are there assumptions
>> >> >    elsewhere in the codebase that depend on a single global bucket count?
>> >> >    2. *Concurrent writers:* If multiple writers are active, they each
>> >> >    independently load the partition bucket mapping at initialization, which
>> >> >    creates a risk of inconsistency if a rescale operation completes between
>> >> >    when different writers load their mappings. This is not too different from
>> >> >    the existing behavior, but with a global bucket count, it is much easier to
>> >> >    safeguard against it. Do you have ideas on how we could mitigate this issue
>> >> >    or warn users against this pitfall?
>> >> >    3. *Read path:* On the read side, does the scan/split logic already
>> >> >    handle partitions with heterogeneous bucket counts, or would changes be
>> >> >    needed there as well?
>> >> >
>> >> >
>> >> > Any guidance on gotchas or prior art in this area would be greatly
>> >> > appreciated. Happy to share the full diff or open a draft PR if that would
>> >> > be easier to review.
>> >> >
>> >> > --
>> >> > Thanks,
>> >> > Mike Dias
>> >
>> >
>> >
>> > --
>> > Thanks,
>> > Mike Dias
>
>
>
> --
> Thanks,
> Mike Dias
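
For readers following the thread, the two commit orderings discussed above
can be modeled with a small Java sketch. It is a toy, not Paimon code: the
class RescaleConflictSketch, its methods, and the snapshot counter are
hypothetical names made up for illustration. The second check only imitates
the general idea of comparing against a known-safe snapshot; it does not
reproduce the actual semantics of 'commit.strict-mode.last-safe-snapshot',
for which the Paimon configuration docs are the authority.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a table whose partitions each carry their own bucket count. */
final class RescaleConflictSketch {

    /** Thrown when a commit is rejected by one of the two checks below. */
    static final class CommitConflict extends RuntimeException {
        CommitConflict(String message) {
            super(message);
        }
    }

    private final Map<String, Integer> bucketsByPartition = new HashMap<>();
    private long latestSnapshot = 0;

    /** Starts with a single known partition; the sketch never adds others. */
    RescaleConflictSketch(String partition, int buckets) {
        bucketsByPartition.put(partition, buckets);
    }

    long latestSnapshot() {
        return latestSnapshot;
    }

    /** Bucket count a writer would load for a known partition at start-up. */
    int buckets(String partition) {
        return bucketsByPartition.get(partition);
    }

    /** Scenario 1: a writer commit fails if the bucket count changed under it. */
    long commitWrite(String partition, int bucketsSeenAtStart) {
        int current = bucketsByPartition.get(partition);
        if (current != bucketsSeenAtStart) {
            throw new CommitConflict(
                    "bucket count changed from " + bucketsSeenAtStart + " to " + current);
        }
        return ++latestSnapshot;
    }

    /**
     * Scenario 2: a rescale commit overwrites the partition, so it is rejected
     * if any snapshot landed after the one the rescale job started from.
     */
    long commitRescale(String partition, int newBuckets, long lastSafeSnapshot) {
        if (latestSnapshot > lastSafeSnapshot) {
            throw new CommitConflict(
                    "snapshots were committed after safe snapshot " + lastSafeSnapshot);
        }
        bucketsByPartition.put(partition, newBuckets);
        return ++latestSnapshot;
    }
}
```

In the first ordering, a writer that loaded 4 buckets and then calls
commitWrite after a rescale to 8 gets a CommitConflict. In the second, a
rescale that started at snapshot 0 and calls commitRescale after a writer has
committed is rejected instead of silently overwriting that commit. A real
implementation would scope both checks to the affected partitions rather than
the whole table.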
