Re: Feasibility of per-partition instead of per-table bucket count

Mike Dias via dev Wed, 25 Feb 2026 22:38:22 -0800

Oh, that is perfect! Thank you!

I think we should be good to add this feature then. We are currently
testing this patch internally and once we are happy with it, we can submit
it as a PR to the main repository, if that is okay with you.


On Thu, Feb 26, 2026 at 5:33 PM Jingsong Li <[email protected]> wrote:

> Hi Mike,
>
> For the second scenario, here is an option:
> 'commit.strict-mode.last-safe-snapshot'. If you are using
> RescaleAction, it will set this option to check this scenario.
>
> Best,
> Jingsong
>
> On Thu, Feb 26, 2026 at 2:27 PM Mike Dias <[email protected]> wrote:
> >
> > Thanks, Jingsong!
> >
> > It seems we already check the number of buckets being equal when
> > committing here ->
> > https://github.com/apache/paimon/blob/e1eeec56954c19ed78fd0bd4a46e0a332443397d/paimon-core/src/main/java/org/apache/paimon/operation/commit/ConflictDetection.java#L219
> >
> > I think that should capture the first scenario where:
> >
> > writer starts
> > rescale starts
> > rescale commits
> > writer commits -> fails because the number of buckets changed
> >
> > I don't think it would address the second scenario where:
> >
> > rescale starts
> > writer starts
> > writer commits
> > rescale commits -> previous commit is overwritten
> >
> > Is my understanding correct? Not sure if it is possible to detect the
> > second scenario, though... users will need to ensure that no writer is
> > running/started during the rescaling process.
> >
> >
> > On Thu, Feb 26, 2026 at 3:24 PM Jingsong Li <[email protected]> wrote:
> >>
> >> Hi Mike,
> >>
> >> This is a good question.
> >>
> >> As far as I know, Paimon does not strictly check that all partitions
> >> have the same number of buckets. It is possible to have different
> >> bucket counts for different partitions, but it is more complex. We
> >> may need to scan the manifests when writing to ensure that the number
> >> of buckets written to each partition is the same as before; otherwise
> >> it will cause data correctness issues.
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Mon, Feb 16, 2026 at 1:19 PM Mike Dias via dev <[email protected]> wrote:
> >> >
> >> > Hi Paimon maintainers,
> >> >
> >> > I'm looking to implement a change that would allow different partitions
> >> > within a PK fixed-bucket table to have different bucket counts, primarily
> >> > to support highly skewed partitions with more/fewer buckets.
> >> >
> >> > We would use dynamic buckets to handle skew, but we really need multiple
> >> > writers writing to the same active partitions in both streaming and batch,
> >> > which doesn't seem to be something we could easily support with dynamic
> >> > buckets without coordinating changes to the bucket index file...
> >> >
> >> > On the fixed-buckets side, though, it seems we are in a good spot to
> >> > implement per-partition bucketing, and this rescale doc
> >> > <https://paimon.apache.org/docs/1.3/maintenance/rescale-bucket/> suggests
> >> > we can already do that for partitions that aren't receiving writes.
> >> > Unfortunately, our partitions are not time-based, and most of them are
> >> > always receiving writes...
> >> >
> >> > Hence, we would need to adapt the current code to allow writers to look up
> >> > the bucket counts from the manifest partition rather than relying on the
> >> > global table bucket count.
> >> >
> >> > That brings me to the following questions:
> >> >
> >> >    1. *Can we actually do this?:* Are there architectural reasons why
> >> >    bucket counts must be uniform across all partitions? Are there assumptions
> >> >    elsewhere in the codebase that depend on a single global bucket count?
> >> >    2. *Concurrent writers:* If multiple writers are active, they each
> >> >    independently load the partition bucket mapping at initialization, which
> >> >    creates a risk of inconsistency if a rescale operation completes between
> >> >    when different writers load their mappings. This is not too different from
> >> >    the existing behavior, but with a global bucket count, it is much easier to
> >> >    safeguard against it. Do you have ideas on how we could mitigate this issue
> >> >    or warn users against this pitfall?
> >> >    3. *Read path:* On the read side, does the scan/split logic already
> >> >    handle partitions with heterogeneous bucket counts, or would changes be
> >> >    needed there as well?
> >> >
> >> >
> >> > Any guidance on gotchas or prior art in this area would be greatly
> >> > appreciated. Happy to share the full diff or open a draft PR if that would
> >> > be easier to review.
> >> >
> >> > --
> >> > Thanks,
> >> > Mike Dias
> >
> >
> >
> > --
> > Thanks,
> > Mike Dias
>


-- 
Thanks,
Mike Dias
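
The two commit orderings discussed in this thread can be illustrated with a small toy model. This is only a sketch under stated assumptions: Table, PendingCommit, start and tryCommit are hypothetical names invented for illustration and are not Paimon classes or APIs. The first check mimics the idea of comparing bucket counts at commit time; the second is loosely inspired by the idea behind 'commit.strict-mode.last-safe-snapshot' and does not reproduce how Paimon actually implements that option. It is also deliberately coarse: it rejects an overwrite if any snapshot at all was committed after the job started.

```java
import java.util.HashMap;
import java.util.Map;

public class RescaleConflictSketch {

    /** Toy table state (hypothetical, not a Paimon class). */
    static final class Table {
        long latestSnapshot = 0;
        final Map<String, Integer> bucketsByPartition = new HashMap<>();
    }

    /** What a job captured when it started, plus how it wants to commit (hypothetical). */
    static final class PendingCommit {
        final String partition;
        final int totalBuckets;   // bucket count the job wrote its files with
        final long startSnapshot; // latest snapshot the job saw when it started
        final boolean overwrite;  // true for a rescale-style overwrite

        PendingCommit(String partition, int totalBuckets, long startSnapshot, boolean overwrite) {
            this.partition = partition;
            this.totalBuckets = totalBuckets;
            this.startSnapshot = startSnapshot;
            this.overwrite = overwrite;
        }
    }

    /** A job "starts" by remembering the snapshot that is current right now. */
    static PendingCommit start(Table t, String partition, int buckets, boolean overwrite) {
        return new PendingCommit(partition, buckets, t.latestSnapshot, overwrite);
    }

    /** Returns true if the commit is accepted, false if it is rejected as a conflict. */
    static boolean tryCommit(Table table, PendingCommit c) {
        Integer current = table.bucketsByPartition.get(c.partition);
        // Scenario 1 guard: an append must match the bucket count already recorded.
        if (!c.overwrite && current != null && current != c.totalBuckets) {
            return false;
        }
        // Scenario 2 guard: an overwrite must not land on top of newer snapshots.
        if (c.overwrite && table.latestSnapshot > c.startSnapshot) {
            return false;
        }
        table.bucketsByPartition.put(c.partition, c.totalBuckets);
        table.latestSnapshot++;
        return true;
    }

    /** Scenario 1: writer starts, rescale starts, rescale commits, writer commits. */
    static boolean writerCommitsAfterRescale() {
        Table t = new Table();
        t.bucketsByPartition.put("p1", 4);
        PendingCommit writer = start(t, "p1", 4, false);
        PendingCommit rescale = start(t, "p1", 8, true);
        tryCommit(t, rescale);
        return tryCommit(t, writer);
    }

    /** Scenario 2: rescale starts, writer starts, writer commits, rescale commits. */
    static boolean rescaleCommitsAfterWriter() {
        Table t = new Table();
        t.bucketsByPartition.put("p1", 4);
        PendingCommit rescale = start(t, "p1", 8, true);
        PendingCommit writer = start(t, "p1", 4, false);
        tryCommit(t, writer);
        return tryCommit(t, rescale);
    }

    public static void main(String[] args) {
        System.out.println("writer commit after rescale accepted: " + writerCommitsAfterRescale());
        System.out.println("rescale commit after writer accepted: " + rescaleCommitsAfterWriter());
    }
}
```

In this model both late commits are rejected: the writer in the first ordering fails the bucket-count comparison, and the rescale in the second ordering fails because a snapshot was committed after it started. Without the second guard, the rescale would be accepted and would overwrite the writer's commit, which is the hazard described above.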
