Hey team, here is the PR for this feature ->
https://github.com/apache/paimon/pull/7865

Looking forward to getting your feedback!
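For anyone skimming the thread before opening the PR, here is a rough standalone sketch of the commit-time safeguard we discussed below: a writer remembers the per-partition bucket count it observed at initialization, and the commit is rejected if a concurrent rescale changed that count in the meantime (the first scenario in the thread; the second scenario, where the rescale commits last, cannot be caught this way). This is purely illustrative and not the actual Paimon code; the class and method names (`PartitionBucketCheck`, `recordAtWriterStart`, `validateAtCommit`) are made up for the example. The real check lives in `ConflictDetection`, linked earlier in the thread.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only (not Paimon code): a writer records the bucket
 * count it observed per partition at initialization, and the commit path
 * rejects the commit if a concurrent rescale changed that count.
 */
public class PartitionBucketCheck {
    // partition name -> bucket count the writer observed when it started
    private final Map<String, Integer> observedBuckets = new HashMap<>();

    public void recordAtWriterStart(String partition, int numBuckets) {
        observedBuckets.put(partition, numBuckets);
    }

    /** Returns true if the commit is safe, false if a rescale happened in between. */
    public boolean validateAtCommit(String partition, int currentNumBuckets) {
        Integer observed = observedBuckets.get(partition);
        return observed != null && observed == currentNumBuckets;
    }

    public static void main(String[] args) {
        PartitionBucketCheck check = new PartitionBucketCheck();
        check.recordAtWriterStart("dt=2026-02-26", 4);

        // Scenario 1: writer starts, rescale commits (4 -> 8), writer commits -> rejected.
        System.out.println(check.validateAtCommit("dt=2026-02-26", 8)); // false

        // No rescale in between -> commit accepted.
        System.out.println(check.validateAtCommit("dt=2026-02-26", 4)); // true
    }
}
```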

On Thu, Feb 26, 2026 at 5:52 PM Jingsong Li <[email protected]> wrote:

> Look forward to hearing from you.
>
> Best,
> Jingsong
>
> On Thu, Feb 26, 2026 at 2:37 PM Mike Dias <[email protected]> wrote:
> >
> > Oh, that is perfect! Thank you!
> >
> > I think we should be good to add this feature then. We are currently
> testing this patch internally and once we are happy with it, we can submit
> it as a PR to the main repository, if that is okay with you.
> >
> > On Thu, Feb 26, 2026 at 5:33 PM Jingsong Li <[email protected]>
> wrote:
> >>
> >> Hi Mike,
> >>
> >> For the second scenario, here is an option:
> >> 'commit.strict-mode.last-safe-snapshot'. If you are using
> >> RescaleAction, it will set this option to check this scenario.
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Thu, Feb 26, 2026 at 2:27 PM Mike Dias <[email protected]> wrote:
> >> >
> >> > Thanks, Jingsong!
> >> >
> >> > It seems we already check the number of buckets being equal when
> committing here ->
> https://github.com/apache/paimon/blob/e1eeec56954c19ed78fd0bd4a46e0a332443397d/paimon-core/src/main/java/org/apache/paimon/operation/commit/ConflictDetection.java#L219
> .
> >> >
> >> > I think that should capture the first scenario where:
> >> >
> >> > writer starts
> >> > rescale starts
> >> > rescale commits
> >> > writer commits -> fails because the number of buckets changed
> >> >
> >> > I don't think it would address the second scenario where:
> >> >
> >> > rescale starts
> >> > writer starts
> >> > writer commits
> >> > rescale commits -> previous commit is overwritten
> >> >
> >> > Is my understanding correct? Not sure if it is possible to detect the
> second scenario, though... users will need to ensure that no writer is
> running/started during the rescaling process.
> >> >
> >> >
> >> > On Thu, Feb 26, 2026 at 3:24 PM Jingsong Li <[email protected]>
> wrote:
> >> >>
> >> >> Hi Mike,
> >> >>
> >> >> This is a good question.
> >> >>
> >> >> As far as I know, Paimon does not strictly check that all partitions
> >> >> must have the same number of buckets. It is possible to achieve
> >> >> different buckets for different partitions, but it is more complex.
> We
> >> >> may need to scan the manifests when writing to ensure that the number
> >> >> of buckets written to the partitions is the same as before, otherwise
> >> >> it can cause data correctness issues.
> >> >>
> >> >> Best,
> >> >> Jingsong
> >> >>
> >> >> On Mon, Feb 16, 2026 at 1:19 PM Mike Dias via dev <
> [email protected]> wrote:
> >> >> >
> >> >> > Hi Paimon maintainers,
> >> >> >
> >> >> > I'm looking to implement a change that would allow different
> partitions
> >> >> > within a PK fixed-bucket table to have different bucket counts,
> primarily
> >> >> > to support highly skewed partitions with more/fewer buckets.
> >> >> >
> >> >> > We would use dynamic buckets to handle skew, but we really need
> multiple
> >> >> > writers writing to the same active partitions in both streaming
> and batch,
> >> >> > which doesn't seem to be something we could easily support with
> dynamic
> >> >> > buckets without coordinating changes to the bucket index file...
> >> >> >
> >> >> > On the fixed-buckets side, though, it seems we are in a good spot
> to
> >> >> > implement per-partition bucketing, and this rescale doc
> >> >> > <https://paimon.apache.org/docs/1.3/maintenance/rescale-bucket/>
> suggests
> >> >> > we can already do that for partitions that aren't receiving writes.
> >> >> > Unfortunately, our partitions are not time-based, and most of them
> are
> >> >> > always receiving writes...
> >> >> >
> >> >> > Hence, we would need to adapt the current code to allow writers to
> look up
> >> >> > the bucket counts from the manifest partition rather than relying
> on the
> >> >> > global table bucket count.
> >> >> >
> >> >> > That brings me to the following questions:
> >> >> >
> >> >> >    1. *Can we actually do this?:* Are there architectural reasons
> why
> >> >> >    bucket counts must be uniform across all partitions? Are there
> assumptions
> >> >> >    elsewhere in the codebase that depend on a single global bucket
> count?
> >> >> >    2. *Concurrent writers:* If multiple writers are active, they
> each
> >> >> >    independently load the partition bucket mapping at
> initialization, which
> >> >> >    creates a risk of inconsistency if a rescale operation
> completes after some
> >> >> >    writers have loaded their mappings but before others do. This is not too
> different from
> >> >> >    the existing behavior, but with a global bucket count, it is
> much easier to
> >> >> >    safeguard against it. Do you have ideas on how we could
> mitigate this issue
> >> >> >    or warn users against this pitfall?
> >> >> >    3. *Read path:* On the read side, does the scan/split logic
> already
> >> >> >    handle partitions with heterogeneous bucket counts, or would
> changes be
> >> >> >    needed there as well?
> >> >> >
> >> >> >
> >> >> > Any guidance on gotchas or prior art in this area would be greatly
> >> >> > appreciated. Happy to share the full diff or open a draft PR if
> that would
> >> >> > be easier to review.
> >> >> >
> >> >> > --
> >> >> > Thanks,
> >> >> > Mike Dias
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks,
> >> > Mike Dias
> >
> >
> >
> > --
> > Thanks,
> > Mike Dias
>


-- 
Thanks,
Mike Dias
