Re: Feasibility of per-partition instead of per-table bucket count

Mike Dias via dev Wed, 25 Feb 2026 22:28:22 -0800

Thanks, Jingsong!

It seems we already check at commit time that the number of buckets is
unchanged:
https://github.com/apache/paimon/blob/e1eeec56954c19ed78fd0bd4a46e0a332443397d/paimon-core/src/main/java/org/apache/paimon/operation/commit/ConflictDetection.java#L219


I think that should capture the first scenario where:

   - writer starts
   - rescale starts
   - rescale commits
   - writer commits -> fails because the number of buckets changed

I don't think it would address the second scenario where:

   - rescale starts
   - writer starts
   - writer commits
   - rescale commits -> previous commit is overwritten

Is my understanding correct? I'm not sure it is possible to detect the
second scenario, though. Users would need to ensure that no writer is
running or started during the rescaling process.
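
To make the two interleavings above concrete, here is a toy Python model
of them. It is only a sketch of my reading of the problem, not Paimon
code: ToyTable, Writer, Rescale and BucketMismatch are hypothetical names,
the table has a single partition, and the commit check is reduced to a
plain bucket-count comparison. The rescale is modeled as an overwrite of
the files it read when it started, which is an assumption.

```python
from dataclasses import dataclass, field


class BucketMismatch(Exception):
    """Raised when a writer commits against a changed bucket count."""


@dataclass
class ToyTable:
    # Hypothetical stand-in for one partition: bucket count + committed files.
    num_buckets: int = 4
    files: set = field(default_factory=set)


class Writer:
    def __init__(self, table):
        self.table = table
        self.seen_buckets = None

    def start(self):
        # The writer loads the bucket count once, at initialization.
        self.seen_buckets = self.table.num_buckets

    def commit(self, file_name):
        # Simplified commit-time check: bucket count must be unchanged.
        if self.seen_buckets != self.table.num_buckets:
            raise BucketMismatch(
                f"expected {self.seen_buckets}, found {self.table.num_buckets}"
            )
        self.table.files.add(file_name)


class Rescale:
    def __init__(self, table, new_buckets):
        self.table = table
        self.new_buckets = new_buckets
        self.files_at_start = frozenset()

    def start(self):
        # The rescale job reads the files that exist when it starts.
        self.files_at_start = frozenset(self.table.files)

    def commit(self):
        # Assumed overwrite semantics: only the files read at start survive,
        # rewritten under the new bucket count.
        self.table.files = {f"rescaled/{name}" for name in self.files_at_start}
        self.table.num_buckets = self.new_buckets


def scenario_one():
    table = ToyTable(files={"f1"})
    writer, rescale = Writer(table), Rescale(table, new_buckets=8)
    writer.start()
    rescale.start()
    rescale.commit()
    try:
        writer.commit("f2")
    except BucketMismatch:
        return "writer rejected"
    return "writer accepted"


def scenario_two():
    table = ToyTable(files={"f1"})
    writer, rescale = Writer(table), Rescale(table, new_buckets=8)
    rescale.start()
    writer.start()
    writer.commit("f2")  # passes: the bucket count has not changed yet
    rescale.commit()     # overwrites the partition, dropping f2
    return sorted(table.files)
```

In this model scenario_one() ends with the writer being rejected, while
scenario_two() leaves only the rescaled copy of f1, so the writer's f2 is
silently lost. Catching that case would presumably need the rescale commit
to notice files added after it started, which the bucket-count comparison
alone cannot do.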


On Thu, Feb 26, 2026 at 3:24 PM Jingsong Li <[email protected]> wrote:

> Hi Mike,
>
> This is a good question.
>
> As far as I know, Paimon does not strictly check that all partitions
> must have the same number of buckets. It is possible to achieve
> different buckets for different partitions, but it is more complex. We
> may need to scan the manifests when writing to ensure that the number
> of buckets written to the partitions is the same as before, otherwise
> it will cause inconsistent data correctness issues.
>
> Best,
> Jingsong
>
> On Mon, Feb 16, 2026 at 1:19 PM Mike Dias via dev <[email protected]>
> wrote:
> >
> > Hi Paimon maintainers,
> >
> > I'm looking to implement a change that would allow different partitions
> > within a PK fixed-bucket table to have different bucket counts, primarily
> > to support highly skewed partitions with more/fewer buckets.
> >
> > We would use dynamic buckets to handle skew, but we really need multiple
> > writers writing to the same active partitions in both streaming and batch,
> > which doesn't seem to be something we could easily support with dynamic
> > buckets without coordinating changes to the bucket index file...
> >
> > On the fixed-buckets side, though, it seems we are in a good spot to
> > implement per-partition bucketing, and this rescale doc
> > <https://paimon.apache.org/docs/1.3/maintenance/rescale-bucket/> suggests
> > we can already do that for partitions that aren't receiving writes.
> > Unfortunately, our partitions are not time-based, and most of them are
> > always receiving writes...
> >
> > Hence, we would need to adapt the current code to allow writers to look up
> > the bucket counts from the manifest partition rather than relying on the
> > global table bucket count.
> >
> > That brings me to the following questions:
> >
> >    1. *Can we actually do this?:* Are there architectural reasons why
> >    bucket counts must be uniform across all partitions? Are there
> >    assumptions elsewhere in the codebase that depend on a single global
> >    bucket count?
> >    2. *Concurrent writers:* If multiple writers are active, they each
> >    independently load the partition bucket mapping at initialization,
> >    which creates a risk of inconsistency if a rescale operation completes
> >    between when different writers load their mappings. This is not too
> >    different from the existing behavior, but with a global bucket count,
> >    it is much easier to safeguard against it. Do you have ideas on how we
> >    could mitigate this issue or warn users against this pitfall?
> >    3. *Read path:* On the read side, does the scan/split logic already
> >    handle partitions with heterogeneous bucket counts, or would changes
> >    be needed there as well?
> >
> >
> > Any guidance on gotchas or prior art in this area would be greatly
> > appreciated. Happy to share the full diff or open a draft PR if that
> > would be easier to review.
> >
> > --
> > Thanks,
> > Mike Dias
>


-- 
Thanks,
Mike Dias
