wombatu-kun commented on issue #16238: URL: https://github.com/apache/iceberg/issues/16238#issuecomment-4533267778
I looked into this and can confirm it's a divergence between two default partition-name generators in core/api, not a Spark-side issue. - Table creation (including Spark `createOrReplace`, which routes through `Spark3Util.toPartitionSpec` → `PartitionSpec.Builder.bucket(...)`) generates `col_bucket` / `col_trunc` — without the parameter. - `ALTER TABLE ADD PARTITION FIELD` routes through `BaseUpdatePartitionSpec.PartitionNameGenerator`, which generates `col_bucket_<n>` / `col_trunc_<width>` — with the parameter. - Time transforms (`year`/`month`/`day`/`hour`) already agree across both paths. - For reference, the spec's own example uses the no-parameter form: the partition-spec example in `format/spec.md` shows `"name": "id_bucket"` for a `bucket[16]` transform. Both forms are currently asserted by tests on each side, so this looks like a deliberate-but-unreconciled difference rather than an accidental bug, and standardizing it would be a user-visible, cross-engine behavior change. Before anyone implements a fix, it would be good to settle the direction: 1. Align `ALTER` to the creation/spec form (`col_bucket`). Smaller footprint and matches the spec example. The `_<n>` suffix in `BaseUpdatePartitionSpec` mainly disambiguates adding two different bucket widths on the same source column within the current spec (the case pinned by `testAddMultipleBuckets`); collisions with previously-dropped (void) fields are already handled separately by renaming them to `name_<fieldId>`. So this direction could keep the bare `col_bucket` name and append `_<n>` only on an actual name conflict, which preserves the multi-width case while making the common `ALTER` match creation. 2. Align creation to the `ALTER` form (`col_bucket_<n>`), which is the preference noted in this issue and makes the builder strictly more expressive. The downside is that it changes the default name for the most common path across every engine, deviates from the spec example, and touches a large number of existing tests. Separately, the "add multiple widths on the same source column" behavior is currently covered by a positive test only for `bucket` (`testAddMultipleBuckets`), not for `truncate`, even though both behave identically. Could a maintainer weigh in on the preferred direction (or whether this is intended and should just be documented)? Happy to put up the PR once the direction is decided. cc @rdblue @RussellSpitzer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
