nsivabalan opened a new pull request, #19047:
URL: https://github.com/apache/hudi/pull/19047
### Describe the issue this Pull Request addresses
Today the only way to opt out of populating Hudi's five meta columns is the
all-or-nothing `hoodie.populate.meta.fields=false`. That saves storage but
disables incremental queries (which require `_hoodie_commit_time`).
A community user surfaced this trade-off (#18383, also discussed at #17959).
The concrete ask was: "give me the storage saving without giving up incremental
queries." A separate exploratory PR (#18384) attempted a fully orthogonal
exclude-list with per-field branching across the writer/reader paths; that
change grew to ~2300 lines across 87 files. This PR proposes a simpler,
scoped alternative: three named modes instead of the full 2^5 matrix.
Closes #18383.
### Summary and Changelog
Adds an additive opt-in flag, `hoodie.meta.fields.commit.time.enabled`, that
— when set together with `hoodie.populate.meta.fields=false` — additionally
populates `_hoodie_commit_time` so incremental queries remain functional. The
remaining four meta columns stay null on disk, preserving the storage saving.
The three resulting modes:
| `populate.meta.fields` | `meta.fields.commit.time.enabled` | Effective mode |
|---|---|---|
| `true` (default) | ignored | **ALL**: today's default |
| `false` | `false` (default) | **NONE**: today's `populate.meta.fields=false` |
| `false` | `true` | **COMMIT_TIME_ONLY**: new |
| `true` | `true` | rejected at writer init (ambiguous) |
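
The mode table reduces to a small resolution function. The following is a minimal Java sketch, not code from this PR: `MetaFieldsModeSketch`, the `Mode` enum, and `resolve` are hypothetical names used only for illustration. It models the writer-side view, in which the `true` + `true` combination is rejected rather than silently mapped to a mode.

```java
/** Illustrative only: these names are hypothetical and do not appear in the PR. */
final class MetaFieldsModeSketch {

  /** The three effective modes from the table. */
  enum Mode { ALL, NONE, COMMIT_TIME_ONLY }

  private MetaFieldsModeSketch() {}

  /**
   * Resolves the effective mode from the two boolean settings.
   * Mirrors the writer-side validation: populate=true together with
   * commitTimeEnabled=true is treated as ambiguous and rejected.
   */
  static Mode resolve(boolean populateMetaFields, boolean commitTimeEnabled) {
    if (populateMetaFields && commitTimeEnabled) {
      throw new IllegalArgumentException(
          "hoodie.meta.fields.commit.time.enabled=true requires "
              + "hoodie.populate.meta.fields=false");
    }
    if (populateMetaFields) {
      return Mode.ALL;
    }
    // populateMetaFields is false here: the new flag selects between the two remaining modes.
    return commitTimeEnabled ? Mode.COMMIT_TIME_ONLY : Mode.NONE;
  }
}
```

For example, `MetaFieldsModeSketch.resolve(false, true)` yields `COMMIT_TIME_ONLY`, while `resolve(true, true)` throws.
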
#### Why a separate boolean instead of a single enum
- **Bit-identical backward compatibility.** Every existing table on disk
resolves to ALL or NONE without any new property being read. No reader-side
migration. No precedence rules.
- **Pre-1.3.0 readers behave correctly.** They don't know the new property
exists. They open a COMMIT_TIME_ONLY table, see `populate.meta.fields=false`,
and behave as a NONE reader — they cannot do incremental queries on the table,
but they don't produce silent wrong results either.
- **Encodes "additive" structurally.** The new flag only modifies a NONE
table — it's literally a NONE table plus one populated column. Most code paths
that branch on `populate.meta.fields` keep working unchanged; only paths that
specifically need commit_time consult the new accessor.
#### Plug points
**Config + accessors (`hudi-common` / `hudi-client-common`):**
- New `HoodieTableConfig.META_FIELDS_COMMIT_TIME_ENABLED` property.
- New accessors: `isCommitTimeOnlyMetaFieldsMode()`,
`isCommitTimePopulated()`, `isRecordKeyPopulated()` — three named predicates.
- `HoodieWriteConfig` pass-throughs +
`Builder.withMetaFieldsCommitTimeEnabled()`.
- `HoodieWriteConfig.validate()` rejects the `populate=true` +
`commit.time=true` combination at build time.
- `HoodieTableMetaClient.TableBuilder.setMetaFieldsCommitTimeEnabled()`
persists the flag on `hoodie.properties` at table init.
- `HoodieSparkSqlWriter` wires both fresh-table and bootstrap creation paths.
**Writer engines:**
- `HoodieAvroParquetWriter`, `HoodieSparkParquetWriter`,
`HoodieRowCreateHandle` each gain a `commitTimeOnly` constructor overload. When
`commitTimeOnly && !populateMetaFields`, they populate `_hoodie_commit_time`
and the derived seq id; the other four columns stay null. Bloom-filter /
record-key index registration is intentionally skipped (the record-key column
is not populated).
**Read path (incremental query rejection):**
- `IncrementalRelationV1/V2`, `MergeOnReadIncrementalRelationV1/V2` now
check `isCommitTimePopulated()` rather than `populateMetaFields()` —
COMMIT_TIME_ONLY tables are accepted, NONE tables remain rejected with a
clearer message.
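
The read-path change can be pictured as swapping one predicate for another. The sketch below is an assumption-laden illustration, not the relation code itself: `IncrementalGateSketch` and `checkIncrementalQuerySupported` are hypothetical names, the exception type is chosen only for the example, and the real `isCommitTimePopulated()` accessor takes no arguments (it reads the table config).

```java
/** Illustrative only: a stand-in for the check the incremental relations perform. */
final class IncrementalGateSketch {

  private IncrementalGateSketch() {}

  /** Commit time is on disk in ALL mode and in COMMIT_TIME_ONLY mode. */
  static boolean isCommitTimePopulated(boolean populateMetaFields, boolean commitTimeEnabled) {
    return populateMetaFields || commitTimeEnabled;
  }

  /** Throws when the table cannot serve incremental queries (NONE mode). */
  static void checkIncrementalQuerySupported(boolean populateMetaFields,
                                             boolean commitTimeEnabled) {
    if (!isCommitTimePopulated(populateMetaFields, commitTimeEnabled)) {
      throw new UnsupportedOperationException(
          "Incremental queries require _hoodie_commit_time; set "
              + "hoodie.populate.meta.fields=true or "
              + "hoodie.meta.fields.commit.time.enabled=true at table creation");
    }
  }
}
```

A reader that does not know the new property effectively evaluates this check with `commitTimeEnabled = false`, which is why it treats a COMMIT_TIME_ONLY table as NONE and rejects the incremental query instead of returning wrong results.
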
#### Scope
- ✅ Spark Avro / Spark Row writer paths.
- ✅ Spark Parquet bulk-insert.
- ✅ Incremental query rejection logic across V1 / V2 / CoW / MoR.
- ❌ Flink RowData writer — out of scope for this patch; behaves as NONE
under COMMIT_TIME_ONLY (no commit_time populated). Tracked as a follow-up.
- ❌ ORC / HFile writers — ORC continues to populate all meta fields
unconditionally (legacy behavior); HFile is used only by MDT which is always
ALL.
### Impact
- **Storage layout**: no change for tables that don't opt in. New optional
mode for tables that do. Default behavior unchanged.
- **API**: no public API breakage. New table property, new accessors, new
builder method — all additive.
- **Configuration**:
  - `hoodie.meta.fields.commit.time.enabled` (default `false`). Only
    meaningful when `hoodie.populate.meta.fields=false`. Persisted on
    `hoodie.properties` at table init.
- **Performance**: in the new mode, the writer hot path adds one boolean
check per row; the flag is a `final` field set once in the writer constructor.
- **Forward-compat**: pre-1.3.0 readers ignore the new flag and treat the
table as NONE — no silent wrong results.
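
To make the configuration item above concrete, here is a minimal sketch of the two options a writer would set to opt in to COMMIT_TIME_ONLY. The property keys come from this PR description; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: builds the option pair that selects COMMIT_TIME_ONLY. */
final class CommitTimeOnlyOptionsSketch {

  private CommitTimeOnlyOptionsSketch() {}

  static Map<String, String> commitTimeOnlyOptions() {
    Map<String, String> opts = new LinkedHashMap<>();
    // Turn off the full set of meta columns...
    opts.put("hoodie.populate.meta.fields", "false");
    // ...and opt back in to _hoodie_commit_time only.
    opts.put("hoodie.meta.fields.commit.time.enabled", "true");
    return opts;
  }
}
```

With Spark's Java API, a map like this could be passed through `DataFrameWriter.options(...)` alongside the usual Hudi write options. Because the flag is persisted at table init, it presumably needs to be present on the write that creates the table.
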
### Risk Level
low
Additive change with a narrow scope. The default path is untouched. The
validation guard rejects the ambiguous combination loudly at writer init.
Existing `TestHoodieTableConfig` regression coverage (93 tests) passes
unchanged.
### Documentation Update
- New config `hoodie.meta.fields.commit.time.enabled` documented via
`@ConfigProperty` annotation on `HoodieTableConfig`.
- No public-facing docs update needed in this patch; if the website page on
meta fields exists, a separate docs PR will add the three-mode table.
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed