Hi all,

Since config names are the first thing users see when working with Hudi and directly impact user and dev experience, we should pay careful attention to keeping them standardized and easy to remember and use. I wanted to start this thread to raise some points so we can establish a set of standards and create a migration path.

1. Plural vs Singular

If a config accepts multiple values, its name should be plural wherever applicable. E.g., since Hudi 1.1 supports multiple ordering fields, we should make `hoodie.datasource.write.precombine.field` plural. To show that we take this seriously, we should treat this kind of misleading config name (singular but accepting multiple values) as a bug.

2. Namespaces

Always start with `hoodie.<function area>.` as the namespace to denote the area the config serves. E.g.,

- `hoodie.table.*` is always a table config
- `hoodie.write.*` is meant for writers to set
- `hoodie.read.*` is meant for query engines to use
- `hoodie.<compaction|clustering|cleaning|indexing>.*` always denotes table-service-specific configs
- `hoodie.<storage>.*` indicates configs that control storage layer settings
- `hoodie.table.metadata.*` is specific to the metadata table

Keep these namespaces a fixed set of constants (a mandatory enum for composing config names), and do not casually change the words, like `compaction` vs `compact`, or `cleaning` vs `clean`.

3. snake_case

Use `.` to delimit functionally distinct words and `_` (snake_case) to connect the words of a meaningful phrase. For example:

- `hoodie.table.recordkey.fields` should be `hoodie.table.record_key.fields`, as `recordkey` is not one word and should follow snake_case.
- `hoodie.table.keygenerator.class` should be `hoodie.table.key_generator.class`, for a similar reason.
- `hoodie.table.index.defs.path` should be `hoodie.table.index_defs.path`: "index defs" together names one thing, while reading "index" and "defs" separately does not convey meaningful info about this config.
- `hoodie.file.group.reader.enabled` should be `hoodie.file_group.reader.enabled`, for a similar reason.

4. `hoodie.properties` only for catalog/table configs

Only keep catalog/table configs in `hoodie.properties`; keep configs like `hoodie.datasource.write.*` out of it, and add new table configs for those that do not have a table config alias.
E.g., remove `hoodie.datasource.write.hive_style_partitioning` and use `hoodie.table.hive_style_partitioning` instead.

5. Improve naming case by case

Some examples to consider:

- Move all `hoodie.datasource.write.*` to `hoodie.write.*` to keep things shorter
- End all feature-switching configs with `enabled`, never mixed with `enable`
- Move all meta/hive-sync related configs to `hoodie.catalog.sync.*`, clearly stating that they work with catalogs and that the function is "sync"

6. Standardize shorthand property names in SQL TBLPROPERTIES

Everyone's first example of running Hudi has contained something like this:

    TBLPROPERTIES (
      primaryKey = 'id',
      preCombineField = 'ts'
    );

Let's fix it:

- "record key" is the term in Hudi, so we don't want people to have to remember that "primary key" means record key; make sure the plural rule applies
- "ordering field" is the newer term, so let's deprecate the term "pre-combine field"; make sure the plural rule applies here too
- again, snake_case all the way, so it should look like below (omitting the `hoodie.table.` namespace) so people can easily associate the shorthand with the full name:

    TBLPROPERTIES (
      record_key.fields = 'id',
      ordering.fields = 'ts'
    );

- in cases where non-table configs need to be put in TBLPROPERTIES(), we can just omit `hoodie.` since we have `USING HUDI` in the SQL, so it should support shorthand keys like `read.*`, `write.*`, and `storage.*`

7. Address discrepancies between Flink options and Spark options

Do a one-time sweep of Flink configs that diverge from Spark configs, and align them according to the standards we're setting. The goals are:

- All `hoodie.*` configs should be engine-agnostic and universally accepted by all engines when applicable
- Any engine-specific config should be owned by the engine and start with `hudi.` (like the Trino Hudi connector does now)

About migration: we should start adding new config names while keeping the old ones compatible as aliases.
That means that, throughout the codebase, config variables will use the standard strings as their names, and any user-provided config will be translated to its new name if applicable. We don't really want to fail writers/readers just because of old config names, so we can keep the aliases for quite some time, but there have to be deprecation warnings starting now, and the aliases should be dropped at some major release (like 2.0 or 3.0). Before that, any table version upgrade should strive to rename the configs in `hoodie.properties` per the standards to evangelize the new names.

Best,
Shiyan
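
P.S. To make the alias idea a bit more concrete, here is a minimal Python sketch of the translation step. It is not Hudi code. The `ALIASES` table and the `translate_configs` helper are hypothetical names made up for illustration. The table only contains renames proposed in this thread. The rule that an explicitly set new name wins over its old alias is my assumption and is open for discussion.

```python
import warnings

# Hypothetical alias table: old name -> proposed standard name.
# Only renames mentioned in this thread; the real list would come from the sweep.
ALIASES = {
    "hoodie.datasource.write.hive_style_partitioning": "hoodie.table.hive_style_partitioning",
    "hoodie.table.recordkey.fields": "hoodie.table.record_key.fields",
    "hoodie.table.keygenerator.class": "hoodie.table.key_generator.class",
    "hoodie.table.index.defs.path": "hoodie.table.index_defs.path",
    "hoodie.file.group.reader.enabled": "hoodie.file_group.reader.enabled",
}


def translate_configs(user_configs):
    """Return a copy of user_configs with deprecated keys renamed to standard names."""
    translated = {}
    # First pass: keys that are already standard (or unknown) are copied as-is.
    for key, value in user_configs.items():
        if key not in ALIASES:
            translated[key] = value
    # Second pass: an old name only fills in a value if the new name was not
    # set explicitly (assumed precedence rule: the new name wins).
    for key, value in user_configs.items():
        if key in ALIASES:
            new_key = ALIASES[key]
            warnings.warn(
                f"Config '{key}' is deprecated; use '{new_key}' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            translated.setdefault(new_key, value)
    return translated
```

With this sketch, `translate_configs({"hoodie.table.recordkey.fields": "id"})` returns `{"hoodie.table.record_key.fields": "id"}` and emits one deprecation warning, so old writer/reader configs keep working while users are nudged toward the new names.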