cshuo commented on issue #18974: URL: https://github.com/apache/hudi/issues/18974#issuecomment-4703894877
> [@cshuo](https://github.com/cshuo) what changes we need to make if TINYINT and SMALLINT are introduced, for e.g, do we need fixes to the col_stats?

Yes, `col_stats` will need a small follow-up, but a new stats type is not necessarily required. `TINYINT` / `SMALLINT` can continue to use the existing integer stats path; we just need the new schema types to be recognized there.

More broadly, supporting `TINYINT` / `SMALLINT` involves at least these areas:

1. Shared schema model
   `HoodieSchema` / `HoodieSchemaType` needs to represent `TINYINT` and `SMALLINT` explicitly, while keeping compatible underlying storage.

2. Engine schema converters
   Spark, Flink and other engine converters need to preserve width when converting `TINYINT` / `SMALLINT` types.

3. Writer schema reconstruction
   Any writer path that rebuilds engine-native schema from `HoodieSchema` needs to round-trip these types correctly, otherwise they still get widened back to `INT`.

4. Metadata / stats / utility handling
   Places that switch on `HoodieSchemaType` for metadata handling, comparisons, defaults, partition parsing, or column stats need to recognize the new types and route them through the appropriate existing integer behavior.

5. Existing table compatibility
   The physical storage risk is relatively small, since this is still backed by the same underlying integer family with extra logical-width information on the Hudi schema side.
   The bigger risk is schema semantics for existing tables. Older tables already record these fields as `INT` in `HoodieSchema`. Once new code starts materializing `TINYINT` / `SMALLINT` explicitly, the system will see a schema difference where previously it saw the same type. That can affect schema validation, schema evolution checks, and concurrent schema conflict resolution.

6. Schema evolution semantics
   This likely needs explicit rules.
   The tricky part is that an existing `INT` in table schema is ambiguous: it may be a real `INT`, or it may be a legacy-collapsed `TINYINT` / `SMALLINT`. So we cannot blindly treat `INT -> SMALLINT/TINYINT` as a universally safe evolution rule.

7. Compatibility and tests
   We need round-trip and writer-path coverage for both Spark and Flink, plus targeted checks for metadata-related paths such as column stats, as well as validation around existing-table behavior and schema evolution.

So my current view is:

- `col_stats` is a small follow-up, the main work is not just schema enums, but making sure width survives the full path from engine schema to `HoodieSchema` to reconstructed engine schema to writer
- the main design sensitivity is around existing tables and schema evolution, because old `INT` does not tell us whether it was originally a true `INT` or a collapsed narrower engine type

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
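
The round-trip and ambiguity concerns raised in the comment above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `WidthRoundTripSketch`, `EngineType`, `SchemaType`, and every method below are hypothetical stand-ins invented for illustration, not Hudi, Spark, or Flink APIs. It shows a legacy conversion that collapses narrow integers to `INT`, a width-aware conversion that round-trips them, a stats check that routes all three types through one integer path, and why a stored `INT` in a legacy table cannot be mapped back to a single original type.

```java
import java.util.EnumSet;
import java.util.Set;

// Toy model only: these names are hypothetical and do not exist in Hudi.
final class WidthRoundTripSketch {

    /** Integer types as an engine (e.g. Spark or Flink) might expose them. */
    enum EngineType { TINYINT, SMALLINT, INT }

    /** Integer types in a hypothetical shared schema model. */
    enum SchemaType { TINYINT, SMALLINT, INT }

    /** Legacy behavior described in the comment: narrow types collapse to INT. */
    static SchemaType toSchemaLegacy(EngineType type) {
        return SchemaType.INT;
    }

    /** Width-aware behavior: the logical width is kept in the shared schema. */
    static SchemaType toSchemaWidthAware(EngineType type) {
        switch (type) {
            case TINYINT:  return SchemaType.TINYINT;
            case SMALLINT: return SchemaType.SMALLINT;
            default:       return SchemaType.INT;
        }
    }

    /** Writer-side reconstruction of the engine type from the shared schema. */
    static EngineType toEngine(SchemaType type) {
        switch (type) {
            case TINYINT:  return EngineType.TINYINT;
            case SMALLINT: return EngineType.SMALLINT;
            default:       return EngineType.INT;
        }
    }

    /** All three types are routed through the same integer stats handling. */
    static boolean usesIntegerStats(SchemaType type) {
        return EnumSet.of(SchemaType.TINYINT, SchemaType.SMALLINT, SchemaType.INT)
                .contains(type);
    }

    /**
     * Engine types that could have produced the stored schema type.
     * In a legacy table, INT is ambiguous because narrow types were collapsed.
     */
    static Set<EngineType> possibleOrigins(SchemaType stored, boolean legacyTable) {
        if (legacyTable && stored == SchemaType.INT) {
            return EnumSet.allOf(EngineType.class);
        }
        return EnumSet.of(toEngine(stored));
    }
}
```

In this model, `toEngine(toSchemaLegacy(EngineType.TINYINT))` yields `INT`, while the width-aware path returns `TINYINT`. `possibleOrigins(SchemaType.INT, true)` returns all three engine types, which is the reason an `INT -> SMALLINT/TINYINT` rule cannot be assumed safe for existing tables.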
