Re: [I] HoodieSchema collapses TINYINT/SMALLINT into INT and loses engine type width needed by writer paths [hudi]

via GitHub Sun, 14 Jun 2026 19:00:27 -0700


cshuo commented on issue #18974:
URL: https://github.com/apache/hudi/issues/18974#issuecomment-4703894877


   > [@cshuo](https://github.com/cshuo) what changes do we need to make if `TINYINT` 
and `SMALLINT` are introduced? For example, do we need fixes to `col_stats`?
   
   Yes, `col_stats` will need a small follow-up, but a new stats type is not 
necessarily required. `TINYINT` / `SMALLINT` can continue to use the existing 
integer stats path; we just need the new schema types to be recognized there.
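
As a rough illustration of "recognized there", here is a minimal Java sketch. It assumes stats handling is selected by a switch on the schema type. `SketchType`, `StatsKind` and `ColStatsSketch` are hypothetical names made up for this example, not Hudi's actual classes or its `col_stats` API.

```java
// Hypothetical stand-in for a schema type enum; not Hudi's real HoodieSchemaType.
enum SketchType { TINYINT, SMALLINT, INT, LONG, STRING }

final class ColStatsSketch {
    // Hypothetical stats kinds; the point is that no new kind is added for narrow ints.
    enum StatsKind { INTEGER, LONG, STRING }

    // TINYINT and SMALLINT fall through to the existing integer stats path.
    static StatsKind statsKindFor(SketchType type) {
        switch (type) {
            case TINYINT:
            case SMALLINT:
            case INT:
                return StatsKind.INTEGER;
            case LONG:
                return StatsKind.LONG;
            case STRING:
                return StatsKind.STRING;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // Min/max on the integer path, computed over values already widened to int.
    static int[] minMax(int[] values) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new int[] {min, max};
    }
}
```

Without the two extra `case` labels, a switch like this would hit the `default` branch for the new types, which is the kind of gap the follow-up would need to close.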
   
   More broadly, supporting `TINYINT` / `SMALLINT` involves at least these 
areas:
   
   1. Shared schema model  
      `HoodieSchema` / `HoodieSchemaType` needs to represent `TINYINT` and 
`SMALLINT` explicitly, while keeping compatible underlying storage.
   
   2. Engine schema converters  
      Spark, Flink and other engine converters need to preserve width when 
converting `TINYINT` / `SMALLINT` types.
   
   3. Writer schema reconstruction  
      Any writer path that rebuilds engine-native schema from `HoodieSchema` 
needs to round-trip these types correctly, otherwise they still get widened 
back to `INT`.
   
   4. Metadata / stats / utility handling  
      Places that switch on `HoodieSchemaType` for metadata handling, 
comparisons, defaults, partition parsing, or column stats need to recognize the 
new types and route them through the appropriate existing integer behavior.
   
   5. Existing table compatibility  
      The physical storage risk is relatively small, since this is still backed 
by the same underlying integer family with extra logical-width information on 
the Hudi schema side.  
      The bigger risk is schema semantics for existing tables. Older tables 
already record these fields as `INT` in `HoodieSchema`. Once new code starts 
materializing `TINYINT` / `SMALLINT` explicitly, the system will see a schema 
difference where previously it saw the same type. That can affect schema 
validation, schema evolution checks, and concurrent schema conflict resolution.
   
   6. Schema evolution semantics  
      This likely needs explicit rules. The tricky part is that an existing 
`INT` in the table schema is ambiguous: it may be a real `INT`, or it may be a 
legacy-collapsed `TINYINT` / `SMALLINT`. So we cannot blindly treat 
`INT -> SMALLINT/TINYINT` as a universally safe evolution rule.
   
   7. Compatibility and tests  
      We need round-trip and writer-path coverage for both Spark and Flink, 
plus targeted checks for metadata-related paths such as column stats, as well 
as validation around existing-table behavior and schema evolution.
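
To make the round-trip requirement in items 1 to 3 concrete, here is a small Java sketch. All names (`EngineType`, `LogicalType`, `WidthRoundTripSketch`) are hypothetical and only model the idea; they are not Hudi, Spark or Flink APIs. The sketch assumes all three logical types share a 32-bit physical representation and differ only in recorded width.

```java
// Hypothetical engine-side types (for example a byte, short and int column type).
enum EngineType { BYTE, SHORT, INT }

// Hypothetical logical types on the shared schema side. All three are assumed
// to be stored as a 32-bit integer; bitWidth is the extra logical information.
enum LogicalType {
    TINYINT(8), SMALLINT(16), INT(32);

    final int bitWidth;

    LogicalType(int bitWidth) {
        this.bitWidth = bitWidth;
    }
}

final class WidthRoundTripSketch {
    // Width-preserving conversion: engine type -> shared schema type.
    static LogicalType toLogical(EngineType t) {
        switch (t) {
            case BYTE: return LogicalType.TINYINT;
            case SHORT: return LogicalType.SMALLINT;
            default: return LogicalType.INT;
        }
    }

    // Writer-side reconstruction: shared schema type -> engine type.
    static EngineType toEngine(LogicalType t) {
        switch (t) {
            case TINYINT: return EngineType.BYTE;
            case SMALLINT: return EngineType.SHORT;
            default: return EngineType.INT;
        }
    }

    // Models the collapsing behavior described in this issue: width is dropped.
    static LogicalType toLogicalLegacy(EngineType t) {
        return LogicalType.INT;
    }
}
```

With the width-preserving pair, `toEngine(toLogical(t))` returns `t` for every engine type. With the legacy conversion, a `BYTE` column comes back as `INT`, which is the widening the writer paths currently see.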
   
   So my current view is:
   - `col_stats` is a small follow-up. The main work is not just adding schema 
enums, but making sure width survives the full path: engine schema -> 
`HoodieSchema` -> reconstructed engine schema -> writer.
   - The main design sensitivity is around existing tables and schema 
evolution, because an old `INT` does not tell us whether it was originally a true 
`INT` or a collapsed narrower engine type.
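
The ambiguity around old `INT` fields can be sketched as a compatibility check that refuses to decide the narrowing case on its own. This is only an illustration under assumptions: `FieldType`, `Verdict` and `EvolutionRuleSketch` are hypothetical names, treating widening as acceptable is an assumption of the sketch, and the actual rules for Hudi are still to be designed.

```java
// Hypothetical field types for the sketch.
enum FieldType { TINYINT, SMALLINT, INT }

final class EvolutionRuleSketch {
    enum Verdict { SAME, WIDENING, NEEDS_EXPLICIT_RULE }

    static int bits(FieldType t) {
        switch (t) {
            case TINYINT: return 8;
            case SMALLINT: return 16;
            default: return 32;
        }
    }

    // Compares the type recorded in the table schema with the incoming writer type.
    static Verdict check(FieldType tableType, FieldType incomingType) {
        if (tableType == incomingType) {
            return Verdict.SAME;
        }
        if (bits(incomingType) > bits(tableType)) {
            // Assumption: widening within the integer family is acceptable.
            return Verdict.WIDENING;
        }
        // Incoming type is narrower. A table INT may be a true INT or a
        // legacy-collapsed TINYINT/SMALLINT, and the schema alone cannot tell.
        return Verdict.NEEDS_EXPLICIT_RULE;
    }
}
```

For example, `check(FieldType.INT, FieldType.SMALLINT)` yields `NEEDS_EXPLICIT_RULE` rather than silently accepting or rejecting the change, which is where an explicit policy for existing tables would plug in.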


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

