[I] HoodieSchema collapses TINYINT/SMALLINT into INT and loses engine type width needed by writer paths [hudi]

via GitHub Thu, 11 Jun 2026 00:22:40 -0700


cshuo opened a new issue, #18974:
URL: https://github.com/apache/hudi/issues/18974


   ### Task Description
   
   **Describe the problem**
   
   `HoodieSchema` currently collapses engine-level small integer types into a 
single `INT` type, which loses integer width information needed to faithfully 
reconstruct engine-native schemas.
   
   Examples:
   - Spark: `ByteType | ShortType | IntegerType -> HoodieSchemaType.INT`
   - Flink: `TINYINT | SMALLINT | INTEGER -> HoodieSchemaType.INT`
   
   This becomes a problem because writer paths are built from `HoodieSchema`, 
and later reconstruct engine-native schemas from it.
   
   On Spark, the path is roughly:
   
   1. `HoodieSparkSchemaConverters.toHoodieType(...)` maps 
`ByteType/ShortType/IntegerType` to `HoodieSchemaType.INT`
   2. `HoodieSparkFileWriterFactory` reconstructs `StructType` from 
`HoodieSchema`
   3. `HoodieSparkSchemaConverters.toSqlType(...)` maps `HoodieSchemaType.INT` 
back to `IntegerType`
   4. `HoodieRowParquetWriteSupport` makes type-dependent row accessor 
decisions from that reconstructed `StructType`
   
   At that point, the original Spark type width is already lost. For example, 
an original `ShortType` field is reconstructed as `IntegerType`, and writer 
code may go through `row.getInt(...)` instead of `row.getShort(...)`.
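
To illustrate why the accessor choice matters, here is a minimal, self-contained sketch. `BoxedRow` is a hypothetical stand-in for an object-backed row; it is not a Spark or Hudi class. It assumes fields are stored boxed and that each typed getter casts to its own boxed type. Under that assumption, reading a 16-bit field through the `int` getter fails at runtime instead of being widened.

```java
// Hypothetical stand-in for an object-backed row; not a Spark or Hudi class.
final class BoxedRow {
    private final Object[] fields;

    BoxedRow(Object... fields) {
        this.fields = fields;
    }

    // Each typed getter casts to its own boxed type,
    // so the getter has to match the width the value was stored with.
    short getShort(int ordinal) {
        return (Short) fields[ordinal];
    }

    int getInt(int ordinal) {
        return (Integer) fields[ordinal];
    }
}

final class AccessorMismatchDemo {
    // Returns true if reading the field through getInt throws ClassCastException.
    static boolean getIntFails(BoxedRow row, int ordinal) {
        try {
            row.getInt(ordinal);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The field holds a 16-bit value, as a ShortType column would.
        BoxedRow row = new BoxedRow((short) 7);
        System.out.println("getShort -> " + row.getShort(0));
        System.out.println("getInt fails -> " + getIntFails(row, 0));
    }
}
```

How a real engine row reacts to a mismatched getter depends on its implementation; the sketch only shows that a schema widened to `IntegerType` can steer the writer to an accessor that no longer matches the data.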
   
   Flink has the same issue in principle:
   - `TINYINT/SMALLINT/INTEGER` are also collapsed into `HoodieSchemaType.INT`
   - writer code later reconstructs `RowType` from `HoodieSchema`
   - the original integer width is no longer recoverable from `HoodieSchema` 
alone
   
   So this is not just a schema round-trip fidelity issue. `HoodieSchema` 
currently does not preserve enough information for engine-native writer 
construction.
   
   **To Reproduce**
   
   1. Create a Spark `StructType` containing `ByteType` or `ShortType`, or a Flink `RowType` containing `TINYINT` or `SMALLINT`
   2. Convert it to `HoodieSchema`
   3. Reconstruct the engine-native schema from that `HoodieSchema`
   4. Observe that the field has already been widened to `INT` / `IntegerType`
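
The steps above can be modeled without any engine on the classpath. The following is a simplified sketch, not Hudi code: `EngineType`, `CollapsedSchemaType`, and `LossyRoundTrip` are hypothetical names that only mirror the many-to-one mapping described in this issue.

```java
// Hypothetical, simplified model of the conversion path; none of these are Hudi classes.
enum EngineType { TINYINT, SMALLINT, INTEGER }

enum CollapsedSchemaType { INT }

final class LossyRoundTrip {
    // Step 2: engine type -> schema type. All three widths collapse into INT.
    static CollapsedSchemaType toSchemaType(EngineType type) {
        switch (type) {
            case TINYINT:
            case SMALLINT:
            case INTEGER:
                return CollapsedSchemaType.INT;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // Step 3: schema type -> engine type. INT can only map back to a single width.
    static EngineType toEngineType(CollapsedSchemaType type) {
        switch (type) {
            case INT:
                return EngineType.INTEGER;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    static EngineType roundTrip(EngineType type) {
        return toEngineType(toSchemaType(type));
    }

    public static void main(String[] args) {
        // Step 4: print what each original type comes back as.
        for (EngineType type : EngineType.values()) {
            System.out.println(type + " -> " + roundTrip(type));
        }
    }
}
```

In this model every input comes back as `INTEGER`, because the reverse mapping has nothing but `INT` to go on.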
   
   **Expected behavior**
   
   `HoodieSchema` should preserve integer width information so that:
   - Spark `ByteType`, `ShortType`, `IntegerType`
   - Flink `TINYINT`, `SMALLINT`, `INTEGER`
   
   do not all collapse into the same schema type.
   
   **Environment Description**
   
   Hudi version: current master / current branch
   
   **Possible solution**
   
   `rfc-99` already describes primitive integer widths such as `TINYINT` and `SMALLINT` as first-class logical types. We should add these as native types in `HoodieSchema` and preserve them across the Spark/Flink schema conversion paths.
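
A rough sketch of what width-preserving conversion could look like, using a simplified model rather than the actual `HoodieSchema` API. `NativeIntType`, `WidthAwareSchemaType`, and `WidthPreservingRoundTrip` are hypothetical names, and representing the new types as enum constants is an assumption made only for illustration.

```java
// Hypothetical model; the real HoodieSchema representation may differ.
enum NativeIntType { TINYINT, SMALLINT, INTEGER }

// Assumed schema-side types with one entry per integer width.
enum WidthAwareSchemaType { TINYINT, SMALLINT, INT }

final class WidthPreservingRoundTrip {
    // Engine type -> schema type, one-to-one so the width survives.
    static WidthAwareSchemaType toSchemaType(NativeIntType type) {
        switch (type) {
            case TINYINT:
                return WidthAwareSchemaType.TINYINT;
            case SMALLINT:
                return WidthAwareSchemaType.SMALLINT;
            case INTEGER:
                return WidthAwareSchemaType.INT;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // Schema type -> engine type, the exact inverse of toSchemaType.
    static NativeIntType toEngineType(WidthAwareSchemaType type) {
        switch (type) {
            case TINYINT:
                return NativeIntType.TINYINT;
            case SMALLINT:
                return NativeIntType.SMALLINT;
            case INT:
                return NativeIntType.INTEGER;
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // True if every engine type survives the round trip unchanged.
    static boolean isLossless() {
        for (NativeIntType type : NativeIntType.values()) {
            if (toEngineType(toSchemaType(type)) != type) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("lossless round trip: " + isLossless());
    }
}
```

Because the mapping is one-to-one in both directions, a writer that rebuilds the engine schema from the schema type would see the original width and could pick the matching accessor.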
   
   ### Task Type
   
   Code improvement/refactoring
   
   


