Re: [I] HoodieSchema collapses TINYINT/SMALLINT into INT and loses engine type width needed by writer paths [hudi]

via GitHub Sun, 14 Jun 2026 21:15:43 -0700


cshuo commented on issue #18974:
URL: https://github.com/apache/hudi/issues/18974#issuecomment-4704504067


   I dug into the Spark writer implementation a bit more and found that the 
Spark write path is actually not broken for ByteType/ShortType.
   
   Even though HoodieSchema collapses them into INT, the Spark writer widens 
byte/short → int before storing as part of the normal record-rewrite step, so 
the stored row, data-file schema, and read path stay consistently IntegerType — 
no getInt()/getShort() mismatch. 
https://github.com/apache/hudi/blob/1a8c8ad1f85129d70a580d0b1965da150c0ed352/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCreateRecordUtils.scala#L189-L197
   
   (Short/Byte values in `sourceRow` are rewritten as Int values in `targetRow`.)
   
   In other words, on the Spark side this is a type-fidelity characteristic 
(ShortType round-trips back as IntegerType), not a write-correctness bug.
   
   Given that, we have a third option that's narrower than the two above: have 
the Flink writer do the same pre-write RowData rewrite and widening as Spark.
   
   The goal here is just to make Flink support writing/reading TINYINT/SMALLINT 
columns, without requiring native TINYINT/SMALLINT types in HoodieSchema. Just 
like Spark: accept the columns but widen them to INT for storage and read them 
back as INT.
   
   - Write: insert a RowData conversion (natural hook: RowDataToHoodieFunction, 
plus the bulk-insert path) that reads byte/short and widens them to int into a 
new RowData built against the target RowType reconstructed from the schema 
(already INT). After this, RowData / parquet / Avro schema are all consistently 
INT, so any code path that rebuilds a RowType from the schema and accesses the 
RowData no longer hits a getInt()-on-physical-short ClassCastException.
   - Read: the read RowType is already derived from the Avro schema (INT), so 
the column naturally comes back as int; we then need to rewrite the int value 
to the original TINYINT/SMALLINT type before emitting it downstream.
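
   To make the Write and Read steps above concrete, here is a minimal, dependency-free sketch. It does not use Flink's RowData API or any Hudi class: a row is modeled as a plain `Object[]`, and `ColType`, `getInt`, `widenForWrite`, and `narrowForRead` are hypothetical names made up for illustration. The `getInt` helper only assumes that a positional accessor casts the stored object to `Integer`, which is what makes an unwidened short fail.

```java
class WidenNarrowSketch {
    // Hypothetical stand-in for engine column types; not a Flink or Hudi class.
    enum ColType { TINYINT, SMALLINT, INT, STRING }

    // Models a positional accessor that assumes the field is physically an Integer.
    static int getInt(Object[] row, int pos) {
        return (Integer) row[pos];
    }

    // Write side: build a new row in which byte/short fields are widened to Integer.
    static Object[] widenForWrite(Object[] row, ColType[] engineTypes) {
        Object[] out = new Object[row.length];
        for (int i = 0; i < row.length; i++) {
            Object v = row[i];
            if (v != null && engineTypes[i] == ColType.TINYINT) {
                out[i] = ((Byte) v).intValue();
            } else if (v != null && engineTypes[i] == ColType.SMALLINT) {
                out[i] = ((Short) v).intValue();
            } else {
                out[i] = v; // other types and nulls pass through unchanged
            }
        }
        return out;
    }

    // Fails loudly if a stored int does not fit the narrower engine type.
    static int checkRange(int n, int min, int max) {
        if (n < min || n > max) {
            throw new ArithmeticException("value out of range: " + n);
        }
        return n;
    }

    // Read side: narrow stored Integer fields back to the engine's declared width.
    static Object[] narrowForRead(Object[] stored, ColType[] engineTypes) {
        Object[] out = new Object[stored.length];
        for (int i = 0; i < stored.length; i++) {
            Object v = stored[i];
            if (v != null && engineTypes[i] == ColType.TINYINT) {
                out[i] = (byte) checkRange((Integer) v, Byte.MIN_VALUE, Byte.MAX_VALUE);
            } else if (v != null && engineTypes[i] == ColType.SMALLINT) {
                out[i] = (short) checkRange((Integer) v, Short.MIN_VALUE, Short.MAX_VALUE);
            } else {
                out[i] = v;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ColType[] types = { ColType.TINYINT, ColType.SMALLINT, ColType.INT, ColType.STRING };
        Object[] source = { (byte) 7, (short) 300, 42, "a" };

        try {
            getInt(source, 1); // physical Short read through an int accessor
        } catch (ClassCastException e) {
            System.out.println("unwidened row fails: " + e.getClass().getSimpleName());
        }

        Object[] stored = widenForWrite(source, types);
        System.out.println("stored value read as int: " + getInt(stored, 1));

        Object[] emitted = narrowForRead(stored, types);
        System.out.println("emitted as Short: " + (emitted[1] instanceof Short));
    }
}
```

   In a real implementation the same per-field logic would presumably sit behind RowData field getters and the writer/reader hooks named above; the range check on the read side is an extra safeguard added in this sketch, not something the proposal requires.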
   
   This keeps HoodieSchema unchanged and aligns Flink's behavior with Spark — 
TINYINT/SMALLINT become writable/readable, stored and returned as INT. Worth 
noting: the rewrite/widen only kicks in when the schema actually contains 
TINYINT/SMALLINT — for tables without these types the conversion is a no-op (or 
can be skipped entirely), so there's no extra overhead in the common case.
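
   One way to get the "skipped entirely" behavior is to decide once per schema whether a converter is needed at all. The sketch below is illustrative only: rows are plain `Object[]` values rather than Flink RowData, and `ColType`, `needsWidening`, and `converterFor` are hypothetical names, not existing Hudi or Flink APIs.

```java
import java.util.function.UnaryOperator;

class ConverterSelectionSketch {
    // Hypothetical stand-in for engine column types; not a Flink or Hudi class.
    enum ColType { TINYINT, SMALLINT, INT, STRING }

    // Checked once per schema, not once per record.
    static boolean needsWidening(ColType[] types) {
        for (ColType t : types) {
            if (t == ColType.TINYINT || t == ColType.SMALLINT) {
                return true;
            }
        }
        return false;
    }

    // Returns the identity function when no column needs widening,
    // so such tables never allocate a second row per record.
    static UnaryOperator<Object[]> converterFor(ColType[] types) {
        if (!needsWidening(types)) {
            return UnaryOperator.identity();
        }
        return row -> {
            Object[] out = new Object[row.length];
            for (int i = 0; i < row.length; i++) {
                Object v = row[i];
                boolean narrow = types[i] == ColType.TINYINT || types[i] == ColType.SMALLINT;
                if (narrow && v != null) {
                    // Byte and Short are both Numbers; widen whichever is present.
                    out[i] = ((Number) v).intValue();
                } else {
                    out[i] = v;
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        ColType[] plain = { ColType.INT, ColType.STRING };
        Object[] row = { 1, "a" };
        System.out.println("same instance returned: " + (converterFor(plain).apply(row) == row));
    }
}
```

   The schema check runs when the converter is built, so the per-record cost for tables without TINYINT/SMALLINT is a single pass-through call.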


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
