cshuo commented on issue #18974: URL: https://github.com/apache/hudi/issues/18974#issuecomment-4704504067
I dug into the Spark writer implementation a bit more and found that the Spark write path is actually not broken for ByteType/ShortType. Even though HoodieSchema collapses them into INT, the Spark writer widens byte/short → int before storing, as part of the normal record-rewrite step, so the stored row, data-file schema, and read path stay consistently IntegerType, with no getInt()/getShort() mismatch.

https://github.com/apache/hudi/blob/1a8c8ad1f85129d70a580d0b1965da150c0ed352/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieCreateRecordUtils.scala#L189-L197

(Short/Byte values in `sourceRow` are rewritten into Int values in `targetRow`.)

In other words, on the Spark side this is a type-fidelity characteristic (ShortType round-trips back as IntegerType), not a write-correctness bug.

Given that, we have a third option that's narrower than the two above: have the Flink writer do the same pre-write RowData rewrite and widening as Spark. The goal here is just to make Flink support writing/reading TINYINT/SMALLINT columns without requiring native TINYINT/SMALLINT types in HoodieSchema. Just like Spark: accept the columns, but widen them to INT for storage and read them back as INT.

- Write: insert a RowData conversion (natural hook: RowDataToHoodieFunction, plus the bulk-insert path) that reads byte/short values and widens them to int in a new RowData built against the target RowType reconstructed from the schema (already INT). After this, the RowData, parquet, and Avro schemas are all consistently INT, so any code path that rebuilds a RowType from the schema and accesses the RowData no longer hits a getInt()-on-physical-short ClassCastException.
- Read: the read RowType is already derived from the Avro schema (INT), so the column naturally comes back as int, and we need to rewrite the int value to the original TINYINT/SMALLINT type before emitting downstream.
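To make the write and read steps concrete, here is a minimal, dependency-free sketch of the widen/narrow idea. It is not Hudi or Flink code: `NarrowIntWidening`, `ColumnType`, and the `Object[]` row are hypothetical stand-ins for the real conversion, Flink's `RowType`, and `RowData`, and the actual hook points may look quite different.

```java
import java.util.Arrays;

/** Hypothetical sketch of the proposed conversion; none of these names exist in Hudi or Flink. */
final class NarrowIntWidening {

    /** Simplified logical column types; a real implementation would inspect Flink's RowType. */
    enum ColumnType { TINYINT, SMALLINT, INT, OTHER }

    /** True when at least one column is TINYINT/SMALLINT, i.e. a rewrite is needed at all. */
    static boolean needsRewrite(ColumnType[] types) {
        for (ColumnType t : types) {
            if (t == ColumnType.TINYINT || t == ColumnType.SMALLINT) {
                return true;
            }
        }
        return false;
    }

    /** Write side: copy the row and turn Byte/Short fields into Integer. Nulls pass through. */
    static Object[] widen(Object[] row, ColumnType[] types) {
        if (!needsRewrite(types)) {
            return row; // no narrow columns: return the input untouched
        }
        Object[] out = Arrays.copyOf(row, row.length);
        for (int i = 0; i < row.length; i++) {
            if (row[i] == null) {
                continue;
            }
            switch (types[i]) {
                case TINYINT:
                    out[i] = ((Byte) row[i]).intValue();
                    break;
                case SMALLINT:
                    out[i] = ((Short) row[i]).intValue();
                    break;
                default:
                    break;
            }
        }
        return out;
    }

    /** Read side: turn stored Integer fields back into the declared Byte/Short type. */
    static Object[] narrow(Object[] stored, ColumnType[] types) {
        if (!needsRewrite(types)) {
            return stored;
        }
        Object[] out = Arrays.copyOf(stored, stored.length);
        for (int i = 0; i < stored.length; i++) {
            if (stored[i] == null) {
                continue;
            }
            switch (types[i]) {
                case TINYINT:
                    out[i] = ((Integer) stored[i]).byteValue();
                    break;
                case SMALLINT:
                    out[i] = ((Integer) stored[i]).shortValue();
                    break;
                default:
                    break;
            }
        }
        return out;
    }
}
```

In this sketch the narrowing on read is lossless only because every stored value originally came from a byte or short; a real implementation would have to decide what to do if the stored INT is out of range for the declared type.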
This keeps HoodieSchema unchanged and aligns Flink's behavior with Spark: TINYINT/SMALLINT become writable/readable, stored and returned as INT. Worth noting: the rewrite/widen only kicks in when the schema actually contains TINYINT/SMALLINT. For tables without these types the conversion is a no-op (or can be skipped entirely), so there's no extra overhead in the common case.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
