cshuo opened a new issue, #18974: URL: https://github.com/apache/hudi/issues/18974
### Task Description

**Describe the problem**

`HoodieSchema` currently collapses engine-level small integer types into a single `INT` type, which loses integer width information needed to faithfully reconstruct engine-native schemas.

Examples:

- Spark: `ByteType | ShortType | IntegerType -> HoodieSchemaType.INT`
- Flink: `TINYINT | SMALLINT | INTEGER -> HoodieSchemaType.INT`

This becomes a problem because writer paths are built from `HoodieSchema`, and later reconstruct engine-native schemas from it.

On Spark, the path is roughly:

1. `HoodieSparkSchemaConverters.toHoodieType(...)` maps `ByteType/ShortType/IntegerType` to `HoodieSchemaType.INT`
2. `HoodieSparkFileWriterFactory` reconstructs `StructType` from `HoodieSchema`
3. `HoodieSparkSchemaConverters.toSqlType(...)` maps `HoodieSchemaType.INT` back to `IntegerType`
4. `HoodieRowParquetWriteSupport` makes type-dependent row accessor decisions from that reconstructed `StructType`

At that point, the original Spark type width is already lost. For example, an original `ShortType` field is reconstructed as `IntegerType`, and writer code may go through `row.getInt(...)` instead of `row.getShort(...)`.

Flink has the same issue in principle:

- `TINYINT/SMALLINT/INTEGER` are also collapsed into `HoodieSchemaType.INT`
- writer code later reconstructs `RowType` from `HoodieSchema`
- the original integer width is no longer recoverable from `HoodieSchema` alone

So this is not just a schema round-trip fidelity issue. `HoodieSchema` currently does not preserve enough information for engine-native writer construction.

**To Reproduce**

1. Create a Spark `StructType` or Flink `RowType` containing `TINYINT` or `SMALLINT`
2. Convert it to `HoodieSchema`
3. Reconstruct engine-native schema from that `HoodieSchema`
4. Observe that the field has already been widened to `INT` / `IntegerType`

**Expected behavior**

`HoodieSchema` should preserve integer width information so that:

- Spark `ByteType`, `ShortType`, `IntegerType`
- Flink `TINYINT`, `SMALLINT`, `INTEGER`

do not all collapse into the same schema type.

**Environment Description**

Hudi version: current master / current branch

**Possible solution**

`rfc-99` already describes primitive integer widths such as `TINYINT` and `SMALLINT` as first-class logical types. We should add native integer-width types such as `TINYINT` and `SMALLINT` to `HoodieSchema`, and preserve them across Spark/Flink schema conversion paths.

### Task Type

Code improvement/refactoring

### Related Issues

**Parent feature issue:** (if applicable)

**Related issues:**

NOTE: Use `Relationships` button to add parent/blocking issues after issue is created.
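**Illustrative sketch**

The round trip described in this issue can be modeled with a small, self-contained Java sketch. It is only a toy model: `IntWidthRoundTrip`, its enums, and its methods are hypothetical stand-ins and are not Hudi, Spark, or Flink APIs. It contrasts a mapping that collapses all widths into one type with a mapping that keeps one schema type per width, as proposed above.

```java
/**
 * Toy model of the lossy round trip described in this issue.
 * Every name here is hypothetical; nothing below is a Hudi, Spark, or Flink API.
 */
public class IntWidthRoundTrip {

    /** Stand-in for engine types such as Spark ByteType / ShortType / IntegerType. */
    public enum EngineType { BYTE, SHORT, INT }

    /** Stand-in for the current behavior: a single schema type for all widths. */
    public enum CollapsedType { INT }

    /** Stand-in for the proposed behavior: one schema type per integer width. */
    public enum WidthPreservingType { TINYINT, SMALLINT, INT }

    /** Every engine width maps to the same schema type, so the width is dropped. */
    public static CollapsedType toCollapsed(EngineType t) {
        return CollapsedType.INT;
    }

    /** With only one schema type available, the only possible answer is INT. */
    public static EngineType fromCollapsed(CollapsedType t) {
        return EngineType.INT;
    }

    /** Each engine width maps to its own schema type. */
    public static WidthPreservingType toPreserving(EngineType t) {
        switch (t) {
            case BYTE:
                return WidthPreservingType.TINYINT;
            case SHORT:
                return WidthPreservingType.SMALLINT;
            default:
                return WidthPreservingType.INT;
        }
    }

    /** The inverse mapping is well defined because no two widths share a type. */
    public static EngineType fromPreserving(WidthPreservingType t) {
        switch (t) {
            case TINYINT:
                return EngineType.BYTE;
            case SMALLINT:
                return EngineType.SHORT;
            default:
                return EngineType.INT;
        }
    }

    public static EngineType roundTripCollapsed(EngineType t) {
        return fromCollapsed(toCollapsed(t));
    }

    public static EngineType roundTripPreserving(EngineType t) {
        return fromPreserving(toPreserving(t));
    }

    public static void main(String[] args) {
        // Print both round trips for each engine type.
        for (EngineType t : EngineType.values()) {
            System.out.println(t
                + " | collapsed round trip: " + roundTripCollapsed(t)
                + " | width-preserving round trip: " + roundTripPreserving(t));
        }
    }
}
```

In this model, `SHORT` and `BYTE` come back as `INT` through the collapsed mapping, which mirrors the `row.getInt(...)` versus `row.getShort(...)` concern above, while the width-preserving mapping returns each type unchanged.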
