cshuo commented on issue #18711: URL: https://github.com/apache/hudi/issues/18711#issuecomment-4419033901
> Spark's read path already has HoodieSchema available at the point where Parquet → Spark type conversion happens.

A small clarification on the Spark comparison: `HoodieSparkParquetReader#getSchema()` still reads the Parquet `MessageType`, converts it to a Spark `StructType`, and then converts that to `HoodieSchema`; it does not use `HoodieSchema` to infer Hudi logical types there.

The important part is the actual record read path. Spark does not rely on `getSchema()` / Parquet physical inference to recover VECTOR semantics. In `getUnsafeRowIterator(requestedSchema)`, it uses the requested/writer `HoodieSchema` to detect vector columns, rewrites those columns to `BinaryType` for the physical Parquet read, and then converts the binary value back to the logical vector array after reading (a rough sketch of this pattern follows at the end of this comment).

So for Flink, I think the same principle should apply: avoid making the `Parquet MessageType -> RowType` conversion responsible for recovering Hudi-specific logical types. The Hudi read path should use the writer/table `HoodieSchema` to read the file and then project/convert to the requested schema. `FlinkRowDataReaderContext#getFileRecordIterator` already follows this model by receiving `dataSchema` and constructing the `RowType` from it before reading (see the second sketch below).

Given that, I would prefer Option A. The rationale is that the real Hudi read semantics should be schema-driven, and Parquet schema conversion should remain a generic/best-effort fallback rather than the authority for Hudi logical types.
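To make the Spark-side pattern concrete, here is a minimal sketch of the schema rewrite step. This is not the actual Hudi code: the `vectorColumns` set is assumed to have been derived from the writer/table `HoodieSchema` by the caller, and the binary → vector decode after the physical read is omitted since it depends on Hudi's vector encoding.

```java
import java.util.Set;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public final class VectorReadSketch {

  /**
   * Rewrites vector columns to BinaryType so the physical Parquet read sees
   * plain binary. vectorColumns is assumed to be derived from the writer/table
   * HoodieSchema, which is the authority for the VECTOR logical type; the
   * Parquet physical schema alone cannot recover it.
   */
  static StructType toPhysicalReadSchema(StructType requested, Set<String> vectorColumns) {
    StructField[] fields = new StructField[requested.fields().length];
    for (int i = 0; i < fields.length; i++) {
      StructField f = requested.fields()[i];
      fields[i] = vectorColumns.contains(f.name())
          // The physical read sees bytes; the logical array is restored after reading.
          ? new StructField(f.name(), DataTypes.BinaryType, f.nullable(), f.metadata())
          : f;
    }
    return new StructType(fields);
  }

  private VectorReadSketch() {}
}
```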

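And a corresponding sketch of the schema-driven flow I have in mind for the Flink side. The method shape loosely mirrors `FlinkRowDataReaderContext#getFileRecordIterator` as described above, but the helper names (`toRowType`, `readParquet`, `project`) are hypothetical stand-ins, not actual Hudi APIs, and the schema parameters are typed as `Object` only to avoid asserting the `HoodieSchema` signature.

```java
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;
import org.apache.hudi.common.util.collection.ClosableIterator;

/**
 * Sketch only: the read type comes from the writer/table schema, and the
 * requested schema is applied as a separate projection step afterwards.
 */
abstract class SchemaDrivenReadSketch {

  ClosableIterator<RowData> getFileRecordIterator(String filePath,
                                                  Object dataSchema,        // writer/table HoodieSchema
                                                  Object requestedSchema) { // reader-requested HoodieSchema
    // 1. Build the RowType from the writer/table schema, not from the Parquet
    //    MessageType, so Hudi logical types like VECTOR are recovered here.
    RowType readType = toRowType(dataSchema);

    // 2. Read the Parquet file using the writer-derived RowType.
    ClosableIterator<RowData> rows = readParquet(filePath, readType);

    // 3. Project/convert to the requested schema as a generic, separate step.
    return project(rows, toRowType(requestedSchema));
  }

  // Hypothetical helpers standing in for the real conversion/read/projection.
  abstract RowType toRowType(Object hoodieSchema);
  abstract ClosableIterator<RowData> readParquet(String path, RowType type);
  abstract ClosableIterator<RowData> project(ClosableIterator<RowData> rows, RowType target);
}
```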