cshuo commented on issue #18711:
URL: https://github.com/apache/hudi/issues/18711#issuecomment-4419033901

   > Spark's read path already has HoodieSchema available at the point where 
Parquet → Spark type conversion happens.
   
   Small clarification on the Spark comparison: 
`HoodieSparkParquetReader#getSchema()` still reads the Parquet `MessageType`, 
converts it to a Spark `StructType`, and then converts that to a `HoodieSchema`; 
it does not use `HoodieSchema` to infer Hudi logical types there.
   The important part is the actual record read path. Spark does not rely on 
`getSchema()` / physical Parquet inference to recover VECTOR semantics. In 
`getUnsafeRowIterator(requestedSchema)`, it uses the requested/writer 
`HoodieSchema` to detect vector columns, rewrites those columns to `BinaryType` 
for the physical Parquet read, and then converts the binary values back to the 
logical vector arrays after reading.
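   To make that mechanism concrete, here is a minimal, self-contained sketch of 
the same pattern outside Spark/Hudi. The class name, method names, and 
little-endian float packing are all hypothetical illustrations, not actual Hudi 
or Spark code: the logical schema says a column is a vector, the physical read 
only yields bytes, and the reader converts the bytes back afterwards.

   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import java.util.Arrays;

   // Hypothetical sketch of the "rewrite to binary, decode after reading" pattern.
   public class VectorBinaryRoundTrip {
       // Writer side: pack the logical float[] vector into the raw bytes that the
       // physical (binary-typed) Parquet read will later see.
       static byte[] encode(float[] vector) {
           ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES)
                                      .order(ByteOrder.LITTLE_ENDIAN);
           for (float v : vector) buf.putFloat(v);
           return buf.array();
       }

       // Reader side: the physical read produced raw bytes; the logical schema
       // says this column is a vector, so decode the bytes back into float[].
       static float[] decode(byte[] bytes) {
           ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
           float[] vector = new float[bytes.length / Float.BYTES];
           for (int i = 0; i < vector.length; i++) vector[i] = buf.getFloat();
           return vector;
       }

       public static void main(String[] args) {
           float[] original = {1.5f, -2.0f, 3.25f};
           float[] roundTripped = decode(encode(original));
           System.out.println(Arrays.toString(roundTripped)); // prints [1.5, -2.0, 3.25]
       }
   }
   ```

   The point is that the byte-level decode is driven by the logical schema the 
reader already holds, not by anything inferred from the file's physical types.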
   
   So for Flink, I think the same principle should apply: avoid making the 
Parquet `MessageType -> RowType` conversion responsible for recovering 
Hudi-specific logical types. The Hudi read path should use the writer/table 
`HoodieSchema` to read the file and then project/convert to the requested 
schema. `FlinkRowDataReaderContext#getFileRecordIterator` already follows this 
model: it receives `dataSchema` and constructs the `RowType` from it before 
reading.
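   The read-with-writer-schema-then-project split can be sketched generically 
(plain Java with hypothetical names, no Flink or Hudi types): the writer schema 
is the authority for decoding, and projection to the requested fields happens 
as a separate step afterwards.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;

   // Hypothetical sketch: decode against the writer/table schema, then project
   // to the requested schema, instead of inferring types from the file itself.
   public class SchemaDrivenProjection {
       static List<Object> project(List<String> writerSchema,
                                   Map<String, Object> decodedRecord,
                                   List<String> requestedSchema) {
           List<Object> out = new ArrayList<>();
           for (String field : requestedSchema) {
               // Every requested field must be resolvable against the writer
               // schema; the file's physical layout is never consulted here.
               if (!writerSchema.contains(field)) {
                   throw new IllegalArgumentException("Unknown field: " + field);
               }
               out.add(decodedRecord.get(field));
           }
           return out;
       }

       public static void main(String[] args) {
           List<String> writer = List.of("id", "embedding", "ts");
           Map<String, Object> record = Map.of("id", 7, "embedding", "blob", "ts", 123L);
           System.out.println(project(writer, record, List.of("id", "ts"))); // prints [7, 123]
       }
   }
   ```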
   
   Given that, I would prefer Option A. The rationale is that Hudi's read 
semantics should be schema-driven; Parquet schema conversion should remain a 
generic, best-effort fallback rather than the authority for Hudi logical types.
   

