danny0405 commented on code in PR #13572:
URL: https://github.com/apache/hudi/pull/13572#discussion_r2233600790
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##########
@@ -266,6 +271,24 @@ class SparkFileFormatInternalRowReaderContext(parquetFileReader: SparkParquetRea
     }.asInstanceOf[ClosableIterator[InternalRow]]
   }
 }
+
+  override def getDataFileSchema(filePath: StoragePath, storage: HoodieStorage): Schema = {
+    val configuration = storageConfiguration.asInstanceOf[StorageConfiguration[Configuration]].unwrap()
+    if (configuration.get(AvroSchemaConverter.ADD_LIST_ELEMENT_RECORDS) == null) {
+      configuration.set(AvroSchemaConverter.ADD_LIST_ELEMENT_RECORDS, "false")
+    }
+    val path = HadoopFSUtils.convertToHadoopPath(filePath)
+    val readOptions = HadoopReadOptions.builder(configuration, path)
+      .withMetadataFilter(ParquetMetadataConverter.SKIP_ROW_GROUPS).build
+    val inputFile = HadoopInputFile.fromPath(path, configuration)
+    try {
+      val fileReader = ParquetFileReader.open(inputFile, readOptions)
+      try {
+        val footer = fileReader.getFooter
+        new AvroSchemaConverter(configuration).convert(footer.getFileMetaData.getSchema)
Review Comment:
I'm curious about the scope of the fix. If the patch is only meant to fix the
behavior when `hoodie.schema.on.read.enable` is enabled, we should always read
from the internal schema instead, which is more efficient.
If the fix targets scenarios where `hoodie.schema.on.read.enable` is false, we
should at least introduce a new option to control whether the schema is fetched
via data file IO.
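
The suggestion could take roughly the following shape. This is only a sketch:
the option name `hoodie.read.data.file.schema.enable` and the helpers
`internalSchemaOpt`, `tableSchema`, and `readSchemaFromParquetFooter` are
hypothetical placeholders, not existing Hudi APIs:

```scala
// Sketch only: option name and helper methods below are hypothetical.
override def getDataFileSchema(filePath: StoragePath, storage: HoodieStorage): Schema = {
  if (schemaOnReadEnabled) {
    // schema.on.read enabled: prefer the internal schema and avoid
    // a per-file parquet footer read entirely.
    internalSchemaOpt
      .map(is => AvroInternalSchemaConverter.convert(is, tableName))
      .getOrElse(readSchemaFromParquetFooter(filePath))
  } else if (configuration.getBoolean("hoodie.read.data.file.schema.enable", false)) {
    // schema.on.read disabled: footer reads are opt-in via a new option,
    // since they add one metadata IO per data file.
    readSchemaFromParquetFooter(filePath)
  } else {
    // Default: fall back to the table-level schema resolved once per query.
    tableSchema
  }
}
```

Gating the footer read this way keeps the default read path free of extra
per-file IO while still allowing the fix to be enabled where it is needed.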
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]