jonvex commented on code in PR #11770:
URL: https://github.com/apache/hudi/pull/11770#discussion_r1744205574
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##########
@@ -231,4 +270,32 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tableState: HoodieTableState,
       override def close(): Unit = closeableFileGroupRecordIterator.close()
     }
   }
+
+  private def readBaseFile(file: PartitionedFile, parquetFileReader: SparkParquetReader, requestedSchema: StructType,
+                           remainingPartitionSchema: StructType, fixedPartitionIndexes: Set[Int], requiredSchema: StructType,
+                           partitionSchema: StructType, outputSchema: StructType, filters: Seq[Filter],
+                           storageConf: StorageConfiguration[Configuration]): Iterator[InternalRow] = {
+    if (remainingPartitionSchema.fields.length == partitionSchema.fields.length) {
Review Comment:
I tested with TestSparkSqlWithCustomKeyGenerator by changing
partitionColumnsToRead in HoodieHadoopFsRelationFactory by doing:
```scala
//TODO: [HUDI-8098] filter for timestamp keygen columns when using custom keygen
tableConfig.getPartitionFields.orElse(Array.empty).filter(p => p == "ts").toSeq
```
to fake what [HUDI-8036] + [HUDI-8098] will do. This exposed a case that I
hadn't tested: for MOR with log files, where we read some, but not all, of the
partition columns, the partition values were not being appended correctly. I
have updated
HoodieFileGroupReaderBasedParquetFileFormat.appendPartitionAndProject to
handle this correctly now.
I feel like appendPartitionAndProject and readBaseFile have overlapping
logic, but I can't think of a better way to do this for now.
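To make the "appending and projecting" step concrete, here is a minimal standalone sketch of the idea, not Hudi's actual implementation: rows are plain `Seq[Any]` and schemas are just field-name lists (`appendPartitionAndProjectSketch`, `dataFields`, `partitionFields`, `outputFields` are all hypothetical names for illustration). The point is that the partition values missing from the read row get appended first, and only then is the row projected to the requested output schema:

```scala
// Hypothetical sketch only; Hudi's real code operates on InternalRow/StructType.
object AppendPartitionAndProjectSketch {
  type Row = Seq[Any]

  // Append the partition values the reader did not materialize, then
  // reorder fields to match the requested output schema.
  def appendPartitionAndProject(row: Row,
                                dataFields: Seq[String],
                                partitionFields: Seq[String],
                                partitionValues: Row,
                                outputFields: Seq[String]): Row = {
    val combinedFields = dataFields ++ partitionFields
    val combined = row ++ partitionValues
    val index = combinedFields.zipWithIndex.toMap
    // Projection: look up each output field's position in the combined row.
    outputFields.map(f => combined(index(f)))
  }

  def main(args: Array[String]): Unit = {
    // A row read without its "ts" partition column, projected to uuid, ts, value.
    val projected = appendPartitionAndProject(
      row = Seq("id-1", 42),
      dataFields = Seq("uuid", "value"),
      partitionFields = Seq("ts"),
      partitionValues = Seq("2024-01-01"),
      outputFields = Seq("uuid", "ts", "value"))
    println(projected.mkString(","))
  }
}
```

This also shows why the two paths overlap: both need the same "combine then reorder" bookkeeping, just with different inputs.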
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]