jonvex commented on code in PR #10137:
URL: https://github.com/apache/hudi/pull/10137#discussion_r1408151212
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala:
##########
@@ -62,7 +63,7 @@ class SparkFileFormatInternalRowReaderContext(baseFileReader:
Option[Partitioned
requiredSchema: Schema,
conf: Configuration):
ClosableIterator[InternalRow] = {
val fileInfo = sparkAdapter.getSparkPartitionedFileUtils
- .createPartitionedFile(partitionValues, filePath, start, length)
+ .createPartitionedFile(InternalRow.empty, filePath, start, length)
Review Comment:
Ok so part of this PR is to clean up the philosophy of the file group (fg) reader.
We want to give the fg reader a requested schema, and the output should be an
iterator of records in exactly that schema (CDC may be an exception; I haven't
given it much thought yet).
The Spark parquet file reader appends the partition values to the end of
each record. Putting the logic for handling that inside the fg reader
adds complexity to the fg reader that is probably only relevant
for Spark.
Therefore I think it makes sense for
HoodieFileGroupReaderBasedParquetFileFormat to be responsible for appending the
partition columns at the end. Besides being better organized in my
opinion, this is also more performant: it avoids repeated
calls that append the partition columns and then project them away.
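For illustration, the appending step the comment describes could live in the file format, done once per file after the fg reader hands back rows in the requested schema. This is a hedged sketch, not code from this PR: the helper name `appendPartitionValues` is hypothetical, and it uses Spark's `JoinedRow` plus `UnsafeProjection`, which is how Spark's own file formats attach partition values to data rows.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: append the (constant) partition values of a file
// to every data row produced for that file. Doing this once in the file
// format avoids repeated append-then-project work inside the fg reader.
def appendPartitionValues(
    dataRows: Iterator[InternalRow],
    dataSchema: StructType,
    partitionSchema: StructType,
    partitionValues: InternalRow): Iterator[InternalRow] = {
  if (partitionSchema.isEmpty) {
    dataRows
  } else {
    // Project (dataRow ++ partitionValues) into a single unsafe row.
    val fullSchema = dataSchema ++ partitionSchema
    val appendProjection =
      UnsafeProjection.create(fullSchema.map(_.dataType).toArray)
    val joinedRow = new JoinedRow()
    dataRows.map(row => appendProjection(joinedRow(row, partitionValues)))
  }
}
```

Under this split, the fg reader's contract stays clean (records in exactly the requested schema), and Spark-specific partition handling stays in the Spark layer.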
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]