codope commented on code in PR #9819:
URL: https://github.com/apache/hudi/pull/9819#discussion_r1345533536
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/NewHoodieParquetFileFormat.scala:
##########
@@ -155,9 +160,22 @@ class NewHoodieParquetFileFormat(tableState: Broadcast[HoodieTableState],
           }
         } else {
           if (logFiles.nonEmpty) {
-            val baseFile = createPartitionedFile(InternalRow.empty, hoodieBaseFile.getHadoopPath, 0, hoodieBaseFile.getFileLen)
-            buildMergeOnReadIterator(preMergeBaseFileReader(baseFile), logFiles, filePath.getParent, requiredSchemaWithMandatory,
-              requiredSchemaWithMandatory, outputSchema, partitionSchema, partitionValues, broadcastedHadoopConf.value.value)
+            val hoodieReaderContext = new SparkFileFormatInternalRowReaderContext(
+              sparkSession, this, broadcastedHadoopConf.value.value)
+            val logFilesAsJava = logFiles.toStream.map(lf => lf.toString).toList.asJava
+            val reader = new HoodieFileGroupReader[InternalRow](
+              hoodieReaderContext.asInstanceOf[HoodieReaderContext[InternalRow]],
+              broadcastedHadoopConf.value.value,
+              tableState.value.tablePath,
+              FSUtils.getCommitTime(logFilesAsJava.get(0)),
Review Comment:
This might need to change now that the log file carries the deltacommit time
instead of the base instant time.
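To illustrate the concern, here is a minimal sketch. It assumes log file names follow the layout `.<fileId>_<instant>.log.<version>_<writeToken>`; the class and method names are hypothetical, and this is not the implementation of `FSUtils.getCommitTime`. A parser can only return whatever instant the writer embedded in the name, so a caller that treats the result as the base instant time would silently receive a deltacommit time if the naming convention changes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogFileInstantSketch {
    // Assumed layout (not taken from the PR): .<fileId>_<instant>.log.<version>_<writeToken>
    private static final Pattern LOG_FILE =
        Pattern.compile("^\\.(.+)_(\\d+)\\.log\\.(\\d+)(?:_(.+))?$");

    /**
     * Returns the instant embedded in a log file name, or empty if the name
     * does not match the assumed layout. The name alone cannot tell us whether
     * that instant is a base instant time or a deltacommit time.
     */
    static Optional<String> instantFromLogFileName(String name) {
        Matcher m = LOG_FILE.matcher(name);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }
}
```

If the semantics do change, the call site would probably need to take the base instant from the base file or file slice rather than from the first log file name.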
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieFileGroupReader.java:
##########
@@ -63,7 +64,7 @@
  * @param <T> The type of engine-specific record representation, e.g.,{@code InternalRow}
  * in Spark and {@code RowData} in Flink.
*/
-public final class HoodieFileGroupReader<T> implements Closeable {
+public final class HoodieFileGroupReader<T> implements Closeable, Iterator {
Review Comment:
Should the reader implement `Iterator`? We could simply keep the list of
merged records here and get an iterator from that list at the call site. If
you do plan to make it an iterator, then implement `ClosableIterator` and
rename the class to `HoodieFileGroupIterator`.
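A rough sketch of the second option, under stated assumptions: `CloseableRecordIterator` is a hypothetical stand-in for Hudi's `ClosableIterator`, and `MergedRecordIterator` is an illustrative name, not the class in this PR. The aim is only to show an iterator that is typed on `T` and releases its resources on `close()`.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class FileGroupIteratorSketch {
    /** Hypothetical stand-in for ClosableIterator: an Iterator that can also be closed. */
    interface CloseableRecordIterator<T> extends Iterator<T>, AutoCloseable {
        @Override
        void close(); // narrowed so callers need not handle a checked exception
    }

    /** Illustrative iterator over already-merged records; not the PR's implementation. */
    static final class MergedRecordIterator<T> implements CloseableRecordIterator<T> {
        private final Iterator<T> delegate;
        private boolean closed = false;

        MergedRecordIterator(List<T> mergedRecords) {
            this.delegate = mergedRecords.iterator();
        }

        @Override
        public boolean hasNext() {
            // A closed iterator yields nothing further.
            return !closed && delegate.hasNext();
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return delegate.next();
        }

        @Override
        public void close() {
            // A real reader would release file handles and buffers here.
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }
}
```

With this shape the call site can use try-with-resources, so the underlying readers are closed even if iteration stops early.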
--
This is an automated message from the Apache Git Service.