yihua commented on code in PR #12056:
URL: https://github.com/apache/hudi/pull/12056#discussion_r1792332106
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestNewHoodieParquetFileFormat.java:
##########
@@ -114,25 +117,42 @@ protected void runIndividualComparison(String tableBasePath) {
  }

  protected void runIndividualComparison(String tableBasePath, String firstColumn, String... columns) {
- Dataset<Row> legacyDf = sparkSession.read().format("hudi")
- .option(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "false")
- .load(tableBasePath);
- Dataset<Row> fileFormatDf = sparkSession.read().format("hudi")
- .option(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "true")
- .load(tableBasePath);
- if (firstColumn.isEmpty()) {
- //df.except(df) does not work with map type cols
- legacyDf = legacyDf.drop("city_to_state");
- fileFormatDf = fileFormatDf.drop("city_to_state");
- } else {
- if (columns.length > 0) {
- legacyDf = legacyDf.select(firstColumn, columns);
- fileFormatDf = fileFormatDf.select(firstColumn, columns);
+ List<String> queryTypes = new ArrayList<>();
Review Comment:
Do we have tests covering RO queries on bootstrapped MOR tables with updates in
log files? Changing the validation alone may not be enough; we need to cover
such cases specifically (we might need to add preconditions checking that the MOR
table actually has updates in log files for such cases).
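The precondition suggested above could be sketched as below. This is a plain-Java model for illustration only: `FileSliceInfo` and `hasLogFileUpdates` are hypothetical stand-ins, not Hudi's actual `FileSlice` API or the test's helpers.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a file slice: tracks only its log-file count.
class FileSliceInfo {
    final String fileId;
    final int logFileCount;

    FileSliceInfo(String fileId, int logFileCount) {
        this.fileId = fileId;
        this.logFileCount = logFileCount;
    }
}

public class MorPreconditionSketch {
    // True iff at least one file slice carries updates in log files,
    // i.e. the MOR table actually exercises the log-merging path.
    static boolean hasLogFileUpdates(List<FileSliceInfo> slices) {
        return slices.stream().anyMatch(s -> s.logFileCount > 0);
    }

    public static void main(String[] args) {
        List<FileSliceInfo> slices = Arrays.asList(
            new FileSliceInfo("fg-1", 0),
            new FileSliceInfo("fg-2", 2)); // fg-2 has updates in log files
        // Fail fast if the fixture does not cover the case under test.
        if (!hasLogFileUpdates(slices)) {
            throw new IllegalStateException(
                "test table has no log-file updates; RO-query case not covered");
        }
        System.out.println("precondition holds");
    }
}
```

The point of failing fast is that an RO-vs-snapshot comparison on a MOR table with empty log files would pass trivially without exercising the code path under review.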
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -183,9 +183,9 @@ case class HoodieFileIndex(spark: SparkSession,
      }).filter(slice => slice != null)
        .map(fileInfo => new FileStatus(fileInfo.getLength, fileInfo.isDirectory, 0, fileInfo.getBlockSize, fileInfo.getModificationTime, new Path(fileInfo.getPath.toUri)))
-    val c = fileSlices.filter(f => (includeLogFiles && f.getLogFiles.findAny().isPresent)
-      || (f.getBaseFile.isPresent && f.getBaseFile.get().getBootstrapBaseFile.isPresent)).
-      foldLeft(Map[String, FileSlice]()) { (m, f) => m + (f.getFileId -> f) }
+    val c = fileSlices.filter(f => f.hasBootstrapBase || (includeLogFiles && f.hasLogFiles))
Review Comment:
Is this problem Spark-only, or does it affect the file index in general? Is this
file index used only by Spark?
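For reference, the diff above only swaps in helper methods and reorders the disjuncts; the two predicate forms select the same slices. A minimal sketch checking this, using a hypothetical `Slice` model rather than Hudi's `FileSlice`:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal model of a file slice, just enough for the predicate.
class Slice {
    final boolean hasLogFiles;
    final boolean hasBootstrapBase;

    Slice(boolean hasLogFiles, boolean hasBootstrapBase) {
        this.hasLogFiles = hasLogFiles;
        this.hasBootstrapBase = hasBootstrapBase;
    }
}

public class PredicateEquivalence {
    // Old form: log-file check first, then bootstrap-base check.
    static boolean oldPredicate(Slice s, boolean includeLogFiles) {
        return (includeLogFiles && s.hasLogFiles) || s.hasBootstrapBase;
    }

    // New form from the diff: bootstrap-base check first.
    static boolean newPredicate(Slice s, boolean includeLogFiles) {
        return s.hasBootstrapBase || (includeLogFiles && s.hasLogFiles);
    }

    public static void main(String[] args) {
        List<Slice> slices = Arrays.asList(
            new Slice(true, false), new Slice(false, true),
            new Slice(true, true), new Slice(false, false));
        // Exhaustively compare both forms over all flag/slice combinations.
        for (boolean includeLogFiles : new boolean[] {true, false}) {
            for (Slice s : slices) {
                if (oldPredicate(s, includeLogFiles) != newPredicate(s, includeLogFiles)) {
                    throw new AssertionError("predicates diverge");
                }
            }
        }
        System.out.println("predicates agree on all cases");
    }
}
```

So the behavioral question in this comment is about where the predicate runs (Spark-only file index or a shared one), not about the predicate rewrite itself.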
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]