Re: [PR] [HUDI-9302] Enable vectorized reading for file slice without log file [hudi]

via GitHub Fri, 23 May 2025 08:12:04 -0700


jonvex commented on code in PR #13127:
URL: https://github.com/apache/hudi/pull/13127#discussion_r2104802087



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##########
@@ -81,17 +81,31 @@ class 
HoodieFileGroupReaderBasedParquetFileFormat(tablePath: String,
   private val sanitizedTableName = 
AvroSchemaUtils.getAvroRecordQualifiedName(tableName)
 
   /**
-   * Support batch needs to remain consistent, even if one side of a bootstrap 
merge can support
-   * while the other side can't
+   * Support vectorized reading
    */
   private var supportBatchCalled = false
+
+  /**
+   * Support return result as batch
+   */
   private var supportBatchResult = false
 
+  /**
+   * Check if the file format supports vectorized reading, please refer to 
SPARK-40918
+   *
+   * NOTE: for mor read, even for file-slice with only base file, we can read 
parquet file with vectorized, but the return result of the whole 
data-source-scan phase cannot be batch,
+   * because when there are any log file in a file slice, it needs to be read 
by the file group reader.
+   * Since we are currently performing merges based on rows, the result 
returned at this time is also based on rows,
+   * we cannot assume that all file slices have only base files.
+   * So we need to set the batch result back to false
+   *
+   */
   override def supportBatch(sparkSession: SparkSession, schema: StructType): 
Boolean = {
-    if (!supportBatchCalled || supportBatchResult) {
-      supportBatchCalled = true
-      supportBatchResult = !isMOR && !isIncremental && !isBootstrap && 
super.supportBatch(sparkSession, schema)
-    }
+    val superSupportBatch = super.supportBatch(sparkSession, schema)
+    supportBatchCalled = !isIncremental && !isBootstrap && superSupportBatch

Review Comment:
   Yeah, the reason I had supportBatchCalled was because I was seeing that 
supportBatch() would sometimes switch between true and false and then the 
caller would try to do an illegal cast. So this was to prevent that from 
happening. It seems like now we should change the variable name to something 
else?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [HUDI-9302] Enable vectorized reading for file slice without log file [hudi]

Reply via email to