jonvex commented on code in PR #13127:
URL: https://github.com/apache/hudi/pull/13127#discussion_r2104802087
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala:
##########
@@ -81,17 +81,31 @@ class
HoodieFileGroupReaderBasedParquetFileFormat(tablePath: String,
private val sanitizedTableName =
AvroSchemaUtils.getAvroRecordQualifiedName(tableName)
/**
- * Support batch needs to remain consistent, even if one side of a bootstrap
merge can support
- * while the other side can't
+ * Support vectorized reading
*/
private var supportBatchCalled = false
+
+ /**
+ * Support return result as batch
+ */
private var supportBatchResult = false
+ /**
+ * Check if the file format supports vectorized reading, please refer to
SPARK-40918
+ *
+ * NOTE: for mor read, even for file-slice with only base file, we can read
parquet file with vectorized, but the return result of the whole
data-source-scan phase cannot be batch,
+ * because when there are any log file in a file slice, it needs to be read
by the file group reader.
+ * Since we are currently performing merges based on rows, the result
returned at this time is also based on rows,
+ * we cannot assume that all file slices have only base files.
+ * So we need to set the batch result back to false
+ *
+ */
override def supportBatch(sparkSession: SparkSession, schema: StructType):
Boolean = {
- if (!supportBatchCalled || supportBatchResult) {
- supportBatchCalled = true
- supportBatchResult = !isMOR && !isIncremental && !isBootstrap &&
super.supportBatch(sparkSession, schema)
- }
+ val superSupportBatch = super.supportBatch(sparkSession, schema)
+ supportBatchCalled = !isIncremental && !isBootstrap && superSupportBatch
Review Comment:
Yeah, the reason I had supportBatchCalled was because I was seeing that
supportBatch() would sometimes switch between true and false and then the
caller would try to do an illegal cast. So this was to prevent that from
happening. It seems like now we should change the variable name to something
else?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]