wombatu-kun commented on code in PR #18403:
URL: https://github.com/apache/hudi/pull/18403#discussion_r3187019539


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##########
@@ -137,69 +149,52 @@ class SparkLanceReaderBase(enableVectorizedReader: Boolean) extends SparkColumna
         // the option regardless.
         val blobMode = resolveBlobReadMode(storageConf)
         val readOpts = FileReadOptions.builder().blobReadMode(blobMode).build()
-        val arrowReader = lanceReader.readAll(columnNames, null, DEFAULT_BATCH_SIZE, readOpts)
-
-        // Compose the DESCRIPTOR-aware blob transform only when the user opted into that mode
-        // AND the request actually has BLOB columns (otherwise the rewrite has nothing to do).
-        val blobFieldNames: java.util.Set[String] =
-          iteratorSchema.fields.collect { case f if isBlobField(f) => f.name }.toSet.asJava
-        val blobTransform = if (blobMode == BlobReadMode.DESCRIPTOR && !blobFieldNames.isEmpty) {
-          new BlobDescriptorTransform(blobFieldNames, filePath)
+        arrowReader = lanceReader.readAll(columnNames, null, DEFAULT_BATCH_SIZE, readOpts)
+
+        // Decide between batch mode and row mode.
+        // Fall back to row mode if:
+        //   - type casting is needed (batch-level type casting deferred to follow-up), OR
+        //   - the partition schema contains a type the batch-mode partition-vector populator
+        //     does not handle (Struct/Array/Map/Char/Varchar/interval, etc.). The row path
+        //     preserves these via JoinedRow, so falling back avoids silently nulling them out.
+        val hasTypeChanges = !implicitTypeChangeInfo.isEmpty
+        val partitionTypesBatchSupported =
+          partitionSchema.forall(f => isPartitionTypeSupportedForBatch(f.dataType))
+        if (enableVectorizedReader && !hasTypeChanges && partitionTypesBatchSupported) {

Review Comment:
   Fixed in fe8f7e1c2a56. Moved the gate to plan-time: `lanceBatchSupported` in 
`HoodieFileGroupReaderBasedFileFormat.supportBatch` now also checks 
`!internalSchemaOpt.isPresent`, so when the table carries an internal schema 
(the only way `hasTypeChanges` can fire) the planner will not commit to 
ColumnarBatch output. The runtime row branch in `SparkLanceReaderBase.read` is 
preserved for direct-reader callers; `testTypeChangeFallsBackToRowPath` 
continues to exercise it on the unit path.
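
   The batch-vs-row gate (whether checked at plan-time in `supportBatch` or at runtime in `read`) boils down to a single predicate. A minimal standalone sketch, using stand-in `DataType` cases and a hypothetical `useBatchMode` helper rather than the actual Hudi/Spark APIs:

```scala
// Sketch only: the DataType stand-ins below are illustrative,
// not Spark's real type hierarchy.
object LanceBatchGateSketch {
  sealed trait DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  final case class ArrayType(element: DataType) extends DataType
  final case class MapType(key: DataType, value: DataType) extends DataType

  // In this sketch only flat primitive partition types are batch-populatable;
  // nested types force the row path, mirroring the fallback comment above.
  def isPartitionTypeSupportedForBatch(dt: DataType): Boolean = dt match {
    case IntegerType | StringType  => true
    case _: ArrayType | _: MapType => false
  }

  // Batch mode only when vectorized reading is enabled, no implicit type
  // changes are pending, and every partition column survives batch population.
  def useBatchMode(enableVectorizedReader: Boolean,
                   hasTypeChanges: Boolean,
                   partitionTypes: Seq[DataType]): Boolean =
    enableVectorizedReader && !hasTypeChanges &&
      partitionTypes.forall(isPartitionTypeSupportedForBatch)
}
```

   Evaluating this predicate at plan-time means the planner never promises ColumnarBatch output it cannot deliver, while the runtime branch remains a safety net for direct-reader callers.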



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
