wombatu-kun commented on code in PR #18403:
URL: https://github.com/apache/hudi/pull/18403#discussion_r3187027470
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##########
@@ -133,6 +134,24 @@ class HoodieFileGroupReaderBasedFileFormat(tablePath:
String,
}
}
+  /**
+   * Whether the requested schema contains any top-level BLOB columns. Used to disable
+   * Lance batch mode for BLOB tables: the DESCRIPTOR-mode rewrite (and the OUT_OF_LINE
+   * data→null contract) lives only in the row-path BlobDescriptorTransform, and `supportBatch`
+   * cannot inspect read-time options (e.g. `hoodie.read.blob.inline.mode`) since it runs at
+   * planning time. Forcing row mode whenever BLOB columns are present is the simplest correct
+   * gate: BLOB processing is per-row anyway (lazy byte materialization), so the perf delta is
+   * negligible.
+   */
+  private def schemaContainsBlobColumn(schema: StructType): Boolean = {
+    schema.fields.exists { f =>
+      val md = f.metadata
+      md != null && md.contains(HoodieSchema.TYPE_METADATA_FIELD) &&
+        HoodieSchema.parseTypeDescriptor(md.getString(HoodieSchema.TYPE_METADATA_FIELD))
+          .getType == HoodieSchemaType.BLOB
+    }
+  }
+
/**
* Checks if the file format supports vectorized reading; please refer to SPARK-40918.
*
Review Comment:
Done in fe8f7e1c2a56. `lanceBatchSupported = !schemaContainsBlobColumn(schema) && !internalSchemaOpt.isPresent`. The doc comment above is rewritten to call out both triggers (DESCRIPTOR blob mode and implicit type changes via internal-schema evolution) and to explain that Spark's `ColumnarToRowExec` would throw a `ClassCastException` on the row-path iterator if either runtime fallback fired after the planner had committed to columnar output.
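To make the gate's behavior concrete, here is a minimal self-contained sketch of the planning-time decision, outside of Spark. `Field`, `TypeMetadataField`, and the simplified `lanceBatchSupported` signature below are hypothetical stand-ins for the real Spark `StructField` metadata and Hudi types, not the actual implementation:

```scala
// Hypothetical sketch of the planning-time batch gate: columnar (batch)
// reading is allowed only when neither runtime row-path fallback can fire.
object BatchGateSketch {
  // Simplified stand-in for StructField + its metadata map.
  case class Field(name: String, metadata: Map[String, String])

  // Assumed metadata key name, standing in for HoodieSchema.TYPE_METADATA_FIELD.
  val TypeMetadataField = "hoodie.schema.type"

  // Mirrors schemaContainsBlobColumn: any top-level field tagged BLOB?
  def containsBlobColumn(fields: Seq[Field]): Boolean =
    fields.exists(_.metadata.get(TypeMetadataField).contains("BLOB"))

  // Mirrors the reviewed expression: batch mode only if there are no BLOB
  // columns AND no internal-schema evolution is in play.
  def lanceBatchSupported(fields: Seq[Field], internalSchemaOpt: Option[Any]): Boolean =
    !containsBlobColumn(fields) && internalSchemaOpt.isEmpty

  def main(args: Array[String]): Unit = {
    val plain = Seq(Field("id", Map.empty), Field("payload", Map.empty))
    val blob  = Seq(Field("id", Map.empty), Field("img", Map(TypeMetadataField -> "BLOB")))
    println(lanceBatchSupported(plain, None))        // true: columnar path is safe
    println(lanceBatchSupported(blob, None))         // false: BLOB forces row mode
    println(lanceBatchSupported(plain, Some("evo"))) // false: schema evolution forces row mode
  }
}
```

The point of keeping both checks in the planner is exactly what the comment says: once `supportBatch` returns true, downstream operators assume `ColumnarBatch` output, so any row-path fallback must be ruled out before planning completes.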
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]