wombatu-kun commented on code in PR #18375:
URL: https://github.com/apache/hudi/pull/18375#discussion_r3071515342
##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##########
@@ -70,6 +70,22 @@ public String getFileExtension() {
     return extension;
   }
+  /**
+   * Returns true if this file format requires the SPARK record type for reading/writing.
+   * Lance only supports the Spark-native InternalRow representation, not Avro.
+   */
+  public boolean requiresSparkRecordType() {
+    return this == LANCE;
+  }
+
+  /**
Review Comment:
Agreed this was a usability hole. Moved the merger auto-selection into
`HoodieWriteConfig.Builder.setDefaults()`: when the resolved base file format
`requiresSparkRecordType()` and the user hasn't set
`hoodie.write.record.merge.custom.implementation.classes`, we inject
`org.apache.hudi.DefaultSparkRecordMerger` into `RECORD_MERGE_IMPL_CLASSES`.
Doing it at build time (not inside `getRecordMerger()`) means every downstream
consumer — write handles, read handles, metadata writer, etc. — sees a
SPARK-typed merger consistently. Removed the manual
`hoodie.write.record.merge.custom.implementation.classes =
DefaultSparkRecordMerger` override from all 4 Lance test sites. Class name is
referenced as a string so `hudi-client-common` does not take a compile-time dep
on `hudi-spark-client`.
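A minimal sketch of the build-time injection described above. The config key and the merger class name are from the comment; the `FileFormat` enum, `setDefaults` signature, and use of plain `Properties` are simplified stand-ins for the real Hudi types, not the actual `HoodieWriteConfig.Builder` API:

```java
import java.util.Properties;

public class MergerAutoSelectSketch {
  // Key and class name referenced as strings, mirroring the comment's point that
  // hudi-client-common avoids a compile-time dep on hudi-spark-client.
  static final String RECORD_MERGE_IMPL_CLASSES =
      "hoodie.write.record.merge.custom.implementation.classes";
  static final String SPARK_MERGER = "org.apache.hudi.DefaultSparkRecordMerger";

  // Hypothetical stand-in for HoodieFileFormat.
  enum FileFormat {
    PARQUET(false), LANCE(true);
    private final boolean sparkOnly;
    FileFormat(boolean sparkOnly) { this.sparkOnly = sparkOnly; }
    boolean requiresSparkRecordType() { return sparkOnly; }
  }

  /**
   * Mirrors the described setDefaults() behavior: inject the Spark merger class
   * only when the base file format needs it AND the user has not set the key.
   */
  static void setDefaults(Properties props, FileFormat baseFileFormat) {
    if (baseFileFormat.requiresSparkRecordType()
        && !props.containsKey(RECORD_MERGE_IMPL_CLASSES)) {
      props.setProperty(RECORD_MERGE_IMPL_CLASSES, SPARK_MERGER);
    }
  }

  public static void main(String[] args) {
    Properties lanceProps = new Properties();
    setDefaults(lanceProps, FileFormat.LANCE);
    System.out.println(lanceProps.getProperty(RECORD_MERGE_IMPL_CLASSES));
  }
}
```

Because the injection happens once at build time, a user-supplied value always wins, and every consumer of the built config sees the same merger class.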
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##########
@@ -76,8 +76,21 @@ class SparkLanceReaderBase(enableVectorizedReader: Boolean) extends SparkColumna
     val filePath = file.filePath.toString
     if (requiredSchema.isEmpty && partitionSchema.isEmpty) {
-      // No columns requested - return empty iterator
-      Iterator.empty
+      // No column data needed (e.g. count() on a non-partitioned table). Open the Lance file to
+      // get the actual row count and return that many empty rows, so Spark computes the correct count.
+      val countAllocator = HoodieArrowAllocator.newChildAllocator(
+        getClass.getSimpleName + "-count-" + file.filePath, HoodieSparkLanceReader.LANCE_DATA_ALLOCATOR_SIZE)
+      try {
+        val lanceReader = LanceFileReader.open(file.filePath.toString, countAllocator)
+        try {
+          val rowCount = Math.toIntExact(lanceReader.numRows())
+          Iterator.fill(rowCount)(InternalRow.empty)
Review Comment:
Switched the count path to a lazy `Iterator[InternalRow]` backed by a `Long`
counter, which removes the ~2.1B row cap entirely: no more
`ArithmeticException` from `toIntExact`. Both `Iterator.fill` and
`Iterator.continually(...).take(...)` take an `Int`, so this needed a small
custom iterator.
On the "full open vs footer-only" question: `LanceFileReader.open(...)`
reads only the file footer/manifest, not column data — `numRows()` comes
straight from metadata. It's analogous to opening a Parquet `ParquetFileReader`
just to get `getRecordCount()` from the footer. There's no cheaper row-count
API exposed by the Lance Java SDK at 1.0.2; if one shows up we can switch.
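The `Long`-backed count iterator can be sketched as follows. The real fix is Scala returning `InternalRow.empty`; this standalone Java analogue shows the shape, with a shared singleton object standing in for the empty row:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Lazily yields `numRows` references to one shared empty-row placeholder.
 * Backed by a long counter, so row counts past Integer.MAX_VALUE (the
 * Iterator.fill / take(Int) limit) work without any toIntExact conversion.
 */
final class EmptyRowIterator implements Iterator<Object> {
  private static final Object EMPTY_ROW = new Object(); // stand-in for InternalRow.empty
  private final long numRows;
  private long emitted = 0L;

  EmptyRowIterator(long numRows) {
    this.numRows = numRows;
  }

  @Override
  public boolean hasNext() {
    return emitted < numRows;
  }

  @Override
  public Object next() {
    if (!hasNext()) {
      throw new NoSuchElementException("all " + numRows + " rows emitted");
    }
    emitted++;              // only the counter advances...
    return EMPTY_ROW;       // ...no per-row allocation, no materialized collection
  }
}
```

Since only a counter and one shared object are held, memory stays constant regardless of the row count, which is what makes the unbounded `Long` range safe.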
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]