wombatu-kun commented on code in PR #18375:
URL: https://github.com/apache/hudi/pull/18375#discussion_r3071515342
##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java:
##########
@@ -70,6 +70,22 @@ public String getFileExtension() {
     return extension;
   }
+  /**
+   * Returns true if this file format requires the SPARK record type for reading/writing.
+   * Lance only supports the Spark-native InternalRow representation, not Avro.
+   */
+  public boolean requiresSparkRecordType() {
+    return this == LANCE;
+  }
+
+  /**
Review Comment:
Agreed this was a usability hole. Moved the merger auto-selection into
`HoodieWriteConfig.Builder.setDefaults()`: when the resolved base file format
`requiresSparkRecordType()` and the user hasn't set
`hoodie.write.record.merge.custom.implementation.classes`, we inject
`org.apache.hudi.DefaultSparkRecordMerger` into `RECORD_MERGE_IMPL_CLASSES`.
Doing it at build time (not inside `getRecordMerger()`) means every downstream
consumer — write handles, read handles, metadata writer, etc. — sees a
SPARK-typed merger consistently. Removed the manual
`hoodie.write.record.merge.custom.implementation.classes =
DefaultSparkRecordMerger` override from all 4 Lance test sites. Class name is
referenced as a string so `hudi-client-common` does not take a compile-time dep
on `hudi-spark-client`.
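A minimal sketch of the build-time injection described above. The config key and the merger class name are from the comment; the `FileFormat` enum, `setDefaults` signature, and use of plain `Properties` are simplified stand-ins for the real Hudi types, not the actual `HoodieWriteConfig.Builder` API:

```java
import java.util.Properties;

public class MergerAutoSelectSketch {
  // Key and class name referenced as strings, mirroring the comment's point that
  // hudi-client-common avoids a compile-time dep on hudi-spark-client.
  static final String RECORD_MERGE_IMPL_CLASSES =
      "hoodie.write.record.merge.custom.implementation.classes";
  static final String SPARK_MERGER = "org.apache.hudi.DefaultSparkRecordMerger";

  // Hypothetical stand-in for HoodieFileFormat.
  enum FileFormat {
    PARQUET(false), LANCE(true);
    private final boolean sparkOnly;
    FileFormat(boolean sparkOnly) { this.sparkOnly = sparkOnly; }
    boolean requiresSparkRecordType() { return sparkOnly; }
  }

  /**
   * Mirrors the described setDefaults() behavior: inject the Spark merger class
   * only when the base file format needs it AND the user has not set the key.
   */
  static void setDefaults(Properties props, FileFormat baseFileFormat) {
    if (baseFileFormat.requiresSparkRecordType()
        && !props.containsKey(RECORD_MERGE_IMPL_CLASSES)) {
      props.setProperty(RECORD_MERGE_IMPL_CLASSES, SPARK_MERGER);
    }
  }

  public static void main(String[] args) {
    Properties lanceProps = new Properties();
    setDefaults(lanceProps, FileFormat.LANCE);
    System.out.println(lanceProps.getProperty(RECORD_MERGE_IMPL_CLASSES));
  }
}
```

Because the injection happens once at build time, a user-supplied value always wins, and every consumer of the built config sees the same merger class.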
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/lance/SparkLanceReaderBase.scala:
##########
@@ -76,8 +76,21 @@ class SparkLanceReaderBase(enableVectorizedReader: Boolean) extends SparkColumna
     val filePath = file.filePath.toString
     if (requiredSchema.isEmpty && partitionSchema.isEmpty) {
-      // No columns requested - return empty iterator
-      Iterator.empty
+      // No column data needed (e.g. count() on a non-partitioned table). Open the Lance file to
+      // get the actual row count and return that many empty rows, so Spark computes the correct count.
+      val countAllocator = HoodieArrowAllocator.newChildAllocator(
+        getClass.getSimpleName + "-count-" + file.filePath, HoodieSparkLanceReader.LANCE_DATA_ALLOCATOR_SIZE)
+      try {
+        val lanceReader = LanceFileReader.open(file.filePath.toString, countAllocator)
+        try {
+          val rowCount = Math.toIntExact(lanceReader.numRows())
+          Iterator.fill(rowCount)(InternalRow.empty)
Review Comment:
Switched the count path to a lazy `Iterator[InternalRow]` backed by a `Long`
counter, which removes the ~2.1B row cap entirely: no more
`ArithmeticException` from `toIntExact`. Both `Iterator.fill` and
`Iterator.continually(...).take(...)` take an `Int`, so this needed a small
custom iterator.
On the "full open vs footer-only" question: `LanceFileReader.open(...)`
reads only the file footer/manifest, not column data — `numRows()` comes
straight from metadata. It's analogous to opening a Parquet `ParquetFileReader`
just to get `getRecordCount()` from the footer. There's no cheaper row-count
API exposed by the Lance Java SDK at 1.0.2; if one shows up we can switch.
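The `Long`-backed count iterator can be sketched as follows. The real fix is Scala returning `InternalRow.empty`; this standalone Java analogue shows the shape, with a shared singleton object standing in for the empty row:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Lazily yields `numRows` references to one shared empty-row placeholder.
 * Backed by a long counter, so row counts past Integer.MAX_VALUE (the
 * Iterator.fill / take(Int) limit) work without any toIntExact conversion.
 */
final class EmptyRowIterator implements Iterator<Object> {
  private static final Object EMPTY_ROW = new Object(); // stand-in for InternalRow.empty
  private final long numRows;
  private long emitted = 0L;

  EmptyRowIterator(long numRows) {
    this.numRows = numRows;
  }

  @Override
  public boolean hasNext() {
    return emitted < numRows;
  }

  @Override
  public Object next() {
    if (!hasNext()) {
      throw new NoSuchElementException("all " + numRows + " rows emitted");
    }
    emitted++;              // only the counter advances...
    return EMPTY_ROW;       // ...no per-row allocation, no materialized collection
  }
}
```

Since only a counter and one shared object are held, memory stays constant regardless of the row count, which is what makes the unbounded `Long` range safe.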
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]