rahil-c opened a new issue, #18727:
URL: https://github.com/apache/hudi/issues/18727
## TL;DR
`SELECT COUNT(*) FROM <lance-backed hudi table>` fails with:
```
Lance batch column count 14 does not match expected Spark schema size 0 for file: .../category=Abyssinian/....lance
  at org.apache.hudi.io.storage.LanceRecordIterator.hasNext(LanceRecordIterator.java:124)
```
Any query shape that triggers Spark's "no columns needed, just count rows"
optimization (`COUNT(*)`, `EXISTS`, `CREATE TABLE AS SELECT 1 FROM ...`) blows
up on a Lance-backed Hudi table. Parquet-backed tables work fine.
## Why it happens
[`LanceRecordIterator.java:122-127`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java#L122-L127)
has a strict equality check when building `ColumnVector[]`:
```java
StructField[] sparkFields = sparkSchema.fields();
if (sparkFields.length != fieldVectors.size()) {
  throw new HoodieException("Lance batch column count " + fieldVectors.size()
      + " does not match expected Spark schema size " + sparkFields.length +
      ...);
}
```
When Spark's optimizer prunes all columns for an aggregate-only read
(`COUNT`, `EXISTS`), the request arrives with `sparkSchema.fields().length ==
0`, but the Lance file's batch always has the full column set. The reader sees
`0 != 14` and throws.
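
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the real Hudi code: `StrictColumnCheckSketch` and `checkColumnCount` are hypothetical names, plain string lists stand in for `StructField[]` and the Arrow field vectors, and `IllegalStateException` stands in for `HoodieException`.

```java
import java.util.List;

/** Simplified model of the strict equality check (hypothetical class, not part of Hudi). */
final class StrictColumnCheckSketch {

    /**
     * Throws when the number of requested Spark fields differs from the
     * number of columns in the Lance batch, mirroring the check quoted above.
     */
    static void checkColumnCount(List<String> requestedSparkFields, List<String> lanceBatchColumns) {
        if (requestedSparkFields.size() != lanceBatchColumns.size()) {
            throw new IllegalStateException("Lance batch column count " + lanceBatchColumns.size()
                + " does not match expected Spark schema size " + requestedSparkFields.size());
        }
    }
}
```

Calling `checkColumnCount` with an empty requested-field list and a 14-column batch throws, which models what a fully pruned `COUNT(*)` read runs into.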
The Parquet reader handles this case: `ParquetFileFormat` has a
zero-column fast path that yields N empty rows (where N is the row
count), so the aggregate can count them without reading any data. Lance needs
the equivalent.
## Workaround
Use `COUNT(<named_col>)` instead of `COUNT(*)`. When `<named_col>` is a
non-null column such as the primary key, the two are semantically equivalent,
but the former forces Spark to request one column, which satisfies the check.
## Proposed fix
In `LanceRecordIterator.hasNext()`:
- If `sparkSchema.fields().length == 0`, skip the `ColumnVector[]` build
entirely.
- Still call `arrowReader.loadNextBatch()` to advance, and yield empty rows
matching the Arrow `VectorSchemaRoot.getRowCount()` so downstream count
aggregators work.
- Add a test in
[`TestLanceDataSource.scala`](https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala)
exercising `spark.sql("SELECT COUNT(*) FROM …")` over a Lance-backed table and
`df.count()` on the same.
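
The zero-column path described above could look roughly like the sketch below. It is self-contained and makes several assumptions: `BatchSource` is a hypothetical stand-in for the Arrow reader (its two methods model `loadNextBatch()` and `VectorSchemaRoot.getRowCount()`), `ZeroColumnRowIterator` and `FixedBatchSource` are illustrative names, and a zero-length `Object[]` stands in for an empty Spark row. The actual change would live inside `LanceRecordIterator.hasNext()` rather than in a separate class.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for the Arrow reader: advances batch by batch and reports row counts. */
interface BatchSource {
    boolean loadNextBatch();   // true if another batch was loaded
    int currentRowCount();     // number of rows in the currently loaded batch
}

/** Yields one empty row per stored row without building any column vectors. */
final class ZeroColumnRowIterator implements Iterator<Object[]> {
    private static final Object[] EMPTY_ROW = new Object[0];
    private final BatchSource source;
    private int remainingInBatch = 0;

    ZeroColumnRowIterator(BatchSource source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // Keep advancing past empty batches until rows are available or input ends.
        while (remainingInBatch == 0) {
            if (!source.loadNextBatch()) {
                return false;
            }
            remainingInBatch = source.currentRowCount();
        }
        return true;
    }

    @Override
    public Object[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        remainingInBatch--;
        return EMPTY_ROW;
    }
}

/** Test double that serves batches with fixed row counts. */
final class FixedBatchSource implements BatchSource {
    private final int[] batchSizes;
    private int index = -1;

    FixedBatchSource(int... batchSizes) {
        this.batchSizes = batchSizes;
    }

    @Override
    public boolean loadNextBatch() {
        if (index + 1 >= batchSizes.length) {
            return false;
        }
        index++;
        return true;
    }

    @Override
    public int currentRowCount() {
        return batchSizes[index];
    }
}
```

Draining a `ZeroColumnRowIterator` over batches of 3, 0, and 2 rows produces 5 empty rows, which is all a downstream count aggregator needs.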
## Related code paths
- [`LanceRecordIterator.java`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java)
- [`HoodieSparkLanceReader.java`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java)
- [`TestLanceDataSource.scala`](https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala)
## Environment
- Hudi `master` @ commit `4d0e9cd47f9e`
- Spark datasource path with Lance-backed base files