Yicong-Huang opened a new pull request, #53461:
URL: https://github.com/apache/spark/pull/53461

   ### What changes were proposed in this pull request?
   
   This PR introduces an `ArrowBatch` case class to encapsulate Arrow batch 
metadata alongside the serialized batch data.
   
   Changes:
   1. **Added `ArrowBatch` case class** - A new case class in the 
`org.apache.spark.sql.execution.arrowcollect` package that holds `rowCount` and 
`batch` (a byte array)
   2. **Updated `ArrowBatchIterator` return type** - Changed from 
`Iterator[Array[Byte]]` to `Iterator[ArrowBatch]`
   3. **Updated `ArrowBatchWithSchemaIterator` return type** - Similarly 
changed to return `ArrowBatch`
   4. **Updated call sites** - Modified `Dataset.scala` and other call sites to 
extract `.batch` when only bytes are needed
   5. **Added tests** - New tests that cover `ArrowBatch` functionality and 
verify the row counts reported by the iterator
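
As a rough illustration of the shape described in the list above, here is a minimal sketch. The field types (`Long` for `rowCount`), the wrapper object, and the helpers `fakeBatches` and `bytesOnly` are assumptions made for this example and are not taken from the patch; the real class is `private[sql]`, which is omitted here so the snippet compiles on its own.

```scala
// Illustrative sketch only: field types and helper names are assumptions,
// not the code from this pull request.
object ArrowBatchSketch {
  // Pairs a serialized Arrow batch with the number of rows it contains.
  final case class ArrowBatch(rowCount: Long, batch: Array[Byte])

  // Hypothetical stand-in for an iterator such as ArrowBatchIterator:
  // produces one dummy batch per requested row count.
  def fakeBatches(rowCounts: Seq[Long]): Iterator[ArrowBatch] =
    rowCounts.iterator.map(n => ArrowBatch(n, Array.fill(n.toInt)(0.toByte)))

  // Call sites that only need the bytes can project out `.batch`.
  def bytesOnly(batches: Iterator[ArrowBatch]): Iterator[Array[Byte]] =
    batches.map(_.batch)
}
```

One general Scala caveat applies to any case class with an `Array[Byte]` field: the generated `equals` compares the array by reference, so two instances with identical contents in different arrays are not equal.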
   
   ### Why are the changes needed?
   
   Currently, the Arrow batch iterators only return the serialized byte arrays, 
discarding useful metadata like the row count. This information is computed 
during batch creation but not exposed to callers.
   
   By returning `ArrowBatch` with both `rowCount` and `batch`, downstream 
consumers can:
   - Know the exact number of rows in each batch without deserializing
   - Make better decisions about batch processing and memory management
   - Enable future optimizations that rely on batch-level metadata
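
The benefits listed above might look like the following in a consumer. This is a hedged sketch: `summarize` and `takeRows` are hypothetical helpers invented for illustration, and the `ArrowBatch` definition is repeated with assumed field types so the snippet is self-contained. Neither helper deserializes any Arrow data; both rely only on the metadata carried next to the bytes.

```scala
// Illustrative sketch only: helper names and field types are assumptions.
object RowCountSketch {
  final case class ArrowBatch(rowCount: Long, batch: Array[Byte])

  // Total rows and total serialized bytes, computed from metadata alone.
  def summarize(batches: Iterator[ArrowBatch]): (Long, Long) =
    batches.foldLeft((0L, 0L)) { case ((rows, bytes), b) =>
      (rows + b.rowCount, bytes + b.batch.length)
    }

  // Take whole batches until at least `limit` rows have been collected.
  def takeRows(batches: Iterator[ArrowBatch], limit: Long): List[ArrowBatch] = {
    val out = List.newBuilder[ArrowBatch]
    var seen = 0L
    while (seen < limit && batches.hasNext) {
      val b = batches.next()
      out += b
      seen += b.rowCount
    }
    out.result()
  }
}
```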
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is an internal API change. The `ArrowBatch` class is marked as 
`private[sql]`.
   
   ### How was this patch tested?
   
   1. Updated existing unit tests in `ArrowConvertersSuite` to work with the new 
return type
   2. Added new tests:
      - `ArrowBatch case class basic functionality` - Tests creation, field 
access, and copy
      - `ArrowBatch iterator from toBatchIterator` - Verifies row counts are 
correctly reported
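
For readers unfamiliar with what a "creation, field access, and copy" check covers, the sketch below uses plain `assert` calls rather than the suite's actual test framework. The class definition and field types are assumptions, and this is not the test code added in the patch.

```scala
// Illustrative sketch only: not the tests from ArrowConvertersSuite.
object ArrowBatchChecks {
  final case class ArrowBatch(rowCount: Long, batch: Array[Byte])

  def run(): Unit = {
    val bytes = Array[Byte](1, 2, 3)

    // Creation and field access.
    val original = ArrowBatch(rowCount = 10L, batch = bytes)
    assert(original.rowCount == 10L)
    assert(original.batch.sameElements(bytes))

    // copy changes only the named field; the byte array is shared, not cloned.
    val copied = original.copy(rowCount = 20L)
    assert(copied.rowCount == 20L)
    assert(copied.batch eq original.batch)
    assert(original.rowCount == 10L)
  }
}
```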
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

