HyukjinKwon commented on code in PR #50301:
URL: https://github.com/apache/spark/pull/50301#discussion_r2009374157
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonArrowOutput.scala:
##########
@@ -83,17 +89,37 @@ private[python] trait PythonArrowOutput[OUT <: AnyRef] {
self: BasePythonRunner[
throw writer.exception.get
}
try {
- if (reader != null && batchLoaded) {
+ if (batchLoaded && rowCount > 0 && currentRowIdx < rowCount) {
+ val batchRoot = if (arrowMaxRecordsPerOutputBatch > 0) {
+ val remainingRows = rowCount - currentRowIdx
+ if (remainingRows > arrowMaxRecordsPerOutputBatch) {
+ root.slice(currentRowIdx, arrowMaxRecordsPerOutputBatch)
+ } else {
+ root
+ }
+ } else {
+ root
+ }
+
+ currentRowIdx = currentRowIdx + batchRoot.getRowCount
+
+ vectors = batchRoot.getFieldVectors().asScala.map { vector =>
+ new ArrowColumnVector(vector)
+ }.toArray[ColumnVector]
+
+ val batch = new ColumnarBatch(vectors)
+ batch.setNumRows(batchRoot.getRowCount)
+ deserializeColumnarBatch(batch, schema)
Review Comment:
I think slicing by bytes would always be preferable.
But I have to agree that reusing the existing code path is good, and this
patch is minimal. I am fine with going ahead with this for now, but I would
like to make sure the code change stays isolated and the configuration stays
internal, so we can refactor things more easily later to support byte-based
output, etc.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]