xi-db opened a new pull request, #52271:
URL: https://github.com/apache/spark/pull/52271

   ### What changes were proposed in this pull request?
   
   Currently, we enforce gRPC message limits on both the client and the server. 
These limits are largely meant to protect both sides from potential OOMs by 
rejecting abnormally large messages. However, there are cases in which the 
server incorrectly sends oversized messages that exceed these limits and cause 
execution failures.
   
   Specifically, the server-to-client large message issue addressed here comes 
from the Arrow batch data in ExecutePlanResponse being too large. It occurs when 
a single Arrow row exceeds the 128MB message limit: Arrow cannot partition the 
batch any further, so it returns the single large row in one gRPC message.
   
   To improve Spark Connect stability, this PR chunks large Arrow batches when 
returning query results from the server to the client, ensuring that each 
ExecutePlanResponse chunk stays within the size limit. The client reassembles 
the chunks of a batch before parsing them as an Arrow batch.
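
As a rough illustration of the chunk-and-reassemble idea described above, here is a minimal Python sketch. It is not the code from this PR: `BatchChunk`, `split_batch`, `reassemble_batch`, and the `chunk_index` / `num_chunks` fields are hypothetical names chosen for the example, and the real ExecutePlanResponse fields may differ.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class BatchChunk:
    """Hypothetical stand-in for one response message carrying part of a batch."""
    chunk_index: int  # position of this chunk within the batch
    num_chunks: int   # total number of chunks the batch was split into
    data: bytes       # slice of the serialized Arrow batch


def split_batch(payload: bytes, max_chunk_size: int) -> Iterator[BatchChunk]:
    """Server side: yield chunks whose data never exceeds max_chunk_size bytes."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    # Ceiling division; an empty payload still produces one (empty) chunk.
    num_chunks = max(1, -(-len(payload) // max_chunk_size))
    for i in range(num_chunks):
        start = i * max_chunk_size
        yield BatchChunk(i, num_chunks, payload[start:start + max_chunk_size])


def reassemble_batch(chunks: List[BatchChunk]) -> bytes:
    """Client side: validate chunk ordering and join the slices back together."""
    if not chunks:
        raise ValueError("no chunks received")
    expected = chunks[0].num_chunks
    if len(chunks) != expected:
        raise ValueError(f"expected {expected} chunks, got {len(chunks)}")
    for i, chunk in enumerate(chunks):
        if chunk.chunk_index != i or chunk.num_chunks != expected:
            raise ValueError(f"unexpected chunk at position {i}")
    return b"".join(chunk.data for chunk in chunks)
```

In this sketch the client only parses the payload as an Arrow batch after `reassemble_batch` returns, so a missing or out-of-order chunk surfaces as an explicit error rather than as a corrupted batch.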
   
   (Scala client changes are being implemented in a followup PR.)
   
   To reproduce the issue:
   ```
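   # Assumes an active Spark Connect session bound to `spark`.
   # Number of repetitions of the 13-character string that fit in roughly 1 MB.
   repeat_num_per_mb = 1024 * 1024 // len('Apache Spark ')
   # Build a single row of roughly 300 MB, well above the 128 MB gRPC limit.
   res = spark.sql(f"select repeat('Apache Spark ', {repeat_num_per_mb * 300}) as huge_col from range(1)").collect()
   print(len(res))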
   ```
   It fails with a `StatusCode.RESOURCE_EXHAUSTED` error with the message 
`Received message larger than max (314570608 vs. 134217728)`.
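
The numbers in that error message can be checked with a few lines of arithmetic. The sketch below only does the math; the variable names are illustrative, and the chunk count is a lower bound that assumes chunks as large as the full 128MB limit, which is not necessarily the chunk size the server would use.

```python
# The gRPC limit reported in the error message: 128 MB expressed in bytes.
GRPC_MAX_MESSAGE_SIZE = 128 * 1024 * 1024

pattern = 'Apache Spark '  # 13 ASCII characters, so 13 bytes in UTF-8
repeat_num_per_mb = 1024 * 1024 // len(pattern)

# Size of the string value held in the single huge_col row.
row_payload_bytes = len(pattern) * repeat_num_per_mb * 300

# Message size reported by the client in the error message.
reported_message_bytes = 314570608

# Whatever is left over is message overhead on top of the string payload.
overhead_bytes = reported_message_bytes - row_payload_bytes

# Lower bound on the number of messages needed if each one may be as large
# as the full limit (ceiling division).
min_chunks = -(-reported_message_bytes // GRPC_MAX_MESSAGE_SIZE)

print(GRPC_MAX_MESSAGE_SIZE, row_payload_bytes, overhead_bytes, min_chunks)
```

The string payload alone is more than twice the limit, so the single row cannot fit in one message regardless of how the Arrow batch is partitioned.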
   
   With the improvement introduced by the PR, the above code runs successfully 
and prints the expected result.
   
   ### Why are the changes needed?
   
   It improves Spark Connect stability when returning large rows.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New tests on both the server side and the client side.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


