liuzqt commented on code in PR #38064:
URL: https://github.com/apache/spark/pull/38064#discussion_r995232173
##########
core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala:
##########
@@ -172,6 +247,8 @@ private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) {
private[spark] object ChunkedByteBuffer {
+ val COPY_BUFFER_LEN: Int = 1024 * 1024
Review Comment:
I added a `def estimateBufferChunkSize(estimatedSize: Long = -1)` to be used for both, but I'm not sure whether the heuristic is appropriate.
Another option: we could just use `1024` (1 KB) everywhere and keep it simple. In a quick benchmark, 1 KB wasn't much worse than 1 MB even for large results, and the overhead upper bound is reasonable even when the result is very tiny (in fact, even a nearly empty result still serializes to a few hundred bytes because of other metrics and accumulators).
WDYT?
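For illustration, a minimal sketch of what such a size-based heuristic could look like. This is hypothetical and not the actual implementation from the PR: the object name `BufferSizing`, the 1 KiB floor, and the clamping strategy are all assumptions; only the 1 MiB cap (`COPY_BUFFER_LEN`) and the `estimateBufferChunkSize(estimatedSize: Long = -1)` signature come from the discussion above.

```scala
// Hypothetical sketch, not the PR's implementation: pick a copy-buffer
// chunk size from an optional estimate of the payload size.
object BufferSizing {
  val MIN_CHUNK_LEN: Int = 1024        // assumed 1 KiB floor (the "simple" option)
  val MAX_CHUNK_LEN: Int = 1024 * 1024 // 1 MiB cap, matching COPY_BUFFER_LEN

  def estimateBufferChunkSize(estimatedSize: Long = -1): Int = {
    if (estimatedSize < 0) {
      // No estimate available: fall back to the 1 MiB default.
      MAX_CHUNK_LEN
    } else {
      // Clamp into [1 KiB, 1 MiB] so tiny results don't over-allocate
      // while large results still copy in big chunks.
      math.max(MIN_CHUNK_LEN.toLong, math.min(estimatedSize, MAX_CHUNK_LEN.toLong)).toInt
    }
  }
}
```

The clamp keeps the per-copy overhead bounded on both ends, which is roughly the trade-off the benchmark above is weighing.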
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------