liuzqt commented on PR #38064:
URL: https://github.com/apache/spark/pull/38064#issuecomment-1276518707
Some general comments about the performance implication regarding replacing
`Array[Byte]` and `ByteBuffer`(backed by `Array[Byte]`) with
`ChunkedByteBuffer`:
- when reading from a stream (i.e., `ByteArrayInputStream` vs
`ChunkedByteBufferInputStream`), there is not much difference;
`ByteArrayInputStream` may win slightly on cache locality because of its
contiguous memory, but `ChunkedByteBuffer` won't be much worse as long as the
chunk size is reasonable
- when writing to a stream (i.e., `ByteArrayOutputStream` vs
`ChunkedByteBufferOutputStream`):
  - `ByteArrayOutputStream` starts with a small buffer (32 bytes) and grows by
doubling, and has to do an **array copy** on every growth
  - `ChunkedByteBufferOutputStream` grows by a fixed `chunk size` (which you
can specify when creating the stream), and growth is **append style**
instead of **copy style**
  - in manual benchmarks on large data, `ChunkedByteBufferOutputStream` is
much faster (I tried data sizes from 100MB to 1GB and chunk sizes from 1KB to
1MB, and saw at least a ~2x speedup), which I would attribute mostly to the
array copy overhead
- when eventually dumping to a `ByteBuffer` (or raw byte array) vs. a
`ChunkedByteBuffer`, the latter may waste some memory in the last chunk, but I
believe that's not a big deal. And for serialization they're the same.
- after all, result collection is only a small portion of the whole end-to-end
query
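
To make the append-style vs copy-style distinction concrete, here is a minimal sketch of the two growth strategies. `SimpleChunkedOutputStream` below is a hypothetical illustration, not Spark's actual `ChunkedByteBufferOutputStream` (which lives in `org.apache.spark.util.io` and differs in detail): when the current chunk fills up, it appends a fresh fixed-size chunk and never copies previously written bytes, whereas `ByteArrayOutputStream` doubles its backing array and copies everything on each growth.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of append-style growth: when the current chunk is
// full, append a new fixed-size chunk; existing bytes are never moved.
class SimpleChunkedOutputStream extends OutputStream {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private int posInChunk;  // write position within the last chunk

    SimpleChunkedOutputStream(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public void write(int b) {
        if (chunks.isEmpty() || posInChunk == chunkSize) {
            chunks.add(new byte[chunkSize]);  // append a fresh chunk, no copy
            posInChunk = 0;
        }
        chunks.get(chunks.size() - 1)[posInChunk++] = (byte) b;
    }

    long size() {
        return chunks.isEmpty()
            ? 0
            : (long) (chunks.size() - 1) * chunkSize + posInChunk;
    }

    public static void main(String[] args) throws Exception {
        // ByteArrayOutputStream starts at 32 bytes and doubles on growth, so
        // writing N bytes performs ~log2(N/32) array copies along the way.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        SimpleChunkedOutputStream chunked = new SimpleChunkedOutputStream(1024);
        byte[] data = new byte[100_000];
        baos.write(data);
        chunked.write(data, 0, data.length);
        System.out.println(baos.size());     // 100000
        System.out.println(chunked.size());  // 100000
    }
}
```

The last chunk here also shows the trade-off from the bullet above: with a 1KB chunk size, the final chunk of a 100,000-byte write holds only 672 bytes, so 352 bytes of the allocation go unused.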
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]