liuzqt commented on PR #38064:
URL: https://github.com/apache/spark/pull/38064#issuecomment-1276518707
Some general comments about the performance implication regarding replacing
`Array[Byte]` and `ByteBuffer`(backed by `Array[Byte]`) with
`ChunkedByteBuffer`:
- when reading from a stream (i.e., `ByteArrayInputStream` vs
`ChunkedByteBufferInputStream`), there is not much difference;
`ByteArrayInputStream` may win slightly on cache locality because of its
contiguous memory, but `ChunkedByteBuffer` won't be much worse as long as the
chunk size is reasonable
- when writing to a stream (i.e., `ByteArrayOutputStream` vs
`ChunkedByteBufferOutputStream`):
  - `ByteArrayOutputStream` starts with a small buffer (32 bytes) and grows by
doubling, and has to do an **array copy** on every growth
  - `ChunkedByteBufferOutputStream` grows by a fixed `chunk size` (which you
can specify when creating the stream), and growth is **append style**
instead of **copy style**
  - in manual benchmarks on large data, `ChunkedByteBufferOutputStream` is
much faster (I tried data sizes from 100MB to 1GB and chunk sizes from 1KB to
1MB, and saw at least a ~2x speedup), which I would attribute mostly to the
array copy overhead
- when eventually dumping to a `ByteBuffer` (or raw byte array) vs. a
`ChunkedByteBuffer`, the latter may waste some memory in the last chunk, but I
believe that's not a big deal. And for serialization they're the same.
- after all, result collection is only a small portion of the whole end-to-end
query
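
To make the append-style vs copy-style distinction concrete, here is a minimal sketch of the two growth strategies. `SimpleChunkedOutputStream` below is a hypothetical illustration, not Spark's actual `ChunkedByteBufferOutputStream` (which lives in `org.apache.spark.util.io` and differs in detail): when the current chunk fills up, it appends a fresh fixed-size chunk and never copies previously written bytes, whereas `ByteArrayOutputStream` doubles its backing array and copies everything on each growth.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of append-style growth: when the current chunk is
// full, append a new fixed-size chunk; existing bytes are never moved.
class SimpleChunkedOutputStream extends OutputStream {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private int posInChunk;  // write position within the last chunk

    SimpleChunkedOutputStream(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    @Override
    public void write(int b) {
        if (chunks.isEmpty() || posInChunk == chunkSize) {
            chunks.add(new byte[chunkSize]);  // append a fresh chunk, no copy
            posInChunk = 0;
        }
        chunks.get(chunks.size() - 1)[posInChunk++] = (byte) b;
    }

    long size() {
        return chunks.isEmpty()
            ? 0
            : (long) (chunks.size() - 1) * chunkSize + posInChunk;
    }

    public static void main(String[] args) throws Exception {
        // ByteArrayOutputStream starts at 32 bytes and doubles on growth, so
        // writing N bytes performs ~log2(N/32) array copies along the way.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        SimpleChunkedOutputStream chunked = new SimpleChunkedOutputStream(1024);
        byte[] data = new byte[100_000];
        baos.write(data);
        chunked.write(data, 0, data.length);
        System.out.println(baos.size());     // 100000
        System.out.println(chunked.size());  // 100000
    }
}
```

The last chunk here also shows the trade-off from the bullet above: with a 1KB chunk size, the final chunk of a 100,000-byte write holds only 672 bytes, so 352 bytes of the allocation go unused.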
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]