albertlockett opened a new issue, #8386:
URL: https://github.com/apache/arrow-rs/issues/8386
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
We're developing a system that emits an Arrow IPC stream of many record
batches, and we noticed that enabling zstd compression causes a significant
increase in CPU utilization.
Profiling revealed that much of that time is spent initialising the zstd
compression context:
```
__bzero [libsystem_platform.dylib]
ZSTD_cwksp_clean_tables
[cargo:index.crates.io-1949cf8c6b5b557f:zstd-sys-2.0.15+zstd.1.5.7:zstd/lib/compress/zstd_cwksp.h]
ZSTD_reset_matchState
[cargo:index.crates.io-1949cf8c6b5b557f:zstd-sys-2.0.15+zstd.1.5.7:zstd/lib/compress/zstd_compress.c]
ZSTD_resetCCtx_internal
[cargo:index.crates.io-1949cf8c6b5b557f:zstd-sys-2.0.15+zstd.1.5.7:zstd/lib/compress/zstd_compress.c]
ZSTD_compressBegin_internal
[cargo:index.crates.io-1949cf8c6b5b557f:zstd-sys-2.0.15+zstd.1.5.7:zstd/lib/compress/zstd_compress.c]
ZSTD_CCtx_init_compressStream2
[cargo:index.crates.io-1949cf8c6b5b557f:zstd-sys-2.0.15+zstd.1.5.7:zstd/lib/compress/zstd_compress.c]
ZSTD_compressStream2
[cargo:index.crates.io-1949cf8c6b5b557f:zstd-sys-2.0.15+zstd.1.5.7:zstd/lib/compress/zstd_compress.c]
ZSTD_compressStream
[cargo:index.crates.io-1949cf8c6b5b557f:zstd-sys-2.0.15+zstd.1.5.7:zstd/lib/compress/zstd_compress.c]
zstd_safe::CCtx::compress_stream [zstd-safe-7.2.4/src/lib.rs]
<zstd::stream::raw::Encoder as zstd::stream::raw::Operation>::run
[zstd-0.13.3/src/stream/raw.rs]
<zstd::stream::zio::reader::Reader<R,D> as std::io::Read>::read
[zstd-0.13.3/src/stream/zio/reader.rs]
<zstd::stream::read::Encoder<R> as std::io::Read>::read
[zstd-0.13.3/src/stream/read/mod.rs]
std::io::Read::read_buf::{{closure}}
[/Users/a.lockett/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/io/mod.rs]
std::io::default_read_buf
[/Users/a.lockett/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/io/mod.rs]
std::io::Read::read_buf
[/Users/a.lockett/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/io/mod.rs]
std::io::default_read_to_end
[/Users/a.lockett/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/io/mod.rs]
std::io::Read::read_to_end
[/Users/a.lockett/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/io/mod.rs]
<alloc::vec::Vec<u8> as std::io::copy::BufferedWriterSpec>::copy_from
[/Users/a.lockett/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/io/copy.rs]
std::io::copy::generic_copy
[/Users/a.lockett/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/io/copy.rs]
std::io::copy::copy
[/Users/a.lockett/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/io/copy.rs]
arrow_ipc::compression::compress_zstd
[/Users/a.lockett/Development/arrow-rs/arrow-ipc/src/compression.rs]
arrow_ipc::compression::CompressionCodec::compress
[/Users/a.lockett/Development/arrow-rs/arrow-ipc/src/compression.rs]
arrow_ipc::compression::CompressionCodec::compress_to_vec
[/Users/a.lockett/Development/arrow-rs/arrow-ipc/src/compression.rs]
arrow_ipc::writer::write_buffer
[/Users/a.lockett/Development/arrow-rs/arrow-ipc/src/writer.rs]
arrow_ipc::writer::write_array_data
[/Users/a.lockett/Development/arrow-rs/arrow-ipc/src/writer.rs]
arrow_ipc::writer::IpcDataGenerator::record_batch_to_bytes
[/Users/a.lockett/Development/arrow-rs/arrow-ipc/src/writer.rs]
arrow_ipc::writer::IpcDataGenerator::encoded_batch
[/Users/a.lockett/Development/arrow-rs/arrow-ipc/src/writer.rs]
arrow_ipc::writer::StreamWriter<W>::write
[/Users/a.lockett/Development/arrow-rs/arrow-ipc/src/writer.rs]
otel_arrow_rust::encode::producer::StreamProducer::serialize_batch
[/Users/a.lockett/Development/otel-arrow/rust/otel-arrow-rust/src/encode/producer.rs]
otel_arrow_rust::encode::producer::Producer::produce_bar
[/Users/a.lockett/Development/otel-arrow/rust/otel-arrow-rust/src/encode/producer.rs]
otap_df_otap::otap_exporter::create_req_stream::{{closure}}
[/Users/a.lockett/Development/otel-arrow/rust/otap-dataflow/crates/otap/src/otap_exporter.rs]
```
I think this is happening because we initialise a new zstd encoder for each
call to `compress_zstd`:
https://github.com/apache/arrow-rs/blob/f4840f6df1c2549ce0947305b7111edad638b445/arrow-ipc/src/compression.rs#L177-L184
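For reference, the code at that permalink is roughly this shape (paraphrased from the linked lines and the stack trace above, not copied verbatim): each call builds a new streaming encoder, which allocates and zeroes a fresh zstd compression context.
```rust
use std::io;

// Paraphrased sketch of the current compress_zstd (see the permalink above for
// the real code): a new streaming encoder, and with it a new zstd compression
// context, is constructed for every buffer that gets compressed.
fn compress_zstd(input: &[u8], output: &mut Vec<u8>) -> io::Result<()> {
    let mut encoder = zstd::stream::read::Encoder::new(input, 0)?;
    io::copy(&mut encoder, output)?;
    Ok(())
}
```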
The zstd manual recommends allocating a context once and reusing it across
successive compressions:
https://facebook.github.io/zstd/zstd_manual.html
> When compressing many times,
> it is recommended to allocate a context just once,
> and re-use it for each successive compression operation.
> This will make workload friendlier for system's memory.
> Note : re-using context is just a speed / resource optimization.
> It doesn't change the compression ratio, which remains identical.
> Note 2 : In multi-threaded environments,
> use one different context per thread for parallel execution.
I gave this a quick test using a `bulk::Compressor` that is initialised only
once, and indeed my workload used ~50% less of a CPU core (the PoC code is
here, though it's a bit of a mess:
https://github.com/apache/arrow-rs/commit/4d37b4ba15fa1dd5696d0697f4eeb380b055f669)
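For illustration, the reuse pattern boils down to something like the following (a sketch along the lines of the PoC, not the PoC code itself; the `ReusableZstd` name is made up): build the `bulk::Compressor` once and feed it every buffer.
```rust
use zstd::bulk::Compressor;

// Sketch of the context-reuse pattern (illustrative, not the PoC code): the
// compressor, and the zstd context it owns, is created once and then reused
// for every buffer instead of being rebuilt per call.
struct ReusableZstd {
    compressor: Compressor<'static>,
}

impl ReusableZstd {
    fn new() -> std::io::Result<Self> {
        // Level 0 selects zstd's default compression level.
        Ok(Self {
            compressor: Compressor::new(0)?,
        })
    }

    fn compress(&mut self, input: &[u8]) -> std::io::Result<Vec<u8>> {
        // Reuses the already-allocated context, avoiding the per-call workspace
        // allocation and zeroing that shows up as __bzero in the profile above.
        self.compressor.compress(input)
    }
}
```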
**Describe the solution you'd like**
I'd like to reuse the zstd context between record batches, which should reduce
CPU usage when writing compressed IPC streams.
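One hypothetical way to expose a reused context to `compress_zstd` without changing the public writer API, in line with the manual's "one context per thread" note, would be a lazily initialised thread-local compressor. Everything below is illustrative only, not a concrete proposal:
```rust
use std::cell::RefCell;
use zstd::bulk::Compressor;

// Hypothetical placement of the shared context (names are illustrative): a
// per-thread compressor satisfies the zstd manual's "one context per thread"
// guidance without threading a handle through the writer API.
thread_local! {
    static ZSTD_CTX: RefCell<Option<Compressor<'static>>> = RefCell::new(None);
}

fn compress_zstd_reusing_ctx(input: &[u8]) -> std::io::Result<Vec<u8>> {
    ZSTD_CTX.with(|ctx| {
        let mut ctx = ctx.borrow_mut();
        // Create the context lazily the first time this thread compresses.
        if ctx.is_none() {
            *ctx = Some(Compressor::new(0)?);
        }
        ctx.as_mut().unwrap().compress(input)
    })
}
```
Storing the compressor inside the writer (or the codec it uses) per stream would be another option; the trade-off is mainly about where the mutable state lives.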
**Describe alternatives you've considered**
**Additional context**
related to https://github.com/open-telemetry/otel-arrow/issues/1129