Hi Spark devs,

I was looking into the memory usage of shuffle, and one annoying thing about
the default compression codec (LZF) is that the implementation we use
allocates buffers pretty generously. In a simple experiment, creating 1000
LZFOutputStreams allocated 198976424 bytes (~190MB), i.e. roughly 200KB per
stream. For a shuffle task with 10k reducers and 32 threads running
concurrently, the LZF streams alone would use ~200KB x 10,000 x 32 = ~60GB
of memory.

In comparison, Snappy allocates only ~65MB for every 1k SnappyOutputStreams.
However, Snappy's compression ratio is slightly worse than LZF's; in my
experience it leads to a 10 - 20% increase in output size. Compression ratio
does matter here, because we are sending the data across the network.
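For reference, below is a rough sketch of how this kind of measurement could
be reproduced. The actual numbers come from the gist linked at the end of
this mail; using HotSpot's com.sun.management.ThreadMXBean to track
per-thread allocation is just my assumption about one way to do it, and the
stream classes are the ones our codec wrappers use (compress-lzf and
snappy-java).

import java.io.ByteArrayOutputStream
import java.lang.management.ManagementFactory
import com.sun.management.ThreadMXBean
import com.ning.compress.lzf.LZFOutputStream
import org.xerial.snappy.SnappyOutputStream

object CodecAllocation {
  def main(args: Array[String]): Unit = {
    // HotSpot-specific bean that can report bytes allocated by a thread.
    val bean = ManagementFactory.getThreadMXBean.asInstanceOf[ThreadMXBean]
    val tid = Thread.currentThread().getId

    def measure(name: String)(mkStream: () => AnyRef): Unit = {
      val before = bean.getThreadAllocatedBytes(tid)
      // Hold on to the streams so their buffers stay reachable while measuring.
      val streams = Array.fill(1000)(mkStream())
      val after = bean.getThreadAllocatedBytes(tid)
      println(s"$name: ${after - before} bytes allocated for ${streams.length} streams")
    }

    measure("LZFOutputStream")(() => new LZFOutputStream(new ByteArrayOutputStream()))
    measure("SnappyOutputStream")(() => new SnappyOutputStream(new ByteArrayOutputStream()))
  }
}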

In future releases we will likely change the shuffle implementation to open
fewer streams. Until that happens, I'm looking for compression codec
implementations that are fast, allocate small buffers, and have a decent
compression ratio.

Does anybody on this list have any suggestions? If not, I will submit a
patch for 1.1 that replaces LZF with Snappy as the default compression
codec to lower memory usage.
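To be clear, nothing stops anyone from switching codecs per job today via the
existing spark.io.compression.codec setting; the patch would only change the
default. A minimal example (full class name shown; short names may also work
depending on version):

import org.apache.spark.SparkConf

// Per-job override of the shuffle/IO compression codec (default is currently LZF).
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")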


allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567
