Maybe we could try LZ4 [1], which has better performance and a smaller
footprint than both LZF and Snappy. In fast scan mode, its throughput is
1.5 - 2x higher than LZF's [2], while the memory it uses is roughly 10x
smaller (16k vs 190k per stream).

[1] https://github.com/jpountz/lz4-java
[2] http://ning.github.io/jvm-compressor-benchmark/results/calgary/roundtrip-2013-06-06/index.html
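
For concreteness, here is a minimal sketch (in Java, against lz4-java's
public API) of what wrapping a stream could look like. LZ4Factory,
LZ4BlockOutputStream, and LZ4BlockInputStream are lz4-java classes; the
16k block size and the Lz4Streams wrapper are assumptions of mine, chosen
to match the footprint figure above rather than lz4-java's (larger)
default block size:

    import java.io.InputStream;
    import java.io.OutputStream;

    import net.jpountz.lz4.LZ4BlockInputStream;
    import net.jpountz.lz4.LZ4BlockOutputStream;
    import net.jpountz.lz4.LZ4Factory;

    public class Lz4Streams {
      // Small block size keeps the per-stream buffer footprint low. 16k is
      // an assumed value matching the figure above, not lz4-java's default.
      private static final int BLOCK_SIZE = 16 * 1024;

      public static OutputStream compressed(OutputStream out) {
        // Fast ("scan" mode) compressor; fastestInstance() picks the
        // JNI-backed implementation when available, else pure Java.
        return new LZ4BlockOutputStream(
            out, BLOCK_SIZE, LZ4Factory.fastestInstance().fastCompressor());
      }

      public static InputStream decompressed(InputStream in) {
        return new LZ4BlockInputStream(in);
      }
    }

If the numbers hold up, something like this could presumably sit behind
spark.io.compression.codec the same way the LZF and Snappy codecs do.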


On Mon, Jul 14, 2014 at 12:01 AM, Reynold Xin <r...@databricks.com> wrote:
>
> Hi Spark devs,
>
> I was looking into the memory usage of shuffle, and one annoying thing
> about the default compression codec (LZF) is that the implementation we
> use allocates buffers pretty generously. I did a simple experiment and
> found that creating 1000 LZFOutputStreams allocated 198976424 bytes
> (~190MB). If we have a shuffle task that uses 10k reducers and 32 threads
> running concurrently, the memory used by the LZF streams alone would be
> ~60GB (roughly 190KB per stream x 10,000 reducers x 32 threads).
>
> In comparison, Snappy only allocates ~65MB for every
> 1k SnappyOutputStreams. However, Snappy's compression ratio is slightly
> lower than LZF's; in my experience, it leads to a 10 - 20% increase in
> size. Compression ratio does matter here because we are sending data
> across the network.
>
> In future releases we will likely change the shuffle implementation to
> open fewer streams. Until that happens, I'm looking for compression codec
> implementations that are fast, allocate small buffers, and have a decent
> compression ratio.
>
> Does anybody on this list have any suggestions? If not, I will submit a
> patch for 1.1 that replaces LZF with Snappy as the default compression
> codec to lower memory usage.
>
>
> allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567
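
(For anyone who wants to reproduce the numbers above: the gist has the
actual data, but one minimal way to measure per-stream allocation is
sketched below. It assumes a HotSpot JVM, since
com.sun.management.ThreadMXBean is a HotSpot-specific extension, plus the
Ning compress-lzf library's LZFOutputStream; the AllocationProbe class
itself is hypothetical, not Reynold's exact methodology.)

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStream;
    import java.lang.management.ManagementFactory;

    import com.ning.compress.lzf.LZFOutputStream;

    public class AllocationProbe {
      public static void main(String[] args) throws Exception {
        // HotSpot extension reporting bytes allocated by a given thread.
        com.sun.management.ThreadMXBean threads =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();

        OutputStream[] streams = new OutputStream[1000]; // keep them reachable
        long before = threads.getThreadAllocatedBytes(tid);
        for (int i = 0; i < streams.length; i++) {
          // Note: this also counts the small ByteArrayOutputStream
          // allocations, not just the codec's internal buffers.
          streams[i] = new LZFOutputStream(new ByteArrayOutputStream());
        }
        long after = threads.getThreadAllocatedBytes(tid);
        System.out.println("allocated by 1000 LZFOutputStreams: "
            + (after - before) + " bytes");
      }
    }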
