We tried a lower block size for LZF, but it barfed all over the place.
Snappy was the way to go for our jobs.
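
For anyone who wants to try the same switch without waiting for a default
change, it is just a conf setting. A minimal sketch (using the full codec
class name to stay safe across 1.x versions; the app name is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    // Use Snappy instead of the default LZF for shuffle/broadcast/RDD block
    // compression.
    val conf = new SparkConf()
      .setAppName("snappy-shuffle-test")
      .set("spark.io.compression.codec",
           "org.apache.spark.io.SnappyCompressionCodec")
    val sc = new SparkContext(conf)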


Regards,
Mridul


On Mon, Jul 14, 2014 at 12:31 PM, Reynold Xin <r...@databricks.com> wrote:
> Hi Spark devs,
>
> I was looking into the memory usage of shuffle, and one annoying thing about
> the default compression codec (LZF) is that the implementation we use
> allocates buffers pretty generously. I did a simple experiment and found
> that creating 1000 LZFOutputStreams allocated 198976424 bytes (~190 MB, or
> roughly 190 KB per stream). If we have a shuffle task that uses 10k reducers
> and 32 threads running concurrently, the memory used by the LZF streams alone
> would be ~60 GB (~190 KB * 10,000 reducers * 32 threads).
>
> In comparison, Snappy only allocates ~ 65MB for every
> 1k SnappyOutputStreams. However, Snappy's compression ratio is slightly worse
> than LZF's; in my experience, it leads to a 10 - 20% increase in size. Compression
> ratio does matter here because we are sending data across the network.
>
> In future releases we will likely change the shuffle implementation to open
> fewer streams. Until that happens, I'm looking for compression codec
> implementations that are fast, allocate small buffers, and have a decent
> compression ratio.
>
> Does anybody on this list have any suggestions? If not, I will submit a
> patch for 1.1 that replaces LZF with Snappy as the default compression
> codec to lower memory usage.
>
>
> allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567
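
A rough way to sanity-check the per-stream allocation numbers above (just a
sketch, not the instrumentation behind the gist: it assumes compress-lzf and
snappy-java are on the classpath and uses coarse heap deltas rather than an
allocation profiler):

    import java.io.ByteArrayOutputStream
    import com.ning.compress.lzf.LZFOutputStream
    import org.xerial.snappy.SnappyOutputStream

    object CodecBufferCheck {
      // Very coarse: force a GC and read the current heap usage.
      private def usedHeap(): Long = {
        System.gc()
        val rt = Runtime.getRuntime
        rt.totalMemory() - rt.freeMemory()
      }

      private def measure(name: String)(mkStream: () => AnyRef): Unit = {
        val before = usedHeap()
        // Keep references alive so the internal buffers cannot be collected yet.
        val streams = Array.fill(1000)(mkStream())
        val after = usedHeap()
        println(s"$name: ~${(after - before) / (1024 * 1024)} MB for ${streams.length} streams")
      }

      def main(args: Array[String]): Unit = {
        measure("LZFOutputStream")(() => new LZFOutputStream(new ByteArrayOutputStream()))
        measure("SnappyOutputStream")(() => new SnappyOutputStream(new ByteArrayOutputStream()))
      }
    }

Run it with a reasonably sized heap (say -Xmx1g) so the 1000 LZF streams fit
comfortably; the absolute numbers will vary with JVM and library versions.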
