On 01/26/2015 10:06 AM, Jean-Pascal Billaud wrote:
Hi,

I bumped into this change on the parquet-mr project and was wondering about
the impact of flushing more often when more writers are created. I assume
the column blocks won't be as full, which could add disk seeks and make
compression less efficient... Sure, this is all better than an OOM, but I'd
like to understand the trade-offs in your experience.

Thanks,

This is more of a safety valve than something you want to hit, and a last resort for the writers. It is better to write the data inefficiently than to crash, but we want to show a warning message when this happens so that you can restructure your writes to avoid hitting it.

How damaging this is to performance after the data is written depends on how far over the memory boundary you go. Let's assume the rest of the program takes ~150MB of memory, you have 1GB of heap, and the memory manager is set to 80% of the available heap. The 20% reserved for non-Parquet work is ~204MB, which comfortably covers the 150MB. That leaves 819.2MB for Parquet writers, which is enough for 6 open files at 128MB row groups. Adding a 7th file means that allocation is rebalanced by dividing by 7, giving ~117MB row groups.

Going from 128MB to 117MB row groups isn't a big deal. If you have 10 columns, each column chunk is about one page (~1MB) shorter: you'll still have ~11.7MB column chunks instead of ~12.8MB.

The problem is that this gets progressively worse as you increase the number of files. If you were writing to 12 files instead of 6, your column chunks would be half the size you configured. Is that really bad? It depends. Typically, you want your row group size setting to be as small as possible while still getting the right trade-off of I/O and memory, so if the memory manager has to shrink it further, that's a bad sign.
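
To make that arithmetic concrete, here is a rough, self-contained sketch of the rebalancing. It is illustrative only, not the parquet-mr MemoryManager code; the 1GB heap, 80% ratio, 128MB target, and 10 columns are just the numbers from the example above.

// Illustrative only: the rebalancing arithmetic described above,
// not the parquet-mr MemoryManager implementation.
public class RowGroupSizing {
  static final long MB = 1024L * 1024L;

  // Per-writer allocation: an even share of the Parquet memory pool,
  // capped at the configured row group size.
  static long rowGroupAllocation(long heapBytes, double poolRatio,
                                 long configuredRowGroupSize, int writers) {
    long pool = (long) (heapBytes * poolRatio);   // 80% of 1GB = 819.2MB
    return Math.min(configuredRowGroupSize, pool / writers);
  }

  public static void main(String[] args) {
    long heap = 1024 * MB;     // 1GB heap
    long target = 128 * MB;    // configured row group size
    int columns = 10;

    for (int writers = 6; writers <= 12; writers++) {
      long rowGroup = rowGroupAllocation(heap, 0.8, target, writers);
      System.out.printf("%2d writers -> ~%3dMB row groups, ~%.1fMB column chunks%n",
          writers, rowGroup / MB, (double) rowGroup / columns / MB);
    }
    // 6 writers -> 128MB (819.2 / 6 = 136.5, capped at 128), 7 -> ~117MB,
    // ... 12 -> ~68MB, roughly half the configured size.
  }
}

The shrinking is linear in the number of open writers, which is why a couple of extra files is harmless but a dozen cuts your row groups in half.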

It is much better to structure your writes so that you only write to a few files at a time (Kite, for example, uses one or two) and allocate memory to handle that.
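
For example, here is a hypothetical sketch of that structure (not a real API: createWriter and partitionFor are placeholders for however you construct your ParquetWriter and pick a target file). The point is to group records by output file and write one partition at a time, so only a single writer is buffering a row group at any moment.

// Hypothetical sketch: group records by target file and write partitions
// sequentially, so only one writer's buffered row group is in memory at a time.
// createWriter(...) and partitionFor(...) are placeholders, not a real API.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionedWrite<T> {

  interface RecordWriter<R> extends AutoCloseable {
    void write(R record) throws IOException;
    void close() throws IOException;
  }

  // Placeholder: build a Parquet writer for one output file.
  RecordWriter<T> createWriter(String path) throws IOException {
    throw new UnsupportedOperationException("construct a ParquetWriter here");
  }

  // Placeholder: pick the output file for a record.
  String partitionFor(T record) {
    throw new UnsupportedOperationException("partitioning logic here");
  }

  public void writeAll(Iterable<T> records) throws IOException {
    // Group by target file first. In a real job you'd get this ordering from
    // an upstream sort or shuffle rather than buffering everything in memory;
    // the write order is the point.
    Map<String, List<T>> byPartition = new TreeMap<>();
    for (T r : records) {
      byPartition.computeIfAbsent(partitionFor(r), k -> new ArrayList<>()).add(r);
    }
    // Write one partition at a time, closing each file before the next, so the
    // memory manager never has to split the pool across many open writers.
    for (Map.Entry<String, List<T>> e : byPartition.entrySet()) {
      try (RecordWriter<T> writer = createWriter(e.getKey())) {
        for (T r : e.getValue()) {
          writer.write(r);
        }
      }
    }
  }
}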

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
