Hi Ryan,

thanks for the clarification this makes sense.

Thanks,

On Mon, Jan 26, 2015 at 11:56 AM, Ryan Blue <[email protected]> wrote:

> On 01/26/2015 10:06 AM, Jean-Pascal Billaud wrote:
>
>> Hi,
>>
>> I bumped into this change on the parquet-mr project and was wondering
>> about
>> the impact on flushing more often in the case more writers are being
>> created. I am assuming that column blocks won't be as full therefore
>> adding
>> some disk seeks possibly, less efficient compression... Sure this is all
>> better than OOM though I just would like to understand the trade-offs in
>> your experience.
>>
>> Thanks,
>>
>
> This is more of a safety valve than something you want to hit, and a kind
> of last resort for the writers. It is better to write the data
> inefficiently than to crash, but we want to show a warning message when
> this happens so that you can restructure your writes to avoid it happening.
>
> How damaging this is to performance after the data is written depends on
> how far over the memory boundary you go. Let's assume the rest of the
> program takes ~150MB of memory, you have 1GB of heap, and you're using the
> memory manager at 80% of available heap. 20% for non-Parquet is ~204MB,
> which is fine for the 150MB. That leaves 819.2MB for Parquet files, which
> is 6 at 128MB row groups. Adding a 7th file means we rebalance that
> allocation by dividing by 7 to get 117MB row groups.
>
> Going from 128MB to 117MB row groups isn't a big deal. If you have 10
> columns, each column is about 1 page shorter: you'll still have ~11.7MB
> column chunks instead of ~12.8MB.
>
> The problem is that this gets progressively worse as you increase the
> number of files. If you were writing to 12 files instead of 6, then your
> column chunks are half the size, which is very different from what you
> configured. Is that really bad? It depends. Typically, you want your row
> group size setting to be as small as possible to get the right trade-off of
> I/O and memory, so running out of memory should be a bad thing.
>
> It is much better to structure your writes so that you only write to a few
> files at a time (Kite, for example, uses one or two) and allocate memory to
> handle that.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Reply via email to