Hi Ryan, thanks for the clarification this makes sense.
Thanks, On Mon, Jan 26, 2015 at 11:56 AM, Ryan Blue <[email protected]> wrote: > On 01/26/2015 10:06 AM, Jean-Pascal Billaud wrote: > >> Hi, >> >> I bumped into this change on the parquet-mr project and was wondering >> about >> the impact on flushing more often in the case more writers are being >> created. I am assuming that column blocks won't be as full therefore >> adding >> some disk seeks possibly, less efficient compression... Sure this is all >> better than OOM though I just would like to understand the trade-offs in >> your experience. >> >> Thanks, >> > > This is more of a safety valve than something you want to hit, and a kind > of last resort for the writers. It is better to write the data > inefficiently than to crash, but we want to show a warning message when > this happens so that you can restructure your writes to avoid it happening. > > How damaging this is to performance after the data is written depends on > how far over the memory boundary you go. Let's assume the rest of the > program takes ~150MB of memory, you have 1GB of heap, and you're using the > memory manager at 80% of available heap. 20% for non-Parquet is ~204MB, > which is fine for the 150MB. That leaves 819.2MB for Parquet files, which > is 6 at 128MB row groups. Adding a 7th file means we rebalance that > allocation by dividing by 7 to get 117MB row groups. > > Going from 128MB to 117MB row groups isn't a big deal. If you have 10 > columns, each column is about 1 page shorter: you'll still have ~11.7MB > column chunks instead of ~12.8MB. > > The problem is that this gets progressively worse as you increase the > number of files. If you were writing to 12 files instead of 6, then your > column chunks are half the size, which is very different from what you > configured. Is that really bad? It depends. Typically, you want your row > group size setting to be as small as possible to get the right trade-off of > I/O and memory, so running out of memory should be a bad thing. > > It is much better to structure your writes so that you only write to a few > files at a time (Kite, for example, uses one or two) and allocate memory to > handle that. > > rb > > -- > Ryan Blue > Software Engineer > Cloudera, Inc. >
