There are currently some inefficiencies in the writer for very large schemas (1000+ columns).
This is going to be improved soon; see
https://github.com/apache/incubator-parquet-mr/pull/98, which addresses this.
Basically, the writer allocates 4 * #columns empty buffers, each with an initial size of 64K.
So reducing the row group size does not help much here; that memory is allocated up front regardless of how much data ends up in the group.
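To put that in perspective for your 2700-column schema (rough back-of-the-envelope math, using the 64K figure above): 4 * 2700 * 64 KB is roughly 675 MB of buffer space before a single value is written, so the heap has to hold that no matter how small you make the row group.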
As a workaround, you can temporarily increase the heap size for your tasks when you have
large schemas.
We currently have a schema with 4000+ columns, which works as long as you
bump the heap size of those tasks (which is not ideal).
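If you are running this from Pig, one way to do that is to set the MapReduce child JVM options at the top of your script. This is just a sketch: the exact property names depend on your Hadoop version, and the 4096m value is only an example you will want to tune for your cluster.

    -- Hadoop 2.x / YARN style properties
    SET mapreduce.map.java.opts '-Xmx4096m';
    SET mapreduce.reduce.java.opts '-Xmx4096m';
    -- on older Hadoop 1.x clusters the equivalent is usually
    -- SET mapred.child.java.opts '-Xmx4096m';

If you are on YARN, also make sure the container sizes (mapreduce.map.memory.mb / mapreduce.reduce.memory.mb) are raised accordingly, otherwise the containers may be killed before the JVM can use the larger heap.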


On Fri, Feb 13, 2015 at 1:56 AM, Jianshi Huang <[email protected]>
wrote:

> Hi,
>
> I'm converting my Pig dataset of 2700+ columns to Parquet format.
>
> I set parquet.block.size to be 1GB and I'm still getting OOM issues.
>
> Is it still too small? (I guess there's only 1 row group; that's the case for
> another dataset with 600+ columns.) Is there a setting to specify the
> number of row groups?
>
> Thanks,
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
