On 10/27/2014 11:10 AM, Suraj Nayak wrote:
Hi Parquet Developers,
I am using Parquet for analytics on AWS EC2. An INSERT OVERWRITE from
one small Hive table to another works fine with Parquet and Snappy,
but when the table has 1.2+ billion records, the INSERT OVERWRITE
fails with a Java heap space error.
A few other stats about the data:
Parquet version: 1.2.5 (CDH 5.1.3)
Total files in the HDFS directory: 555
Attached: PrintFooter Parquet.txt with the PrintFooter information.
(Note: the column names have been changed to random characters.)
HDFS block size: 256 MB
*YARN configuration:*
yarn.nodemanager.resource.memory-mb = 24GB
yarn.scheduler.minimum-allocation-mb = 2GB
mapreduce.map.memory.mb = 6GB
mapreduce.reduce.memory.mb = 12GB
mapreduce.map.java.opts = 4.5GB
mapreduce.reduce.java.opts = 9GB
yarn.nodemanager.vmem-pmem-ratio = 2.1
Can anyone help me debug or fix this issue? Is it fixed in any newer
Parquet version (> v1.2.5)?
Let me know if anyone needs more information.
Hi Suraj,
Can you give us a little more info? Specifically:
1. Approximately how many output partitions are you writing to?
2. What is the parquet.block.size setting in your config?
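If you're not sure what parquet.block.size is set to, a quick way to
check is to SET the property with no value from the Hive CLI, which
prints it (a minimal sketch; the dynamic-partition properties are only
relevant if you're using dynamic partitioning):

    -- Print the current Parquet row group ("block") size, in bytes.
    SET parquet.block.size;

    -- Dynamic-partition limits are worth a look too, since every open
    -- partition costs roughly one row group of write-buffer memory.
    SET hive.exec.max.dynamic.partitions;
    SET hive.exec.max.dynamic.partitions.pernode;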
It looks like you're probably running into trouble with the number of
open files, so I'll assume that's the case.
Hive will open a file in each partition written to by a task, and each
open file buffers up to roughly parquet.block.size bytes in memory. So
the total memory used is about parquet.block.size * the number of
partitions a task writes to. You can fix the problem by reducing the
number of output partitions, which usually requires breaking the
conversion into smaller jobs. You can also reduce the Parquet block
size, but I don't recommend it.
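To put rough numbers on that (illustrative only, since I don't know how
many partitions each of your tasks actually writes to):

    parquet.block.size       = 256 MB            (example value)
    partitions open per task = 25                (hypothetical)
    write buffers per task   ~ 25 * 256 MB = 6,400 MB (about 6.25 GB)

That alone would blow past the 4.5 GB map heap (mapreduce.map.java.opts)
you listed, before counting anything else the task holds in memory.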
Another option is to use another Parquet library for the conversion and
shuffle the data in an MR round so that each reducer writes to just one
partition. If you need help doing this, I can recommend some tools.
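To sketch the shuffle idea within Hive itself (illustrative only; the
table, column, and partition names below are hypothetical, and this
stays on the Hive/Parquet write path rather than another library):

    -- Route all rows for a given partition key to the same reducer, so
    -- each reducer keeps only a small number of open Parquet writers
    -- (and their row-group buffers) in memory at once.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE target_table PARTITION (dt)
    SELECT col1, col2, dt
    FROM source_table
    DISTRIBUTE BY dt;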
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.