On 10/27/2014 11:10 AM, Suraj Nayak wrote:
Hi Parquet Developers,
I am using Parquet for analytics on AWS EC2. An INSERT OVERWRITE from
one small Hive table to another works fine with Parquet and Snappy,
but when the table has 1.2+ billion records, the INSERT OVERWRITE
fails with a Java heap space error.
A few other stats about the data:
Parquet version: 1.2.5 (CDH 5.1.3)
Total files in the HDFS directory: 555
Attached: PrintFooter Parquet.txt with the PrintFooter information.
(Note: the column names have been changed to random characters.)
HDFS block size: 256 MB
*YARN configuration:*
yarn.nodemanager.resource.memory-mb = 24GB
yarn.scheduler.minimum-allocation-mb = 2GB
mapreduce.map.memory.mb = 6GB
mapreduce.reduce.memory.mb = 12GB
mapreduce.map.java.opts = 4.5GB
mapreduce.reduce.java.opts = 9GB
yarn.nodemanager.vmem-pmem-ratio = 2.1
Can anyone help me debug or fix this issue? Is it fixed in any newer
Parquet version (> v1.2.5)?
Let me know if anyone needs more information.
Hi Suraj,
Can you give us a little more info? Specifically:
1. Approximately how many output partitions are you writing to?
2. What is the parquet.block.size setting in your config?
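If you're not sure what parquet.block.size is set to, a quick way to
check is to SET the property with no value from the Hive CLI, which
prints it (a minimal sketch; the dynamic-partition properties are only
relevant if you're using dynamic partitioning):

    -- Print the current Parquet row group ("block") size, in bytes.
    SET parquet.block.size;

    -- Dynamic-partition limits are worth a look too, since every open
    -- partition costs roughly one row group of write-buffer memory.
    SET hive.exec.max.dynamic.partitions;
    SET hive.exec.max.dynamic.partitions.pernode;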
It looks like you're probably running into trouble with the number of
open files, so I'll assume that's the case.
Hive will open a file in each partition written to by a task, and each
open file buffers up to roughly parquet.block.size bytes in memory. So
the total memory used is about parquet.block.size * the number of
partitions a task writes to. You can fix the problem by reducing the
number of output partitions, which usually requires breaking the
conversion into smaller jobs. You can also reduce the Parquet block
size, but I don't recommend it.
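To put rough numbers on that (illustrative only, since I don't know how
many partitions each of your tasks actually writes to):

    parquet.block.size       = 256 MB            (example value)
    partitions open per task = 25                (hypothetical)
    write buffers per task   ~ 25 * 256 MB = 6,400 MB (about 6.25 GB)

That alone would blow past the 4.5 GB map heap (mapreduce.map.java.opts)
you listed, before counting anything else the task holds in memory.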
Another option is to use another Parquet library for the conversion and
shuffle the data in an MR round so that each reducer writes to just one
partition. If you need help doing this, I can recommend some tools.
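To sketch the shuffle idea within Hive itself (illustrative only; the
table, column, and partition names below are hypothetical, and this
stays on the Hive/Parquet write path rather than another library):

    -- Route all rows for a given partition key to the same reducer, so
    -- each reducer keeps only a small number of open Parquet writers
    -- (and their row-group buffers) in memory at once.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE target_table PARTITION (dt)
    SELECT col1, col2, dt
    FROM source_table
    DISTRIBUTE BY dt;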
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.