Hi Ryan,

Thanks for the detailed info on total memory used.
The output table is partitioned by 2 columns: one column has 2 partition values and the other has 187 partitions. parquet.block.size is at its default; I have not specified it anywhere. If you can help me find the exact default value, that would be helpful.

I am also interested in the tools for getting this done. Kindly share them :)

On Mon, Oct 27, 2014 at 11:53 PM, Ryan Blue <[email protected]> wrote:

> On 10/27/2014 11:10 AM, Suraj Nayak wrote:
>
>> Hi Parquet Developers,
>>
>> I am using Parquet for analytics on AWS EC2. An INSERT OVERWRITE from one
>> small Hive table to another works fine with Parquet and Snappy. But when
>> the table has 1.2+ billion records, the insert overwrite fails due to a
>> Java heap issue.
>>
>> A few other stats about the data:
>>
>> Parquet version: 1.2.5 (CDH 5.1.3)
>> Total files in the HDFS directory: 555
>> Attached PrintFooter Parquet.txt with the PrintFooter information.
>> (Note: the column names have been changed to random characters.)
>> HDFS block size: 256 MB
>>
>> *Yarn configurations:*
>> yarn.nodemanager.resource.memory-mb = 24GB
>> yarn.scheduler.minimum-allocation-mb = 2GB
>> mapreduce.map.memory.mb = 6GB
>> mapreduce.reduce.memory.mb = 12GB
>> mapreduce.map.java.opts = 4.5GB
>> mapreduce.reduce.java.opts = 9GB
>> yarn.nodemanager.vmem-pmem-ratio = 2.1
>>
>> Can anyone help me debug or fix this issue? Is this fixed in any newer
>> Parquet version (>v1.2.5)?
>>
>> Let me know if anyone needs more information.
>
> Hi Suraj,
>
> Can you give us a little more info? Specifically:
> 1. Approximately how many output partitions are you writing to?
> 2. What is the parquet.block.size setting in your config?
>
> It looks like you're probably running into trouble with the number of open
> files, so I'll assume that . . .
>
> Hive will open a file in each partition written to by a task, which will
> use up to parquet.block.size bytes of memory (or about that). So the total
> memory used is parquet.block.size * # partitions. You can fix the problem
> by reducing the number of output partitions, which usually requires that
> you break up the conversion into smaller jobs. You can also reduce the
> parquet block size, but I don't recommend it.
>
> Another option is to use another Parquet library for the conversion and
> shuffle the data in an MR round so that each reducer writes to just one
> partition. If you need help doing this, I can recommend some tools.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.

--
Thanks
Suraj Nayak M
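For reference on the exact value: if parquet.block.size has not been set anywhere, parquet-mr falls back to its compiled-in default, which should be 128 MB (134217728 bytes) in the 1.2.x line. A quick way to check or pin it from the Hive session before the insert (a sketch only; whether the session setting is picked up depends on the Hive/CDH build):

    -- Prints parquet.block.size=<value> if it is set in the session or site
    -- config; if Hive reports it as undefined, the writer uses the compiled-in
    -- parquet-mr default (128 MB).
    SET parquet.block.size;

    -- Explicitly pinning it (here to the 128 MB default) removes the guesswork.
    SET parquet.block.size=134217728;

Plugging that default into Ryan's formula with the partition counts above gives an upper bound of roughly 128 MB * (2 * 187) = 128 MB * 374, or about 47 GB of buffered row groups for a single task that writes to every partition, far beyond the 4.5 GB mapreduce.map.java.opts heap.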
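On the shuffle approach Ryan mentions, one Hive-only way to approximate it (a sketch, not necessarily the tools he has in mind; target_table, source_table, the data columns, and the partition columns part_col1/part_col2 are placeholders) is to force a reduce phase keyed on the partition columns, so each reducer only keeps Parquet writers open for the few partitions hashed to it instead of every map task writing to all of them:

    -- Needed when the partition values come from the data itself.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- DISTRIBUTE BY routes all rows for a given (part_col1, part_col2) pair to
    -- a single reducer, so only that reducer buffers a row group for that
    -- partition's output file.
    INSERT OVERWRITE TABLE target_table PARTITION (part_col1, part_col2)
    SELECT col_a, col_b, col_c, part_col1, part_col2
    FROM source_table
    DISTRIBUTE BY part_col1, part_col2;

Splitting the conversion into smaller jobs, as Ryan suggests, limits memory the same way, as long as each individual job only writes to a handful of partitions.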
