Is there any way I can write the data from Apache Pig to the above-mentioned partitioned table via HCatalog?
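
What I am trying is roughly the sketch below (run with pig -useHCatalog; the
database, table, and column names are just placeholders for my real ones):

raw = LOAD 'mydb.source_table'
      USING org.apache.hive.hcatalog.pig.HCatLoader();

-- With no static partition spec, HCatStorer does dynamic partitioning and
-- takes the partition values from the partition columns in the relation.
STORE raw INTO 'mydb.target_table'
      USING org.apache.hive.hcatalog.pig.HCatStorer();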
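
Also, is the shuffle approach you mentioned (an extra MR round so that each
reducer writes to just one partition, or at least only a few) something along
the lines of this rough, untested Pig sketch (again, all names are placeholders)?

raw = LOAD 'mydb.source_table'
      USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Group on the two partition columns so that all rows of a partition land on
-- the same reducer; PARALLEL controls how many reducers share the ~374
-- partition combinations.
by_part = GROUP raw BY (part_col1, part_col2) PARALLEL 100;

-- Flatten back to rows and rename the fields so they match the target table
-- schema for HCatStorer.
rows = FOREACH by_part GENERATE
       FLATTEN(raw) AS (col_a, col_b, part_col1, part_col2);

STORE rows INTO 'mydb.target_table'
      USING org.apache.hive.hcatalog.pig.HCatStorer();

If that bounds the number of open Parquet writers per reducer, peak memory per
task should be closer to a few times parquet.block.size rather than roughly
374 (2 x 187) times the block size.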
On Tue, Oct 28, 2014 at 12:14 AM, Suraj Nayak <[email protected]> wrote:

> Hi Ryan,
>
> Thanks for the detailed info on total memory used.
>
> The output table is partitioned by 2 columns. One column has 2 output
> partitions and the other column has 187 partitions.
>
> parquet.block.size is at its default; I have not specified it anywhere.
> If you can help me get the exact value, it will be helpful.
>
> I am interested in the tools for getting this done. Kindly share them :)
>
>
> On Mon, Oct 27, 2014 at 11:53 PM, Ryan Blue <[email protected]> wrote:
>
>> On 10/27/2014 11:10 AM, Suraj Nayak wrote:
>>
>>> Hi Parquet Developers,
>>>
>>> I am using Parquet for analytics on AWS EC2. An INSERT OVERWRITE from
>>> one small Hive table to another works fine with Parquet and Snappy,
>>> but when the table has 1.2+ billion records the INSERT OVERWRITE
>>> fails with a Java heap error.
>>>
>>> A few other stats about the data:
>>>
>>> Parquet version: 1.2.5 (CDH 5.1.3)
>>> Total files in the HDFS directory: 555
>>> Attached PrintFooter Parquet.txt with the PrintFooter information.
>>> (Note: the column names have been changed to random characters.)
>>> HDFS block size: 256 MB
>>>
>>> *Yarn configurations:*
>>> yarn.nodemanager.resource.memory-mb = 24GB
>>> yarn.scheduler.minimum-allocation-mb = 2GB
>>> mapreduce.map.memory.mb = 6GB
>>> mapreduce.reduce.memory.mb = 12GB
>>> mapreduce.map.java.opts = 4.5GB
>>> mapreduce.reduce.java.opts = 9GB
>>> yarn.nodemanager.vmem-pmem-ratio = 2.1
>>>
>>> Can anyone help me debug or fix this issue? Is this fixed in any newer
>>> Parquet version (>v1.2.5)?
>>>
>>> Let me know if anyone needs more information.
>>>
>>
>> Hi Suraj,
>>
>> Can you give us a little more info? Specifically:
>> 1. Approximately how many output partitions are you writing to?
>> 2. What is the parquet.block.size setting in your config?
>>
>> It looks like you're probably running into trouble with the number of
>> open files, so I'll assume that . . .
>>
>> Hive will open a file in each partition written to by a task, which
>> will use up to parquet.block.size bytes of memory (or about that). So
>> the total memory used is parquet.block.size * # partitions. You can fix
>> the problem by reducing the number of output partitions, which usually
>> requires that you break up the conversion into smaller jobs. You can
>> also reduce the parquet block size, but I don't recommend it.
>>
>> Another option is to use another parquet library for the conversion,
>> and shuffle the data in a MR round so that each reducer writes to just
>> one partition. If you need help doing this, I can recommend some tools.
>>
>> rb
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>
>
>
> --
> Thanks
> Suraj Nayak M
>

--
Thanks
Suraj Nayak M
