Is there any way I can write the data from Apache Pig to the above-mentioned
partitioned table via HCatalog?
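
For reference, a rough sketch of what I have in mind is below. The database,
table and partition column names are placeholders, and depending on the
HCatalog version the package may be org.apache.hcatalog.pig instead of
org.apache.hive.hcatalog.pig:

    -- start Pig with HCatalog on the classpath: pig -useHCatalog
    raw = LOAD 'source_db.source_table'
          USING org.apache.hive.hcatalog.pig.HCatLoader();
    -- for dynamic partitioning, the values of the partition columns
    -- (part_col1, part_col2) must be present in the data being stored;
    -- HCatStorer('part_col1=x, part_col2=y') would write one static
    -- partition instead
    STORE raw INTO 'target_db.target_parquet_table'
          USING org.apache.hive.hcatalog.pig.HCatStorer();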

On Tue, Oct 28, 2014 at 12:14 AM, Suraj Nayak <[email protected]> wrote:

> Hi Ryan,
>
> Thanks for the detailed info on total memory used.
>
> The output table is partitioned by 2 columns: one column has 2 partition
> values and the other has 187.
>
> parquet.block.size is at its default; I have not specified it anywhere. If
> you can help me find out the exact value, that would be helpful.
>
> I am interested in knowing the tools for getting this done. Kindly share
> them :)
>
>
>
> On Mon, Oct 27, 2014 at 11:53 PM, Ryan Blue <[email protected]> wrote:
>
>> On 10/27/2014 11:10 AM, Suraj Nayak wrote:
>>
>>> Hi Parquet Developers,
>>>
>>> I am using Parquet for analytics on AWS EC2. An INSERT OVERWRITE from one
>>> small Hive table to another works fine with Parquet and Snappy, but when
>>> the source has 1.2+ billion records, the INSERT OVERWRITE fails due to a
>>> Java heap issue.
>>>
>>> A few other stats about the data:
>>>
>>> Parquet version: 1.2.5 (CDH 5.1.3)
>>> Total files in the HDFS directory: 555
>>> Attached PrintFooter Parquet.txt with the PrintFooter information.
>>> (Note: the column names have been changed to random characters.)
>>> HDFS block size: 256 MB
>>>
>>> *YARN configuration:*
>>> yarn.nodemanager.resource.memory-mb = 24GB
>>> yarn.scheduler.minimum-allocation-mb = 2GB
>>> mapreduce.map.memory.mb = 6GB
>>> mapreduce.reduce.memory.mb = 12GB
>>> mapreduce.map.java.opts = 4.5GB
>>> mapreduce.reduce.java.opts = 9GB
>>> yarn.nodemanager.vmem-pmem-ratio = 2.1
>>>
>>>
>>> Can anyone help me debug or fix this issue? Is this fixed in any newer
>>> Parquet version (> v1.2.5)?
>>>
>>> Let me know if anyone needs more information.
>>>
>>
>> Hi Suraj,
>>
>> Can you give us a little more info? Specifically:
>> 1. Approximately how many output partitions are you writing to?
>> 2. What is the parquet.block.size setting in your config?
>>
>> It looks like you're probably running into trouble with the number of
>> open files, so I'll assume that's what is happening.
>>
>> Hive will open a file in each partition written to by a task, which will
>> use up to parquet.block.size bytes of memory (or about that). So the total
>> memory used is parquet.block.size * # partitions. You can fix the problem
>> by reducing the number of output partitions, which usually requires that
>> you break up the conversion into smaller jobs. You can also reduce the
>> parquet block size, but I don't recommend it.
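>>
>> If you do want to experiment with a smaller block size anyway, it is
>> just a setting on the job, something like this before the INSERT:
>>
>>   SET parquet.block.size=67108864;  -- 64 MB instead of the 128 MB default
>>
>> Keep in mind that smaller row groups hurt scan performance, which is
>> why I'd treat this as a last resort.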
>>
>> Another option is to use another parquet library for the conversion, and
>> shuffle the data in an MR round so that each reducer writes to just one
>> partition. If you need help doing this, I can recommend some tools.
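>>
>> Just to illustrate the shuffle idea (this variant stays inside Hive
>> rather than using a separate library, and the column names below are
>> placeholders), you can force a reduce phase keyed on the partition
>> columns so that each reducer only writes to one or a few partitions:
>>
>>   SET hive.exec.dynamic.partition=true;
>>   SET hive.exec.dynamic.partition.mode=nonstrict;
>>   INSERT OVERWRITE TABLE target_parquet_table
>>     PARTITION (part_col1, part_col2)
>>   SELECT col_a, col_b, part_col1, part_col2
>>   FROM source_table
>>   DISTRIBUTE BY part_col1, part_col2;
>>
>> That keeps each task writing to only a handful of open Parquet files
>> at a time instead of one per partition it touches.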
>>
>> rb
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>
>
>
> --
> Thanks
> Suraj Nayak M
>



-- 
Thanks
Suraj Nayak M
