I would recommend the following things, in this order:

1) Increasing the heap size should help.
2) It seems you are on 0.7. There are a couple of memory fixes we have
committed, both on the 0.7 branch and on trunk; those should help as
well. So build Pig from either trunk or the 0.7 branch and use that.
3) Only if these don't help, try tuning the parameter
pig.cachedbag.memusage. By default it is set to 0.1; lowering it
should help. Try 0.05, then 0.02, and then further down. The downside
is that the lower you go, the slower your query will run. (A sketch of
how to pass these settings and build from the branch follows below.)
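
For (1) and (3), a rough sketch of how the settings can be passed. The
-Xmx value and 0.05 are just starting points, this assumes your cluster
lets jobs override the child-task heap, and myscript.pig is a placeholder
for your own script:

  # on the command line; -D properties typically need to come before the script name
  pig -Dmapred.child.java.opts=-Xmx2048m -Dpig.cachedbag.memusage=0.05 myscript.pig

  # or put the same properties in conf/pig.properties
  mapred.child.java.opts=-Xmx2048m
  pig.cachedbag.memusage=0.05

For (2), assuming the standard Apache SVN layout, something like:

  svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.7 pig-0.7
  cd pig-0.7 && ant jar    # should produce a pig.jar you can run your script with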

Let us know if these changes get your query to completion.

Ashutosh

On Thu, Jul 8, 2010 at 15:48, Syed Wasti <mdwa...@hotmail.com> wrote:
> Thanks Ashutosh, is there any workaround for this? Will increasing the heap
> size help?
>
>
> On 7/8/10 1:59 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>
>> Syed,
>>
>> You are likely hitting https://issues.apache.org/jira/browse/PIG-1442 .
>> Your query and stack trace look very similar to the one in the JIRA
>> ticket. This may get fixed in the 0.8 release.
>>
>> Ashutosh
>>
>> On Thu, Jul 8, 2010 at 13:42, Syed Wasti <mdwa...@hotmail.com> wrote:
>>> Sorry about the delay, I was held up with other things.
>>> Here is the script and the errors below;
>>>
>>> AA = LOAD 'table1' USING PigStorage('\t') as
>>> (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);
>>>
>>> AB = FOREACH AA GENERATE ID, e, f, n,o;
>>>
>>> AC = FILTER AB BY o == 1;
>>>
>>> AD = GROUP AC BY (ID, b);
>>>
>>> AE = FOREACH AD { A = DISTINCT AC.d;
>>>        GENERATE group.ID, (chararray) 'S' AS type, group.b,
>>>        (int) COUNT_STAR(AC) AS cnt, (int) COUNT(A) AS cnt_distinct; }
>>>
>>> The same steps are repeated to load 5 different tables and then a UNION is
>>> done on them.
>>>
>>> Final_res = UNION AE, AF, AG, AH, AI;
>>>
>>> The actual number of columns will be 15; here I am showing it with one table.
>>>
>>> Final_table =   FOREACH Final_res GENERATE ID,
>>>                (type == 'S' AND b == 1?cnt:0) AS 12_tmp,
>>>                (type == 'S' AND b == 2?cnt:0) AS 13_tmp,
>>>                (type == 'S' AND b == 1?cnt_distinct:0) AS 12_distinct_tmp,
>>>                (type == 'S' AND b == 2?cnt_distinct:0) AS 13_distinct_tmp;
>>>
>>> It works fine until here; it is only after adding this last part of the
>>> query that it starts throwing heap errors.
>>>
>>> grp_id =    GROUP Final_table BY ID;
>>>
>>> Final_data = FOREACH grp_id GENERATE group AS ID,
>>> SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
>>> SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);
>>>
>>> STORE Final_data;
>>>
>>>
>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>  at java.util.ArrayList.<init>(ArrayList.java:112)
>>>  at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
>>>  at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>>>  at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>  at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>>>  at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>
>>>
>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
>>>  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>
>>>
>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>  at java.util.AbstractList.iterator(AbstractList.java:273)
>>>  at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
>>>  at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
>>>  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>
>>>
>>> Error: GC overhead limit exceeded
>>> -------
>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>  at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>>>  at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>  at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>>>  at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>
>>>
>>>
>>> On 7/7/10 5:50 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>>>
>>>> Syed,
>>>>
>>>> One-line stack traces aren't much help :) Please provide the full stack
>>>> trace and the Pig script which produced it, and we can take a look.
>>>>
>>>> Ashutosh
>>>> On Wed, Jul 7, 2010 at 14:09, Syed Wasti <mdwa...@hotmail.com> wrote:
>>>>
>>>>>
>>>>> I am running my Pig scripts on our QA cluster (with 4 datanodes, see
>>>>> below), which has the Cloudera CDH2 release installed; the global heap
>>>>> max is -Xmx4096m. I am constantly getting OutOfMemory errors (see below)
>>>>> on my map and reduce jobs when I try to run my script against large
>>>>> data, where it produces around 600 maps.
>>>>> Looking for some tips on the best configuration for Pig to get rid of
>>>>> these errors. Thanks.
>>>>>
>>>>>
>>>>>
>>>>> Error: GC overhead limit exceeded
>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>
>>>>> Regards
>>>>> Syed
>>>>>
>>>
>>>
>>>
>>
>
>
>
