Hi Ashutosh,
I did not try options 2 and 3; I shall work on those sometime next week.
Increasing the heap size did not help on its own, but with the larger heap I
came up with a UDF to do the SUM on the grouped data in the last step of my
script, and the query now completes without any errors.

Syed


On 7/8/10 5:58 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:

> Aah... I forgot to mention how to set that param in 3). When launching
> Pig, provide it as a -D command-line switch, as follows:
> pig -Dpig.cachedbag.memusage=0.02f myscript.pig
> 
> On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
> <ashutosh.chau...@gmail.com> wrote:
>> I would recommend the following things, in order:
>> 
>> 1) Increasing the heap size should help (see the config sketch after
>> this list).
>> 2) It seems you are on 0.7. There are a couple of memory fixes we have
>> committed, both on the 0.7 branch and on trunk. Those should help as
>> well, so build Pig from either trunk or the 0.7 branch and use that.
>> 3) Only if these don't help, try tuning the param
>> pig.cachedbag.memusage. By default it is set to 0.1; lowering it
>> should help. Try 0.05, then 0.02, and then further down. The downside
>> is that the lower you go, the slower your query will run.
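>> 
>> For 1), on a Hadoop 0.20 / CDH2-style setup the per-task heap usually
>> comes from mapred.child.java.opts in mapred-site.xml; something like
>> the following (the 2048m value is only an example, size it to your
>> nodes):
>> 
>>   <property>
>>     <name>mapred.child.java.opts</name>
>>     <value>-Xmx2048m</value>
>>   </property>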
>> 
>> Let us know if these changes get your query to completion.
>> 
>> Ashutosh
>> 
>> On Thu, Jul 8, 2010 at 15:48, Syed Wasti <mdwa...@hotmail.com> wrote:
>>> Thanks Ashutosh. Is there any workaround for this? Will increasing
>>> the heap size help?
>>> 
>>> 
>>> On 7/8/10 1:59 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>>> 
>>>> Syed,
>>>> 
>>>> You are likely hitting https://issues.apache.org/jira/browse/PIG-1442 .
>>>> Your query and stack trace look very similar to the ones in the jira
>>>> ticket. This may get fixed in the 0.8 release.
>>>> 
>>>> Ashutosh
>>>> 
>>>> On Thu, Jul 8, 2010 at 13:42, Syed Wasti <mdwa...@hotmail.com> wrote:
>>>>> Sorry about the delay; I was held up with different things.
>>>>> Here is the script and the errors below;
>>>>> 
>>>>> AA = LOAD 'table1' USING PigStorage('\t') as
>>>>> (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);
>>>>> 
>>>>> AB = FOREACH AA GENERATE ID, b, d, e, f, n, o;
>>>>> 
>>>>> AC = FILTER AB BY o == 1;
>>>>> 
>>>>> AD = GROUP AC BY (ID, b);
>>>>> 
>>>>> AE = FOREACH AD {
>>>>>         A = DISTINCT AC.d;
>>>>>         GENERATE group.ID, (chararray) 'S' AS type, group.b,
>>>>>             (int) COUNT_STAR(AC) AS cnt, (int) COUNT(A) AS cnt_distinct;
>>>>> }
>>>>> 
>>>>> The same steps are repeated to load 5 different tables and then a UNION is
>>>>> done on them.
>>>>> 
>>>>> Final_res = UNION AE, AF, AG, AH, AI;
>>>>> 
>>>>> The actual number of columns will be 15; here I am showing just one table.
>>>>> 
>>>>> Final_table = FOREACH Final_res GENERATE ID,
>>>>>                (type == 'S' AND b == 1 ? cnt : 0) AS c12_tmp,
>>>>>                (type == 'S' AND b == 2 ? cnt : 0) AS c13_tmp,
>>>>>                (type == 'S' AND b == 1 ? cnt_distinct : 0) AS c12_distinct_tmp,
>>>>>                (type == 'S' AND b == 2 ? cnt_distinct : 0) AS c13_distinct_tmp;
>>>>> 
>>>>> It works fine until here; it is only after adding this last part of the
>>>>> query that it starts throwing heap errors.
>>>>> 
>>>>> grp_id =    GROUP Final_table BY ID;
>>>>> 
>>>>> Final_data = FOREACH grp_id GENERATE group AS ID,
>>>>> SUM(Final_table.c12_tmp), SUM(Final_table.c13_tmp),
>>>>> SUM(Final_table.c12_distinct_tmp), SUM(Final_table.c13_distinct_tmp);
>>>>> 
>>>>> STORE Final_data;
>>>>> 
>>>>> 
>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>  at java.util.ArrayList.<init>(ArrayList.java:112)
>>>>>  at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
>>>>>  at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>>>>>  at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>>>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>>>  at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>>>>>  at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>>>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>>>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>>>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>>> 
>>>>> 
>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
>>>>>  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
>>>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
>>>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
>>>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
>>>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>>> 
>>>>> 
>>>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>  at java.util.AbstractList.iterator(AbstractList.java:273)
>>>>>  at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
>>>>>  at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
>>>>>  at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
>>>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
>>>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
>>>>>  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
>>>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>>> 
>>>>> 
>>>>> Error: GC overhead limit exceeded
>>>>> -------
>>>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>  at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>>>>>  at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>>>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>>>  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>>>  at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>>>>>  at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>>>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>>  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>>>>>  at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>>>  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>>>>>  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>>>  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>>> 
>>>>> 
>>>>> 
>>>>> On 7/7/10 5:50 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>>>>> 
>>>>>> Syed,
>>>>>> 
>>>>>> One-line stack traces aren't much help :) Please provide the full
>>>>>> stack trace and the Pig script that produced it, and we can take a
>>>>>> look.
>>>>>> 
>>>>>> Ashutosh
>>>>>> On Wed, Jul 7, 2010 at 14:09, Syed Wasti <mdwa...@hotmail.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> I am running my Pig scripts on our QA cluster (4 datanodes, see
>>>>>>> below), which has the Cloudera CDH2 release installed; the global
>>>>>>> heap max is -Xmx4096m. I am constantly getting OutOfMemory errors
>>>>>>> (see below) on my map and reduce jobs when I run my script against
>>>>>>> large data, where it produces around 600 maps.
>>>>>>> I am looking for some tips on the best configuration for Pig to
>>>>>>> get rid of these errors. Thanks.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Error: GC overhead limit exceeded
>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>> 
>>>>>>> Regards
>>>>>>> Syed
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 

