Hi Syed,

Do you mean your query fails with an OOME if you use Pig's builtin SUM, but succeeds if you use your own SUM UDF? If that is so, that's interesting. I have a hunch why that is the case, but would like to confirm. Would you mind sharing your SUM UDF?
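To be concrete about my hunch: Pig's builtin SUM is algebraic, so Pig pushes it into the combiner, whereas a plain EvalFunc-based SUM is not, so the combiner gets skipped for that FOREACH. Just as an illustrative sketch (the class name and details here are made up, not necessarily what you wrote), a minimal non-algebraic SUM UDF would look roughly like this:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Sketch of a non-algebraic SUM: adds up the first field of every tuple
// in the input bag. Because it only extends EvalFunc and does not
// implement Algebraic, Pig will not run it inside the combiner.
public class SimpleSum extends EvalFunc<Long> {
    @Override
    public Long exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag bag = (DataBag) input.get(0);
        long sum = 0;
        for (Tuple t : bag) {
            Object val = t.get(0);
            if (val != null) {
                // The fields are cast to int in the script, so Number covers them.
                sum += ((Number) val).longValue();
            }
        }
        return sum;
    }
}

Such a UDF would be REGISTERed from a jar and called in place of SUM in the final FOREACH. If yours looks roughly like that, the difference in behavior would be consistent with the stack traces below, which are all on the combiner path.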
Ashutosh

On Fri, Jul 9, 2010 at 12:50, Syed Wasti <mdwa...@hotmail.com> wrote:
> Hi Ashutosh,
> Did not try options 2 and 3; I shall work on that sometime next week.
> But increasing the heap size did not help initially. With the increased heap
> size I came up with a UDF to do the SUM on the grouped data for the last
> step in my script, and it completes my query without any errors now.
>
> Syed
>
>
> On 7/8/10 5:58 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>
>> Aah.. forgot to tell how to set that param in 3). While launching
>> pig, provide it as a -D command-line switch, as follows:
>>
>> pig -Dpig.cachedbag.memusage=0.02f myscript.pig
>>
>> On Thu, Jul 8, 2010 at 17:45, Ashutosh Chauhan
>> <ashutosh.chau...@gmail.com> wrote:
>>> I would recommend the following things, in this order:
>>>
>>> 1) Increasing the heap size should help.
>>> 2) It seems you are on 0.7. There are a couple of memory fixes we have
>>> committed both on the 0.7 branch and on trunk. Those should help as
>>> well. So, build Pig either from trunk or from the 0.7 branch and use that.
>>> 3) Only if these don't help, you can try tuning the param
>>> pig.cachedbag.memusage. By default it is set to 0.1; lowering it
>>> should help. Try 0.05, then 0.02, and then further down. The downside is
>>> that as you go lower and lower, it will make your query go slower.
>>>
>>> Let us know if these changes get your query to completion.
>>>
>>> Ashutosh
>>>
>>> On Thu, Jul 8, 2010 at 15:48, Syed Wasti <mdwa...@hotmail.com> wrote:
>>>> Thanks Ashutosh, is there any workaround for this? Will increasing the
>>>> heap size help?
>>>>
>>>>
>>>> On 7/8/10 1:59 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>>>>
>>>>> Syed,
>>>>>
>>>>> You are likely hit by https://issues.apache.org/jira/browse/PIG-1442 .
>>>>> Your query and stack trace look very similar to the one in the jira
>>>>> ticket. This may get fixed by the 0.8 release.
>>>>>
>>>>> Ashutosh
>>>>>
>>>>> On Thu, Jul 8, 2010 at 13:42, Syed Wasti <mdwa...@hotmail.com> wrote:
>>>>>> Sorry about the delay, I was held up with different things.
>>>>>> Here are the script and the errors below:
>>>>>>
>>>>>> AA = LOAD 'table1' USING PigStorage('\t') AS
>>>>>>      (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);
>>>>>>
>>>>>> AB = FOREACH AA GENERATE ID, b, d, e, f, n, o;
>>>>>>
>>>>>> AC = FILTER AB BY o == 1;
>>>>>>
>>>>>> AD = GROUP AC BY (ID, b);
>>>>>>
>>>>>> AE = FOREACH AD {
>>>>>>          A = DISTINCT AC.d;
>>>>>>          GENERATE group.ID, (chararray) 'S' AS type, group.b,
>>>>>>              (int) COUNT_STAR(AC) AS cnt, (int) COUNT(A) AS cnt_distinct;
>>>>>>      }
>>>>>>
>>>>>> The same steps are repeated to load 5 different tables, and then a UNION
>>>>>> is done on them.
>>>>>>
>>>>>> Final_res = UNION AE, AF, AG, AH, AI;
>>>>>>
>>>>>> The actual number of columns will be 15; here I am showing it with one table.
>>>>>>
>>>>>> Final_table = FOREACH Final_res GENERATE ID,
>>>>>>     (type == 'S' AND b == 1 ? cnt : 0) AS 12_tmp,
>>>>>>     (type == 'S' AND b == 2 ? cnt : 0) AS 13_tmp,
>>>>>>     (type == 'S' AND b == 1 ? cnt_distinct : 0) AS 12_distinct_tmp,
>>>>>>     (type == 'S' AND b == 2 ? cnt_distinct : 0) AS 13_distinct_tmp;
>>>>>>
>>>>>> It works fine until here; it is only after adding this last part of the
>>>>>> query that it starts throwing heap errors.
>>>>>>
>>>>>> grp_id = GROUP Final_table BY ID;
>>>>>>
>>>>>> Final_data = FOREACH grp_id GENERATE group AS ID,
>>>>>>     SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
>>>>>>     SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);
>>>>>>
>>>>>> STORE Final_data;
>>>>>>
>>>>>>
>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>         at java.util.ArrayList.<init>(ArrayList.java:112)
>>>>>>         at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
>>>>>>         at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>>>>>>         at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>>>>>>         at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>>>>         at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>>>>         at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>>>>>>         at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>>>>>>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>>>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>>>         at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>>>>>>         at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>>>>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>>>>>>         at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>>>>
>>>>>>
>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
>>>>>>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>>>>         at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>>>>
>>>>>>
>>>>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>         at java.util.AbstractList.iterator(AbstractList.java:273)
>>>>>>         at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
>>>>>>         at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
>>>>>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
>>>>>>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>>>>>         at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>>>>
>>>>>>
>>>>>> Error: GC overhead limit exceeded
>>>>>> -------
>>>>>> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>>         at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>>>>>>         at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>>>>>>         at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>>>>>>         at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>>>>>>         at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>>>>>>         at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>>>>>>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>>>>>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>>>>>         at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>>>>>>         at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>>>>>>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>>>>>>         at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 7/7/10 5:50 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>>>>>>
>>>>>>> Syed,
>>>>>>>
>>>>>>> One-line stack traces aren't much help :) Please provide the full
>>>>>>> stack trace and the pig script which produced it, and we can take a look.
>>>>>>>
>>>>>>> Ashutosh
>>>>>>>
>>>>>>> On Wed, Jul 7, 2010 at 14:09, Syed Wasti <mdwa...@hotmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I am running my Pig scripts on our QA cluster (with 4 datanodes, see
>>>>>>>> below), which has the Cloudera CDH2 release installed and a global
>>>>>>>> heap max of Xmx4096m. I am constantly getting OutOfMemory errors
>>>>>>>> (see below) on my map and reduce jobs when I run my script against
>>>>>>>> large data, where it produces around 600 maps.
>>>>>>>> Looking for some tips on the best configuration for Pig and on how
>>>>>>>> to get rid of these errors. Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Error: GC overhead limit exceeded
>>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Syed
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
>