Syed,

You are likely hitting https://issues.apache.org/jira/browse/PIG-1442 . Your query and stack trace look very similar to the one in the jira ticket. This may get fixed in the 0.8 release.
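In the meantime, if you need an interim workaround, one thing you could try is turning the combiner off for this script: every one of your traces goes through the map-side combiner (PigCombiner / POCombinerPackage), which is where the nested DISTINCT bags get buffered. This is only a sketch from my side, not something I have run against your data, so please double-check that your Pig version honors the pig.exec.nocombiner property before relying on it:

    # disable Pig's combiner for this run only (trades extra shuffle data for lower map-side memory)
    pig -Dpig.exec.nocombiner=true yourscript.pig

("yourscript.pig" is just a placeholder for your actual script.) I have also put a rough restructuring sketch below your quoted mail.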
Ashutosh

On Thu, Jul 8, 2010 at 13:42, Syed Wasti <mdwa...@hotmail.com> wrote:
> Sorry about the delay, I was held up with different things.
> Here is the script and the errors below:
>
> AA = LOAD 'table1' USING PigStorage('\t') AS
>      (ID,b,c,d,e,f,g,h,i,j,k,l,m,n,o);
>
> AB = FOREACH AA GENERATE ID, b, d, e, f, n, o;
>
> AC = FILTER AB BY o == 1;
>
> AD = GROUP AC BY (ID, b);
>
> AE = FOREACH AD {
>          A = DISTINCT AC.d;
>          GENERATE group.ID, (chararray) 'S' AS type, group.b,
>                   (int) COUNT_STAR(AC) AS cnt, (int) COUNT(A) AS cnt_distinct;
>      }
>
> The same steps are repeated to load 5 different tables, and then a UNION is
> done on them.
>
> Final_res = UNION AE, AF, AG, AH, AI;
>
> The actual number of columns will be 15; here I am showing it with one table.
>
> Final_table = FOREACH Final_res GENERATE ID,
>     (type == 'S' AND b == 1 ? cnt : 0) AS 12_tmp,
>     (type == 'S' AND b == 2 ? cnt : 0) AS 13_tmp,
>     (type == 'S' AND b == 1 ? cnt_distinct : 0) AS 12_distinct_tmp,
>     (type == 'S' AND b == 2 ? cnt_distinct : 0) AS 13_distinct_tmp;
>
> It works fine until here; it is only after adding this last part of the
> query that it starts throwing heap errors.
>
> grp_id = GROUP Final_table BY ID;
>
> Final_data = FOREACH grp_id GENERATE group AS ID,
>     SUM(Final_table.12_tmp), SUM(Final_table.13_tmp),
>     SUM(Final_table.12_distinct_tmp), SUM(Final_table.13_distinct_tmp);
>
> STORE Final_data;
>
>
> Error: java.lang.OutOfMemoryError: Java heap space
>   at java.util.ArrayList.<init>(ArrayList.java:112)
>   at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:63)
>   at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>   at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>   at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>   at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>   at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>   at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>   at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>
>
> Error: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.createDataBag(POCombinerPackage.java:139)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:148)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>   at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>
>
> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at java.util.AbstractList.iterator(AbstractList.java:273)
>   at org.apache.pig.data.DefaultTuple.getMemorySize(DefaultTuple.java:185)
>   at org.apache.pig.data.InternalCachedBag.add(InternalCachedBag.java:89)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:168)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:168)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:159)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:50)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>   at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>
>
> Error: GC overhead limit exceeded
> -------
> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>   at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:55)
>   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:130)
>   at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:289)
>   at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>   at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>   at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
>   at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1227)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)
>
>
> On 7/7/10 5:50 PM, "Ashutosh Chauhan" <ashutosh.chau...@gmail.com> wrote:
>
>> Syed,
>>
>> One-line stack traces aren't much help :) Please provide the full stack
>> trace and the pig script which produced it, and we can take a look.
>>
>> Ashutosh
>>
>> On Wed, Jul 7, 2010 at 14:09, Syed Wasti <mdwa...@hotmail.com> wrote:
>>
>>>
>>> I am running my Pig scripts on our QA cluster (with 4 datanodes, see below),
>>> which has the Cloudera CDH2 release installed and a global heap max of
>>> Xmx4096m. I am constantly getting OutOfMemory errors (see below) on my map
>>> and reduce jobs when I try to run my script against large data, where it
>>> produces around 600 maps.
>>> Looking for some tips on the best configuration for Pig to get rid of
>>> these errors. Thanks.
>>>
>>>
>>>
>>> Error: GC overhead limit exceeded
>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>
>>> Regards
>>> Syed
>>>
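P.S. Since all of the traces above are in the combiner (POCombinerPackage / InternalCachedBag), another thing worth trying until the fix lands is to compute the distinct count in its own GROUP instead of inside the nested FOREACH, so no per-key bag has to be built up map-side. This is a rough, untested sketch against the aliases in your quoted script; it assumes d is the column you want distinct counts of and that AE should keep the same five output columns:

    -- plain count per (ID, b)
    AD     = GROUP AC BY (ID, b);
    AE_cnt = FOREACH AD GENERATE FLATTEN(group) AS (ID, b),
                 (chararray) 'S' AS type, (int) COUNT_STAR(AC) AS cnt;

    -- distinct count per (ID, b), computed without a nested bag
    AC_p   = FOREACH AC GENERATE ID, b, d;
    AC_d   = DISTINCT AC_p;
    AD_d   = GROUP AC_d BY (ID, b);
    AE_dst = FOREACH AD_d GENERATE FLATTEN(group) AS (ID, b),
                 (int) COUNT_STAR(AC_d) AS cnt_distinct;

    -- stitch the two results back into the shape AE had
    AE_j   = JOIN AE_cnt BY (ID, b), AE_dst BY (ID, b);
    AE     = FOREACH AE_j GENERATE AE_cnt::ID AS ID, AE_cnt::type AS type,
                 AE_cnt::b AS b, AE_cnt::cnt AS cnt,
                 AE_dst::cnt_distinct AS cnt_distinct;

This costs an extra map-reduce job per input table, but each GROUP can then use the combiner safely. If you stay on 0.7, you may also want to look at the pig.cachedbag.memusage property, which caps how much heap InternalCachedBag will use before spilling to disk; as with the combiner property above, please verify the name against the version you are running.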