Hi all,
 
I had a Pig script that worked completely fine. It called a memory-intensive UDF 
that loaded about 600 MB of data into each mapper, yet I was able to process 
and write the results. My mapper memory is 4096 MB and my HDFS block size is 128 MB. 
 
My input dataset (on a given date) is big enough to spawn about 960 mappers. 
 
A = load 'input data set' ..;
B = load 'smaller data set'..;
C = JOIN A by key, B by key using 'replicated';
D = foreach C generate field1, MyUDF(field2) as field2;
store D into 'deleteme';
 
As you can see, it is a map-only job. My output is about 960 part files, 
each around 25-35 MB.
 
I run this processing for each day. I now have a requirement to merge the results of 
the above processing with the results from another date and store only the unique records.
 
I added the following lines:
F = load 'previous date data' ..;
G = union D, F;
H = distinct G parallel $X;
store H into 'deleteme_H';
 
When I add these steps to my process, I get errors like "Java heap issue" in the 
mapper phase. I made F an empty data set, but I still get the same 
error. I wonder why I am getting "Java heap" errors. Is the solution to 
increase the mapper memory further?
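
For context, the memory settings I am referring to would look roughly like the following at the top of the script. This is only a sketch: the property names assume Hadoop 2 (YARN), and apart from the 4096 MB mapper size mentioned above, the values are illustrative rather than my actual configuration.

```pig
-- Container size for each mapper, in MB (the 4096 MB mentioned above).
SET mapreduce.map.memory.mb 4096;
-- JVM heap inside that container; this -Xmx value is an assumed example.
SET mapreduce.map.java.opts '-Xmx3276m';
-- Map-side sort buffer in MB (assumed value). It is taken from the mapper heap
-- only when the job has a reduce phase, which the distinct step introduces.
SET mapreduce.task.io.sort.mb 100;
```

If the mappers are hitting the JVM heap limit rather than the container size, raising only mapreduce.map.memory.mb may not be enough on its own.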
 
Thanks!   
 
 
 


      
