Hi Daniel,

You're correct that the distinct statement is causing the issue: if I comment out the distinct, the script runs fine. However, when I ran the script with the -Dpig.exec.nocombiner=true option, I still got the "Java heap issue" error in the mapper. Any idea why?
Thanks!

--- On Fri, 5/7/10, Daniel Dai <[email protected]> wrote:

From: Daniel Dai <[email protected]>
Subject: Re: Java heap issue
To: "[email protected]" <[email protected]>
Date: Friday, May 7, 2010, 10:58 PM

I suspect it is because of the distinct combiner. Try the option -Dpig.exec.nocombiner=true on the command line and see if it works.

Daniel

Kelvin Moss wrote:
> Hi all,
>
> I had a Pig script that worked completely fine. I called a memory-intensive
> UDF that brought some 600 MB of data into each mapper. However, I was able to
> process and write results. My mapper memory is 4096 MB. My HDFS block size is
> 128 MB. My input dataset (on a given date) is big enough to cause some 960
> mappers.
>
> A = load 'input data set' ..;
> B = load 'smaller data set'..;
> C = JOIN A by key, B by key using "replicated";
> D = foreach C generate field1, MyUDF(field2) as field2;
> E = store D into 'deleteme';
>
> As you can see, it is a map-only process. My output is some 960 part files,
> with each file being around 25-35 MB.
>
> I do this processing for each day. I now have a requirement to merge the
> results of the above processing with results from another date and store the
> unique results.
>
> I added the following lines:
>
> F = load 'previous date data'..;
> G = union E, F;
> H = distinct G parallel $X;
> store H into 'deleteme_H';
>
> When I add these steps to my process, I get errors like "Java heap issue" in
> the mapper phase. I made F a null data set, but I am still getting the
> same error. I wonder why I am getting "Java heap" errors. Is the solution to
> increase the mapper memory further?
>
> Thanks!
