Matteo,

It depends on how many reduce slots you actually have. The number of reduce slots available on the cluster is configured at the Hadoop level. What you are controlling with the PARALLEL keyword is the number of reducers you want to use.
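For example, something along these lines (relation and path names here are just for illustration):

    data = LOAD 'input/events' AS (user:chararray, value:int);
    -- PARALLEL sets the number of reduce tasks for this GROUP,
    -- and therefore the number of part-* files in the output
    grouped = GROUP data BY user PARALLEL 10;
    counts = FOREACH grouped GENERATE group, COUNT(data);
    STORE counts INTO 'output/counts';

With PARALLEL 10 you get 10 reduce tasks and therefore 10 part-* files, some of which may be empty if a reducer happens to receive no keys.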
If you use more than the available slots, you will have extra waves of reduces -- which may be what you want if your reduce jobs are big. In that case, reduce tasks that didn't have slots available will start as soon as reduce tasks ahead of them in the queue finish and free up capacity. Naturally, this means not everything runs in parallel at once. Again, depending on the nature of the job and the size of the cluster, this may be desirable.

If you use fewer reducers than the available slots, you will have slots left over for other jobs, or for restarting failed reduce tasks (if something goes wrong, speculative execution kicks in, etc.).

Regards,
-Dmitriy

On Tue, Nov 17, 2009 at 8:17 AM, Matteo Nasi <[email protected]> wrote:
> Thanks so much, and sorry for my inexperience ...
> I was also using this PARALLEL instruction in my script alongside instructions
> like GROUP, ORDER, ... to increase parallelism, and I'm trying to find the
> best value to use for these scripts on my small, modest cluster. In these
> first days of jobs, I set PARALLEL up to 10. That would follow the cookbook
> formula <num machines> * <num reduce slots per machine> * 0.9, assuming
> about 2 reduce slots per machine. What do you think?
>
> Thanks again.
>
> Matteo
>
> On Tue, Nov 17, 2009 at 4:54 PM, Bennie Schut <[email protected]> wrote:
>
>> Yes, it's normal; once you're used to it, it's not so bad. The same will
>> happen when you write a custom (non-Pig) MapReduce job.
>> In Pig you can use the "parallel 4;" syntax to specify the number of
>> reducers and thus the number of output files.
>>
>> zaki rahaman wrote:
>> > Hi Matteo,
>> >
>> > This is completely normal. Someone else can correct me if I'm wrong, but
>> > from my understanding, the number of part-000* files corresponds to the
>> > number of reducers you end up having for your cluster. These can be empty,
>> > and others will indeed contain the data you want. What I usually do is run
>> > a small script to collect the output data and format it appropriately...
>> > you could just do something as simple as cat output/* ... and I'm not sure
>> > this behavior is going to be changed anytime soon.
>> >
>> > On Tue, Nov 17, 2009 at 10:47 AM, Matteo Nasi <[email protected]> wrote:
>> >
>> >> Hi all,
>> >>
>> >> I'm new to Hadoop. I found Pig very quick and easy to learn, wrote my own
>> >> simple scripts and my first UDF. It's a loader UDF based on the piggybank
>> >> samples found in the 0.5.0 folders; it basically loads data matching a
>> >> fixed pattern, specified as a regex.
>> >> Everything is fine when run in -x local mode: I can manipulate the input
>> >> and generate output as well, and the output is a single part-00000 file
>> >> containing the data in the format I wanted.
>> >> When I try to run it on my 5-node Hadoop cluster with the -x mapred option
>> >> (0.20.1, same for 0.18.3 and the old 0.4.0 Pig), I see strange behaviour
>> >> in my output folder...
>> >> There is more than one part-* file; some of them are empty ... others
>> >> contain the data I found before in the local run, but split across
>> >> different files ...
>> >>
>> >> First question: why? Is it normal for the output to be split into multiple
>> >> part-* files when a script runs on the cluster?
>> >>
>> >> If there's nothing to be done about it, I can always reassemble them into
>> >> a single file with a perl or bash script after the copyToLocal operation,
>> >> but that doesn't seem too nice to me :-(
>> >>
>> >> Thanks in advance for any suggestion.
>> >>
>> >> Matteo
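Two notes on the questions quoted above (numbers and paths are illustrative):

Assuming the 2 reduce slots per machine Matteo mentions, the cookbook formula for a 5-node cluster gives 5 * 2 * 0.9 = 9, so PARALLEL 9 would leave a slot free for retries or speculative execution, while PARALLEL 10 uses every slot.

For reassembling the part files, rather than copyToLocal plus a perl or bash script, a one-liner such as

    hadoop fs -getmerge output/counts /tmp/counts.txt

or

    hadoop fs -cat output/counts/part-* > /tmp/counts.txt

pulls everything into a single local file.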
