thanks so much, and sorry for my inexperience ...
I was also using the PARALLEL instruction in my script, alongside statements
like GROUP, ORDER, ... , to increase parallelism, and I'm trying to find the
best value for these scripts on my small and underpowered cluster. In these
first days of jobs I set PARALLEL up to 10. That would follow the cookbook
formula <num machines> * <num reduce slots per machine> * 0.9, assuming about
2 reduce slots per machine. What do you think?
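
For example, with my 5 machines and 2 reduce slots each, the formula gives
roughly 5 * 2 * 0.9 = 9 reducers, so a statement along these lines (the
relation and field names are just placeholders):

    grouped = GROUP logs BY user PARALLEL 9;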

thanks again.

Matteo

On Tue, Nov 17, 2009 at 4:54 PM, Bennie Schut <[email protected]> wrote:

> Yes, it's normal; once you're used to it, it's not so bad. The same will
> happen when you write a custom (non-Pig) mapreduce job.
> In Pig you can use the "parallel 4;" syntax to specify the number of
> reducers and thus the number of output files.
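>
> For example (the alias and key names below are only placeholders):
>
>     B = GROUP A BY key PARALLEL 4;
>     STORE B INTO 'output';
>
> The reduce stage that writes 'output' then runs with 4 reducers, so you end
> up with 4 part-* files (some possibly empty).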
>
> zaki rahaman wrote:
> > Hi Matteo,
> >
> > This is completely normal. Someone else can correct me if I'm wrong, but
> > from my understanding, the number of part-000* files corresponds to the
> > number of reducers you end up having for your cluster. Some of these can
> > be empty and others will indeed contain the data you want. What I usually
> > do is run a small script to collect the output data and format it
> > appropriately... you could just do something as simple as cat output/* ...
> > and I'm not sure this behavior is going to be changed anytime soon.
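> >
> > As a concrete example, something along these lines (the paths here are
> > just an illustration):
> >
> >     hadoop fs -cat output/part-* > merged_output.txt
> >
> > or hadoop fs -getmerge output merged_output.txt if you want to pull it to
> > the local filesystem in one step.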
> >
> > On Tue, Nov 17, 2009 at 10:47 AM, Matteo Nasi <[email protected]> wrote:
> >
> >
> >> hi all,
> >>
> >> I'm new to Hadoop. I found Pig very quick and easy to learn, made my own
> >> simple scripts and my first UDF.
> >> It's a loader UDF based on the piggybank samples found in the 0.5.0
> >> folders; it basically loads data matching a fixed pattern, specified as a
> >> regex.
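> >>
> >> In the script I use it roughly like this (the class name, path and regex
> >> below are just placeholders, not my real ones):
> >>
> >>     raw = LOAD 'input/app.log' USING mypackage.MyRegexLoader('^(\\S+) (\\S+)$');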
> >> Everything is fine when run in -x local mode: I can manipulate the input
> >> and generate output as expected, and the output is a single part-00000
> >> file containing the data in the format I wanted.
> >> When I try to run it on my 5-node hadoop cluster with the -x mapred
> >> option (0.20.1, same for 0.18.3 and the old 0.4.0 pig) I see a strange
> >> behaviour in my output folder...
> >> There is more than one part-* file; some of them are empty ... some
> >> others contain the data I found before in the local run, but split across
> >> different files ...
> >>
> >> First question: why? Is it normal for the output to be split into
> >> different part-* files when a script runs on the cluster?
> >>
> >> If there's nothing to be done about it, I can always reassemble them into
> >> a single file with a perl or bash script after the copyToLocal operation,
> >> but that doesn't seem very nice to me :-(
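> >>
> >> Something along these lines is what I had in mind (the paths are only an
> >> example):
> >>
> >>     hadoop fs -copyToLocal output /tmp/pig_output
> >>     cat /tmp/pig_output/part-* > result.txt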
> >>
> >> thanks in advance for any suggestion.
> >>
> >> Matteo
> >>
> >>
> >
> >
> >
> >
>
>
