Hi Matteo,

This is completely normal. Someone else can correct me if I'm wrong, but
from my understanding, the number of part-000* files corresponds to the
number of reducers your job ends up using on the cluster. Some of these can
be empty while others will indeed contain the data you want. What I usually
do is run a small script to collect the output data and format it
appropriately; you could do something as simple as cat output/* . I'm not
sure this behavior is going to change anytime soon.
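For example, assuming the job wrote its results to an HDFS directory called
output/ (adjust the path to whatever your script uses), either of the
following should give you a single local file:

  # merge all the part-* files from HDFS straight into one local file
  hadoop fs -getmerge output/ result.txt

  # or, after a copyToLocal, concatenate the parts yourself
  cat output/part-* > result.txt

getmerge just concatenates the part files into a single local file, so it
saves you the separate copy-then-cat step.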

On Tue, Nov 17, 2009 at 10:47 AM, Matteo Nasi <[email protected]> wrote:

> hi all,
>
> I'm new to Hadoop. I found Pig very quick and easy to learn, and made my own
> simple scripts and my first UDF.
> It's a loader UDF based on the piggybank samples found in the 0.5.0 folders; it
> basically loads data matching a fixed pattern, specified with a regex.
> Everything is fine when run in -x local mode: I can manipulate the input and
> generate output, and the output is a single part-00000 file containing the data
> in the format I wanted.
> When I try to run it on my 5-node hadoop cluster with the -x mapred option
> (0.20.1, same for 0.18.3 and the old 0.4.0 pig) I see a strange behaviour in
> my output folder...
> there is more than one part-* file, some of them are empty... others
> contain the data I found before on the local run, but split across different
> files...
>
> First question is: why? Is it normal to have the output split into different
> part-* files when a script is executed on the cluster?
>
> If there's nothing to be done about it, I can always reassemble them into a
> single file with a perl or bash script after the copyToLocal operation, but
> that doesn't seem too nice to me :-(
>
> thanks in advance for any suggestion.
>
> Matteo
>



-- 
Zaki Rahaman
