Hi Matteo,

This is completely normal. Someone else can correct me if I'm wrong, but from my understanding the number of part-000* files corresponds to the number of reducers your job ends up with on the cluster. Some of them can be empty while others will indeed contain the data you want. What I usually do is run a small script to collect the output data and format it appropriately; you could do something as simple as cat output/*. I'm not sure this behavior is going to be changed anytime soon.
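If you just want everything in a single local file, something along these lines should also work (untested sketch; I'm assuming your output directory is literally called "output" and you're on a recent Hadoop client):

    # merge every part-* file in the HDFS output dir into one local file
    hadoop fs -getmerge output /tmp/output-merged.txt

    # or, after you've done a copyToLocal, merge on the local filesystem
    cat output/part-* > output-merged.txt

getmerge saves you the extra copyToLocal step since it concatenates the parts as it pulls them down.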
On Tue, Nov 17, 2009 at 10:47 AM, Matteo Nasi <[email protected]> wrote:
> hi all,
>
> I'm new to Hadoop. I found Pig very quick and easy to learn, wrote my own
> simple scripts and my first UDF. It's a loader UDF based on the piggybank
> samples found in the 0.5.0 folders; it basically loads data matching a fixed
> pattern, specified as a regex.
> Everything is fine when run in -x local mode: I can manipulate the input and
> generate output as well, and the output is a single part-00000 file
> containing the data in the format I wanted.
> When I run it on my 5-node Hadoop cluster with the -x mapred option (0.20.1,
> same for 0.18.3 and the old 0.4.0 Pig) I get a strange behaviour in my
> output folder: there is more than one part-* file, some of them are empty,
> and the others contain the data I saw in the local run but split across
> different files.
>
> First question: why? Is it normal to have the output split into different
> part-0000* files when the script runs on the cluster?
>
> If there's nothing to be done about it I can always reassemble them into a
> single file with a perl or bash script after the copyToLocal operation, but
> that doesn't seem too nice to me :-(
>
> thanks in advance for any suggestion.
>
> Matteo
>

-- 
Zaki Rahaman
