Hi, I'm running some experiments with Hadoop streaming. I always get an output_dir/part-00000 file at the end, but I wonder: when exactly does this filename show up? Only once the file has been completely written, or already while the MapReduce framework is still writing to it? In other words, is the write atomic?
The reason I'm asking: I have a script which submits roughly 200 jobs to MapReduce, and another script collecting the part-00000 files of all jobs (not just once when all experiments are done; I frequently collect the results of the jobs that have finished so far). For this, I just do (simplified code):

    for i in $(seq 1 200); do
      if ssh $master "bin/hadoop dfs -ls $i/output/part-00000"; then
        ssh $master "bin/hadoop dfs -cat $i/output/part-00000" > output_$i
      fi
    done

and I wonder whether this is prone to race conditions: is there any chance this runs while $i/output/part-00000 is in the process of being written to, so that I end up with incomplete output_$i files? If so, what's the proper way to check whether the file is really "stable"? Fetching the jobtracker web page and checking whether job $i has finished?

Dieter
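P.S. For concreteness, here is the kind of guard I have in mind: only collect a job's output once its output directory contains the empty _SUCCESS marker file that, as far as I understand, newer Hadoop versions write when a job completes successfully. The sketch below simulates this with local directories standing in for HDFS, so all paths and file contents are made up; in the real script the `[ -e ... ]` test would presumably become something like `ssh $master "bin/hadoop dfs -test -e $i/output/_SUCCESS"`. Is this the right approach?

```shell
#!/bin/sh
# Simulate the "collect only finished jobs" guard locally.
# Local directories stand in for HDFS; names and contents are made up.
tmp=$(mktemp -d)
mkdir -p "$tmp/1/output" "$tmp/2/output"

# Job 1 has finished: its output directory contains the _SUCCESS marker.
echo "result-1" > "$tmp/1/output/part-00000"
touch "$tmp/1/output/_SUCCESS"

# Job 2 is still running: a part file exists, but no _SUCCESS marker yet.
echo "partial-2" > "$tmp/2/output/part-00000"

for i in 1 2; do
  # Only collect a job's output once the completion marker is present.
  if [ -e "$tmp/$i/output/_SUCCESS" ]; then
    cat "$tmp/$i/output/part-00000" > "$tmp/output_$i"
  fi
done
# Afterwards only output_1 exists; job 2 was skipped as unfinished.
```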