Hi, I'm running some experiments with Hadoop streaming. I always get an output_dir/part-00000 file at the end, but I wonder: when exactly does this filename show up? Only once the file has been completely written, or already while the MapReduce framework is still writing to it? In other words, is the write atomic?
The reason I'm asking: I have a script which submits roughly 200 jobs to MapReduce, and another script collecting the part-00000 files of all jobs (not just once when all experiments are done; I frequently collect the results of the jobs that have finished so far). For this, I just do (simplified code):

    for i in $(seq 1 200); do
      if ssh $master "bin/hadoop dfs -ls $i/output/part-00000"; then
        ssh $master "bin/hadoop dfs -cat $i/output/part-00000" > output_$i
      fi
    done

and I wonder whether this is prone to race conditions: is there any chance this runs while $i/output/part-00000 is in the process of being written to, so that I end up with incomplete output_$i files? If so, what's the proper way to check whether the file is really "stable"? Fetching the jobtracker web page and checking whether job $i has finished?

Dieter
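P.S. For concreteness, here is the kind of guard I have in mind: only collect a job's output once its output directory contains the empty _SUCCESS marker file that, as far as I understand, newer Hadoop versions write when a job completes successfully. The sketch below simulates this with local directories standing in for HDFS, so all paths and file contents are made up; in the real script the `[ -e ... ]` test would presumably become something like `ssh $master "bin/hadoop dfs -test -e $i/output/_SUCCESS"`. Is this the right approach?

```shell
#!/bin/sh
# Simulate the "collect only finished jobs" guard locally.
# Local directories stand in for HDFS; names and contents are made up.
tmp=$(mktemp -d)
mkdir -p "$tmp/1/output" "$tmp/2/output"

# Job 1 has finished: its output directory contains the _SUCCESS marker.
echo "result-1" > "$tmp/1/output/part-00000"
touch "$tmp/1/output/_SUCCESS"

# Job 2 is still running: a part file exists, but no _SUCCESS marker yet.
echo "partial-2" > "$tmp/2/output/part-00000"

for i in 1 2; do
  # Only collect a job's output once the completion marker is present.
  if [ -e "$tmp/$i/output/_SUCCESS" ]; then
    cat "$tmp/$i/output/part-00000" > "$tmp/output_$i"
  fi
done
# Afterwards only output_1 exists; job 2 was skipped as unfinished.
```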