This is a question I should go and test myself, but I was wondering if anyone has a quick answer.

We have map/reduce jobs that write lots of small files to a folder. We also have a Hive external table pointed at this folder. We have a tool, FileCrusher, which combines multiple small files (TEXT and SEQUENCE) into one large file. (We are going to open source it to help people with lots-of-small-files problems.) It is launched something like this:

FileCrusher /src/folder

This process builds one large file in a temp directory; once done, it moves the old files to a junk folder and moves the new file into /src/folder.

What I am looking to figure out is this: if a map/reduce job is started before the files are moved, so the splits have been calculated and the job is running, what will happen if I then move the files in /src/folder and replace them with a new file?

I am hoping that, since the splits are associated with blocks, the job will produce correct results no matter when the files are moved. In other words, after split calculation the job should be "atomic".

Regards,
Edward
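
To make the move sequence concrete, here is a minimal sketch of the swap step described above. It is not the actual FileCrusher code: it uses Python on a local filesystem as a stand-in for HDFS, and `crush_and_swap`, its parameters, and the `crushed-00000` file name are hypothetical names chosen only for illustration.

```python
import os
import shutil


def crush_and_swap(src_dir, junk_dir, tmp_dir, merged_name="crushed-00000"):
    """Merge every file in src_dir into one file, then swap it into src_dir.

    Hypothetical sketch of the sequence in the post; local paths stand in
    for HDFS paths. Returns the names of the small files that were replaced.
    """
    small_files = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )

    # Step 1: build the single large file in the temp directory.
    merged_tmp = os.path.join(tmp_dir, merged_name)
    with open(merged_tmp, "wb") as out:
        for name in small_files:
            with open(os.path.join(src_dir, name), "rb") as part:
                shutil.copyfileobj(part, out)

    # Step 2: move the old small files to the junk folder.
    # This is the window the question is about: a job that computed its
    # splits before this point still refers to the old files.
    for name in small_files:
        shutil.move(os.path.join(src_dir, name), os.path.join(junk_dir, name))

    # Step 3: move the merged file into the source folder.
    shutil.move(merged_tmp, os.path.join(src_dir, merged_name))
    return small_files
```

Between step 2 and step 3 the source folder is briefly empty, so the swap as a whole is not a single atomic operation in this sketch; the question is what a job that is already running sees during and after that window.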
