Who is in charge of getting the files there for the first time? The addCacheFile call in the mapreduce job? Or a manual setup by the user/operator?
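For concreteness, here is a minimal sketch of the kind of driver call in question (the class name and HDFS path are made up). The lookup file has to exist in HDFS already; addCacheFile only records its URI in the job configuration, and the TaskTrackers copy it down into their local caches when the job starts:

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LookupJobDriver {                               // hypothetical driver
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LookupJobDriver.class);
        conf.setJobName("lookup-join");

        // Register the already-present HDFS file for caching; this only
        // records the URI in the job conf, no copying happens here.
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:9000/user/me/lookup.txt"), conf);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // ... set mapper/reducer classes here ...

        JobClient.runJob(conf);
    }
}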
On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans <[email protected]> wrote:

> The problem is step 4 in the breaking sequence. Currently the TaskTracker never looks at the disk to know whether a file is in the distributed cache or not. It assumes that if it downloaded the file and did not delete that file itself, then the file is still there in its original form. It does not know that you deleted those files, or if you wrote to the files, or in any way altered those files. In general you should not be modifying those files. This is not only because it messes up the tracking of those files, but because other jobs running concurrently with your task may also be using those files.
>
> --Bobby Evans
>
> On 9/26/11 4:40 PM, "Meng Mao" <[email protected]> wrote:
>
> Let's frame the issue in another way. I'll describe a sequence of Hadoop operations that I think should work, and then I'll get into what we did and how it failed.
>
> Normal sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache space
> 3. job runs fine
> 4. Run Job A some time later. job runs fine again.
>
> Breaking sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache space
> 3. job runs fine
> 4. Manually delete the cached files from local disk on the worker nodes
> 5. Run Job A again, expecting it to push out cache copies as needed.
> 6. job fails because the cache copies didn't get distributed
>
> Should this second sequence have broken?
>
> On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao <[email protected]> wrote:
>
> > Hmm, I must have really missed an important piece somewhere. This is from the MapRed tutorial text:
> >
> > "DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications.
> >
> > Applications specify the files to be cached via urls (hdfs://) in the JobConf. The DistributedCache *assumes that the files specified via hdfs:// urls are already present on the FileSystem.*
> >
> > *The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.* Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves."
> >
> > After some close reading, the two bolded pieces seem to be in contradiction of each other? I'd always assumed that addCacheFile() would perform the 2nd bolded statement. If that sentence is true, then I still don't have an explanation of why our job didn't correctly push out new versions of the cache files upon the startup and execution of JobConfiguration. We deleted them before our job started, not during.
> >
> > On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans <[email protected]> wrote:
> >
> >> Meng Mao,
> >>
> >> The way the distributed cache is currently written, it does not verify the integrity of the cache files at all after they are downloaded. It just assumes that if they were downloaded once, they are still there and in the proper shape. It might be good to file a JIRA to add in some sort of check. Another thing to note is that the distributed cache also includes the time stamp of the original file, just in case you delete the file and then use a different version. So if you want to force a download again, you can copy the file, delete the original, and then move the copy back to where it was before.
> >>
> >> --Bobby Evans
> >>
> >> On 9/23/11 1:57 AM, "Meng Mao" <[email protected]> wrote:
> >>
> >> We use the DistributedCache class to distribute a few lookup files for our jobs. We have been aggressively deleting failed task attempts' leftover data, and our script accidentally deleted the path to our distributed cache files.
> >>
> >> Our task attempt leftover data was here [per node]:
> >> /hadoop/hadoop-metadata/cache/mapred/local/
> >> and our distributed cache path was:
> >> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/<nameNode>
> >> We deleted this path by accident.
> >>
> >> Does this latter path look normal? I'm not that familiar with DistributedCache, but I'm up right now investigating the issue so I thought I'd ask.
> >>
> >> After that deletion, the first 2 jobs to run (which use the addCacheFile method to distribute their files) didn't seem to push the files out to the cache path, except on one node. Is this expected behavior? Shouldn't addCacheFile check to see if the files are missing, and if so, repopulate them as needed?
> >>
> >> I'm trying to get a handle on whether it's safe to delete the distributed cache path when the grid is quiet and no jobs are running. That is, whether addCacheFile is designed to be robust against the files it's caching not being there at each job start.
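Re Bobby's copy/delete/move-back trick above: since the distributed cache keys off the original file's timestamp, giving the HDFS file a new modification time (same bytes) should make the TaskTrackers fetch it again. A sketch of that with the FileSystem API, using made-up paths; the same thing can be done by hand with hadoop fs -cp / -rm / -mv:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ForceCacheRefresh {                             // hypothetical helper
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path original = new Path("/user/me/lookup.txt");     // made-up cache file
        Path scratch  = new Path("/user/me/lookup.txt.tmp");

        // Copy aside, delete the original, move the copy back. The bytes are
        // identical, but the file now carries a new modification time, so the
        // next job sees a timestamp mismatch and re-downloads it to each node.
        FileUtil.copy(fs, original, fs, scratch, false, conf); // false = keep source
        fs.delete(original, false);                            // non-recursive delete
        fs.rename(scratch, original);
    }
}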

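For reference, tasks typically pick up the local copies along these lines (a sketch; the class name is made up). Nothing in this path goes back to HDFS, which fits what Bobby describes: if the local copy under taskTracker/archive has been deleted out from under the framework, the task simply fails to find the file instead of triggering a fresh download.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class LookupMapperBase extends MapReduceBase {        // hypothetical mapper base
    @Override
    public void configure(JobConf conf) {
        try {
            // Paths returned here point at the TaskTracker's local cache
            // directory (e.g. .../mapred/local/taskTracker/archive/...).
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
            // ... load the lookup data into memory ...
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not read cached lookup file", e);
        }
    }
}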