Who is in charge of getting the files there for the first time? The addCacheFile call in the mapreduce job? Or a manual setup by the user/operator?
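For concreteness, here is a minimal sketch of the kind of driver call in question (the class name and HDFS path are made up). The lookup file has to exist in HDFS already; addCacheFile only records its URI in the job configuration, and the TaskTrackers copy it down into their local caches when the job starts:

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LookupJobDriver {                               // hypothetical driver
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LookupJobDriver.class);
        conf.setJobName("lookup-join");

        // Register the already-present HDFS file for caching; this only
        // records the URI in the job conf, no copying happens here.
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:9000/user/me/lookup.txt"), conf);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // ... set mapper/reducer classes here ...

        JobClient.runJob(conf);
    }
}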
On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans <[email protected]> wrote:

> The problem is step 4 in the breaking sequence. Currently the TaskTracker never looks at the disk to know whether a file is in the distributed cache or not. It assumes that if it downloaded the file and did not delete that file itself, then the file is still there in its original form. It does not know that you deleted those files, or if you wrote to the files, or in any way altered those files. In general you should not be modifying those files. This is not only because it messes up the tracking of those files, but because other jobs running concurrently with your task may also be using those files.
>
> --Bobby Evans
>
> On 9/26/11 4:40 PM, "Meng Mao" <[email protected]> wrote:
>
> Let's frame the issue in another way. I'll describe a sequence of Hadoop operations that I think should work, and then I'll get into what we did and how it failed.
>
> Normal sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache space
> 3. job runs fine
> 4. Run Job A some time later. job runs fine again.
>
> Breaking sequence:
> 1. have files to be cached in HDFS
> 2. Run Job A, which specifies those files to be put into DistributedCache space
> 3. job runs fine
> 4. Manually delete the cached files from local disk on the worker nodes
> 5. Run Job A again, expecting it to push out cache copies as needed.
> 6. job fails because the cache copies didn't get distributed
>
> Should this second sequence have broken?
>
> On Fri, Sep 23, 2011 at 3:09 PM, Meng Mao <[email protected]> wrote:
>
> > Hmm, I must have really missed an important piece somewhere. This is from the MapRed tutorial text:
> >
> > "DistributedCache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications.
> >
> > Applications specify the files to be cached via urls (hdfs://) in the JobConf. The DistributedCache *assumes that the files specified via hdfs:// urls are already present on the FileSystem.*
> >
> > *The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.* Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves."
> >
> > After some close reading, the two bolded pieces seem to be in contradiction of each other? I'd always assumed that addCacheFile() would perform the 2nd bolded statement. If that sentence is true, then I still don't have an explanation of why our job didn't correctly push out new versions of the cache files upon the startup and execution of JobConfiguration. We deleted them before our job started, not during.
> >
> > On Fri, Sep 23, 2011 at 9:35 AM, Robert Evans <[email protected]> wrote:
> >
> >> Meng Mao,
> >>
> >> The way the distributed cache is currently written, it does not verify the integrity of the cache files at all after they are downloaded. It just assumes that if they were downloaded once, they are still there and in the proper shape. It might be good to file a JIRA to add in some sort of check. Another thing to note is that the distributed cache also includes the time stamp of the original file, just in case you delete the file and then use a different version. So if you want to force a download again, you can copy the file, delete the original, and then move the copy back to where it was before.
> >>
> >> --Bobby Evans
> >>
> >> On 9/23/11 1:57 AM, "Meng Mao" <[email protected]> wrote:
> >>
> >> We use the DistributedCache class to distribute a few lookup files for our jobs. We have been aggressively deleting failed task attempts' leftover data, and our script accidentally deleted the path to our distributed cache files.
> >>
> >> Our task attempt leftover data was here [per node]:
> >> /hadoop/hadoop-metadata/cache/mapred/local/
> >> and our distributed cache path was:
> >> hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/<nameNode>
> >> We deleted this path by accident.
> >>
> >> Does this latter path look normal? I'm not that familiar with DistributedCache, but I'm up right now investigating the issue so I thought I'd ask.
> >>
> >> After that deletion, the first 2 jobs to run (which use the addCacheFile method to distribute their files) didn't seem to push the files out to the cache path, except on one node. Is this expected behavior? Shouldn't addCacheFile check to see if the files are missing, and if so, repopulate them as needed?
> >>
> >> I'm trying to get a handle on whether it's safe to delete the distributed cache path when the grid is quiet and no jobs are running. That is, whether addCacheFile is designed to be robust against the files it's caching not being there at each job start.
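Re Bobby's copy/delete/move-back trick above: since the distributed cache keys off the original file's timestamp, giving the HDFS file a new modification time (same bytes) should make the TaskTrackers fetch it again. A sketch of that with the FileSystem API, using made-up paths; the same thing can be done by hand with hadoop fs -cp / -rm / -mv:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ForceCacheRefresh {                             // hypothetical helper
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path original = new Path("/user/me/lookup.txt");     // made-up cache file
        Path scratch  = new Path("/user/me/lookup.txt.tmp");

        // Copy aside, delete the original, move the copy back. The bytes are
        // identical, but the file now carries a new modification time, so the
        // next job sees a timestamp mismatch and re-downloads it to each node.
        FileUtil.copy(fs, original, fs, scratch, false, conf); // false = keep source
        fs.delete(original, false);                            // non-recursive delete
        fs.rename(scratch, original);
    }
}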

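For reference, tasks typically pick up the local copies along these lines (a sketch; the class name is made up). Nothing in this path goes back to HDFS, which fits what Bobby describes: if the local copy under taskTracker/archive has been deleted out from under the framework, the task simply fails to find the file instead of triggering a fresh download.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class LookupMapperBase extends MapReduceBase {        // hypothetical mapper base
    @Override
    public void configure(JobConf conf) {
        try {
            // Paths returned here point at the TaskTracker's local cache
            // directory (e.g. .../mapred/local/taskTracker/archive/...).
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
            // ... load the lookup data into memory ...
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not read cached lookup file", e);
        }
    }
}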