Hi,

> From the book: "Hadoop The definitive guide" -- P242
> >>>
> When you launch a job, Hadoop copies the files specified by the -files and
> -archives options to the jobtracker’s filesystem (normally HDFS). Then,
> before a task is run, the tasktracker copies the files from the jobtracker’s
> filesystem to a local disk—the cache—so the task can access the files.
> >>>
>
> I wonder why Hadoop wants to copy the files to the jobtracker's filesystem.
> Since it is already in HDFS, it should be available to tasks.
> Any considerations?
Unlike the input data files for M/R tasks, -files and -archives are options for copying additional files (configuration files, for example) that all the M/R tasks might need when running. Such files typically start out on the local machine where the job is launched, not in HDFS, so they have to be transferred to the cluster nodes where the tasks run. Think of these options as convenient shortcuts for distributing files to all the tasks.

Makes sense?

Thanks
Hemanth
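
P.S. To make the task side concrete, here is a rough sketch. It assumes the job was launched with something like `-files lookup.txt`, and that the file then shows up under the same name in the task's working directory. The file name, the tab-separated format and the class `SideFileLoader` are all made up for illustration; this is plain Java with no Hadoop calls in it, not code from the book.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: loads a small side file that was shipped with the job.
public class SideFileLoader {

    // Parses "key<TAB>value" lines into a map; lines without a tab are skipped.
    public static Map<String, String> load(Path file) throws IOException {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // not a key/value line
                }
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the distributed file is visible under its own name,
        // relative to the current working directory.
        String name = args.length > 0 ? args[0] : "lookup.txt";
        Map<String, String> table = load(Paths.get(name));
        System.out.println("Loaded " + table.size() + " entries from " + name);
    }
}
```

In a real job you would probably call something like `SideFileLoader.load(Paths.get("lookup.txt"))` once per task (for example from the mapper's setup code) rather than from `main`. The point is that the task only ever reads a local copy.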
