[
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898835#action_12898835
]
Joydeep Sen Sarma commented on MAPREDUCE-1901:
----------------------------------------------
the proposal is below - rephrases some of the discussions above, addresses some
of the comments around race conditions and points out limitations. Junjie will
post a patch tomorrow (which probably needs some more work).
h4. Background
Hadoop map-reduce jobs commonly require jars, executables, archives and other
resources for task execution on hadoop cluster nodes. A common deployment
pattern for Hadoop applications is that the required resources are deployed
centrally by administrators (either on a shared file system or deployed on
standard local file system paths by package management tools). Users launch
Hadoop jobs from these installation points. Applications use apis
(-libjars/files/archives) provided by Hadoop to upload resources (from the
installation point) so that they are made available for task execution. This
behavior makes deployment of Hadoop applications very easy (just use standard
package management tools).
As an example, Facebook has a few different Hive installations (of different
versions) deployed on NFS filer. Each has a multitude of jar files - with only
some differing across different Hive versions. Users also maintain a repository
of map-reduce scripts and jar files contain Hive extensions (user defined
functions) on a NFS filer. Any installation of Hive can be used to execute jobs
against any of multiple map-reduce clusters. Most of the jar files are also
required locally (by the Hive client) - either for query compilation or for
local execution (either in hadoop local mode or for some special types of
queries).
h4. Problems
With the above arrangement - each (non local-mode) Hadoop job will upload all
the required jar files into HDFS. TaskTrackers will download these jars from
HDFS (at most once per job) and check modification times of downloaded files
(second task onwards) The following overheads are observed:
- Job submission latency is impacted because of the need to serially upload
multiple jar files into HDFS. At Facebook - we typically see 5-6 seconds of
pause in this stage (depends on how responsive DFS is on a given day)
- There is some latency in setting up the first task as resources must be
downloaded from HDFS. We have typically observed this to be around 2-3 seconds
at Facebook.
- For subsequent tasks - the latency impact is not as high - but the mtime
check adds to general Namenode pressure.
h4. Observations
- jars and other resources are shared across different jobs and users. there
are, in fact, hardly any resources that are not shared.
- these resources are meant to be immutable
We would like to use these properties to solve some of the overheads in the
current protocol while retaining the simplicity of the deployment model that
exists today.
h4. General Approach
We would like to introduce (for lack of a better term) the notion of Content
Addressible Resources (CAR) that are stored in a central repository in HDFS:
# CAR jars/files/archives are be identified by their content (for example -
named using their md5 checksum).\\
\\
This allows different jobs to share resources. Each Job can find out
whether the resources required by it are already available in HDFS (by
comparing the md5 signatures of their resources against the contents in the CAR
repository).\\
\\
# Content Addressible resources (once uploaded) are immutable. They can only be
garbage collected (by server side daemons).
This allows TaskTrackers to skip mtime checks on such resources.
The CAR functionality is exposed to clients in two ways:
- a boolean configuration option (defaulting to false) to indicate that
resources added via -libjars/files/archives options are content addressible
- enhancing the Distributed Cache api to mark specific files/archives as CAR
(similar to how specific files can be marked public)
h4. Protocol
Assume a jobclient has a CAR file _F_ on local disk to be uploaded for task
execution. Here's approximately a trace of what happens from the beginning of
the job to it's end:
# Client computes the md5 signature of F (= _md5_F_)
#- One can additionally provide an option to skip this step - the md5 can be
precomputed and stored in a file named _F.md5_ stored alongside _F_. The client
can look for and use the contents of this file as the md5 sum.\\
\\
# The client fetches (in a single filesystem call) the list of md5 signatures
(and their 'atime' attribute among other things) of the CAR repository\\
\\
# If the _md5_F_ already exists in the CAR repository - then the client simply
uses the URI of the existing copy as the resource to be downloaded on the
TaskTrackers\\
#- If the atime of _md5_F_ is older than 1 day, then the client updates the
atime (See #6)\\
\\
# If _md5_F_ does not exist in the CAR repository then Client uploads it to the
CAR repository using _md5_F_ as the name\\
\\
# The TaskTracker, on being requested to run a task requiring CAR resource
_md5_F_ checks whether _md5_F_ is localized.\\
#- If _md5_F_ is already localized - then nothing more needs to be done. the
localized version is used by the Task\\
#- If _md5_F_ is not localized - then its fetched from the CAR repository\\
\\
# A garbage collector (running on the server side - preferably the JT) scans
the CAR repository periodically looking for and deleting resources whose atime
is older than N days. This is similar to the TrashEmptier in the Namenode.\\
\\
# The number _N_ is configurable. The protocol guarantees that no job less than
_N-1_ days in length will have it's resources garbage collected before it
finishes (because of the update atime step in #3). In practice, the total size
of the CAR repository is likely to be very small (relative to other contents in
HDFS) and _N_ can be set to a very high number.
In this protocol - assuming that most jobs are using the same resources - the
vast majority of job submissions make only one file system call (to list the
CAR repository on the job client). Most task executions do not require any
calls to the file system (for purposes of localization). Note that uploads to
the CAR repository will also be rare (in steady state).
h4. Notes
# The garbage collection of localized resources on TaskTrackers happens the
same as today (for resource downloaded via distributed cache). In particular,
no synchronization is required between garbage collection of localized
resources and those of the backing URIs in hdfs.\\
\\
# In step #4 - in the v1 implementation, the client is responsible for
computing the md5. If the client is malicious - it can spoof the md5 (of
important jars) and upload malicious code thereby affecting the execution of
other clients.\\
\\
# In the v1 implementation - the CAR repository is implemented as a fixed
directory in HDFS. The clients must have write permission to the CAR directory
(to upload new resources into it). A malicious client can then delete or modify
resources before they are eligible for garbage collection - potentially
affecting running jobs.
The latter two issues can be solved by having a server side agent control the
addition and deletion of resources to the CAR repository. However this has not
been implemented in v1. The initial implementation only suffices for
environments that can make the assumption of non-malicious clients - but can be
extended to cover more security conscious use cases in the future (with the
attendant burden of more server side apis).
> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars
> and files (because they are using a framework like Hive/Pig) - the same jars
> keep getting uploaded and downloaded repeatedly. The overhead of this
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversantly - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not
> submit the same files over and again. Identifying and caching execution
> resources by a content signature (md5/sha) would be a good alternative to
> have available.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.