[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

Joydeep Sen Sarma (JIRA) Mon, 16 Aug 2010 02:34:54 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898835#action_12898835
 ]


Joydeep Sen Sarma commented on MAPREDUCE-1901:
----------------------------------------------

the proposal is below - rephrases some of the discussions above, addresses some 
of the comments around race conditions and points out limitations. Junjie will 
post a patch tomorrow (which probably needs some more work).

h4. Background

Hadoop map-reduce jobs commonly require jars, executables, archives and other 
resources for task execution on hadoop cluster nodes. A common deployment 
pattern for Hadoop applications is that the required resources are deployed 
centrally by administrators (either on a shared file system or deployed on 
standard local file system paths by package management tools). Users launch 
Hadoop jobs from these installation points. Applications use apis 
(-libjars/files/archives) provided by Hadoop to upload resources (from the 
installation point) so that they are made available for task execution. This 
behavior makes deployment of Hadoop applications very easy (just use standard 
package management tools).

As an example, Facebook has a few different Hive installations (of different 
versions) deployed on NFS filer. Each has a multitude of jar files - with only 
some differing across different Hive versions. Users also maintain a repository 
of map-reduce scripts and jar files contain Hive extensions (user defined 
functions) on a NFS filer. Any installation of Hive can be used to execute jobs 
against any of multiple map-reduce clusters. Most of the jar files are also 
required locally (by the Hive client) - either for query compilation or for 
local execution (either in hadoop local mode or for some special types of 
queries).

h4. Problems

With the above arrangement - each (non local-mode) Hadoop job will upload all 
the required jar files into HDFS. TaskTrackers will download these jars from 
HDFS (at most once per job) and check modification times of downloaded files 
(second task onwards) The following overheads are observed:

- Job submission latency is impacted because of the need to serially upload 
multiple jar files into HDFS. At Facebook - we typically see 5-6 seconds of 
pause in this stage (depends on how responsive DFS is on a given day)
- There is some latency in setting up the first task as resources must be 
downloaded from HDFS. We have typically observed this to be around 2-3 seconds 
at Facebook. 
- For subsequent tasks - the latency impact is not as high - but the mtime 
check adds to general Namenode pressure.

h4. Observations

- jars and other resources are shared across different jobs and users. there 
are, in fact, hardly any resources that are not shared.
- these resources are meant to be immutable 

We would like to use these properties to solve some of the overheads in the 
current protocol while retaining the simplicity of the deployment model that 
exists today.

h4. General Approach

We would like to introduce (for lack of a better term) the notion of Content 
Addressible Resources (CAR) that are stored in a central repository in HDFS:

# CAR jars/files/archives are be identified by their content (for example - 
named using their md5 checksum).\\
\\
    This allows different jobs to share resources. Each Job can find out 
whether the resources required by it are already available in HDFS (by 
comparing the md5 signatures of their resources against the contents in the CAR 
repository).\\
\\
# Content Addressible resources (once uploaded) are immutable. They can only be 
garbage collected (by server side daemons).
    This allows TaskTrackers to skip mtime checks on such resources.

The CAR functionality is exposed to clients in two ways:

- a boolean configuration option (defaulting to false) to indicate that 
resources added via -libjars/files/archives options are content addressible
- enhancing the Distributed Cache api to mark specific files/archives as CAR 
(similar to how specific files can be marked public)

h4. Protocol

Assume a jobclient has a CAR file _F_ on local disk to be uploaded for task 
execution. Here's approximately a trace of what happens from the beginning of 
the job to it's end:

# Client computes the md5 signature of F (= _md5_F_)
#- One can additionally provide an option to skip this step - the md5 can be 
precomputed and stored in a file named _F.md5_ stored alongside _F_. The client 
can look for and use the contents of this file as the md5 sum.\\
\\
# The client fetches (in a single filesystem call) the list of md5 signatures 
(and their 'atime' attribute among other things) of the CAR repository\\
\\
# If the _md5_F_ already exists in the CAR repository - then the client simply 
uses the URI of the existing copy as the resource to be downloaded on the 
TaskTrackers\\
#- If the atime of _md5_F_ is older than 1 day, then the client updates the 
atime (See #6)\\
\\
# If _md5_F_ does not exist in the CAR repository then Client uploads it to the 
CAR repository using _md5_F_ as the name\\
\\
# The TaskTracker, on being requested to run a task requiring CAR resource 
_md5_F_ checks whether _md5_F_ is localized.\\
#- If _md5_F_ is already localized - then nothing more needs to be done. the 
localized version is used by the Task\\
#- If _md5_F_ is not localized - then its fetched from the CAR repository\\
\\
#  A garbage collector (running on the server side - preferably the JT) scans 
the CAR repository periodically looking for and deleting resources whose atime 
is older than N days. This is similar to the TrashEmptier in the Namenode.\\
\\
# The number _N_ is configurable. The protocol guarantees that no job less than 
_N-1_ days in length will have it's resources garbage collected before it 
finishes (because of the update atime step in #3). In practice, the total size 
of the CAR repository is likely to be very small (relative to other contents in 
HDFS) and _N_ can be set to a very high number. 

In this protocol - assuming that most jobs are using the same resources - the 
vast majority of job submissions make only one file system call (to list the 
CAR repository on the job client). Most task executions do not require any 
calls to the file system (for purposes of localization). Note that uploads to 
the CAR repository will also be rare (in steady state).

h4. Notes

# The garbage collection of localized resources on TaskTrackers happens the 
same as today (for resource downloaded via distributed cache).  In particular, 
no synchronization is required between garbage collection of localized 
resources and those of the backing URIs in hdfs.\\
\\
# In step #4 - in the v1 implementation, the client is responsible for 
computing the md5. If the client is malicious - it can spoof the md5 (of 
important jars) and upload malicious code thereby affecting the execution of 
other clients.\\
\\
# In the v1 implementation - the CAR repository is implemented as a fixed 
directory in HDFS. The clients must have write permission to the CAR directory 
(to upload new resources into it). A malicious client can then delete or modify 
resources before they are eligible for garbage collection - potentially 
affecting running jobs.

The latter two issues can be solved by having a server side agent control the 
addition and deletion of resources to the CAR repository. However this has not 
been implemented in v1. The initial implementation only suffices for 
environments that can make the assumption of non-malicious clients - but can be 
extended to cover more security conscious use cases in the future (with the 
attendant burden of more server side apis).

> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversantly - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again

Reply via email to