[
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898956#action_12898956
]
M. C. Srivas commented on MAPREDUCE-1901:
-----------------------------------------
Content addressing is one way to solve this problem, but it seems like an
extremely heavyweight approach:
1. more processing to do whenever a file is added to the file-system
2. reliability issues getting the signature to match the contents across
failures/re-replication/etc
3. a repository of signatures in HDFS is yet another single point of
failure, and yet another database that needs to be maintained (recovery code to
handle "no-data-corruption" on a reboot, scaling it as more files are added,
backup/restore, HA, etc.)
It looks like a variety of simpler approaches are possible. A few that come
to mind immediately are listed below in increasing order of complexity.
1. use distcp or something similar to copy the files onto local disk whenever
a new version of Hive is released, and point pathnames at that copy. That is,
different versions of a set of files are kept in different directories, and
pathnames are used to distinguish them. For example, we do not do an md5 check
of "/bin/ls" every time we need to run it. We set our search path appropriately.
If there is a different version of "ls" we prefer to use, say, in
"/my/local/bin", then we get that by putting /my/local/bin ahead of other
directories in our search path.
2. instead of implementing a bulk "getSignatures" call to replace several
"get_mtime" calls, why not implement a bulk get_mtime instead?
3. use a model like AFS with callbacks to implement an on-disk cache that
survives reboots (Dhruba knows AFS very well). In other words, the client
acquires a callback from the name-node for each file it has cached, and HDFS
guarantees it will notify the client when the file is deleted or changed (at
which point, the callback is revoked and the client must re-fetch the file).
The callback lasts for, say, 1 week, and can be persisted on disk. On a
name-node reboot, the client is responsible for re-establishing the callbacks
it already has (akin to a block-report). The client can also choose to return
callbacks, in order to keep the memory requirements on the name-node to a
minimum. No repository of signatures is needed.
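Approach 2 above (a bulk get_mtime) could look roughly like the sketch below.
This is only an illustration, not an HDFS API: `bulk_get_mtime` and
`stale_entries` are hypothetical names, and local `os.stat` calls stand in for
what would be a single round trip to the name-node.

```python
import os
import tempfile
from typing import Dict, Iterable, List, Optional


def bulk_get_mtime(paths: Iterable[str]) -> Dict[str, Optional[int]]:
    """Hypothetical bulk call: one request returns the mtime of many paths.

    Local os.stat stands in for the name-node lookup (an assumption made
    for illustration). A missing file maps to None.
    """
    result: Dict[str, Optional[int]] = {}
    for path in paths:
        try:
            result[path] = os.stat(path).st_mtime_ns
        except FileNotFoundError:
            result[path] = None
    return result


def stale_entries(cached: Dict[str, int]) -> List[str]:
    """Return the cached paths whose current mtime differs from the recorded one."""
    current = bulk_get_mtime(cached.keys())
    return sorted(p for p, mtime in cached.items() if current[p] != mtime)
```

A client would record the mtime of each resource when it caches it, then call
`stale_entries` once per job submission and re-fetch only the paths returned.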
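The callback flow in approach 3 above can be sketched as a small in-memory
simulation. Everything in it is hypothetical: the `NameNode` and
`CachingClient` classes and their method names are illustrative and are not
HDFS or AFS APIs, and the lease length simply follows the "say, 1 week"
figure. It only shows the acquire, revoke-on-change, return, and
re-establish-after-reboot steps.

```python
from typing import Dict, Set

WEEK_SECONDS = 7 * 24 * 3600  # lease length, following the "say, 1 week" example


class NameNode:
    """Toy name-node that tracks which clients hold a callback on each path."""

    def __init__(self) -> None:
        self.callbacks: Dict[str, Set["CachingClient"]] = {}

    def acquire(self, client: "CachingClient", path: str, now: float) -> float:
        """Register a callback for the client and return its expiry time."""
        self.callbacks.setdefault(path, set()).add(client)
        return now + WEEK_SECONDS

    def release(self, client: "CachingClient", path: str) -> None:
        """The client returns a callback to reduce name-node memory use."""
        self.callbacks.get(path, set()).discard(client)

    def file_changed(self, path: str) -> None:
        """On change or delete, revoke every callback held on the path."""
        for client in self.callbacks.pop(path, set()):
            client.revoke(path)

    def reboot(self) -> None:
        """Callback state lives in memory only, so a reboot loses it."""
        self.callbacks.clear()


class CachingClient:
    """Toy client whose lease table could be persisted next to its disk cache."""

    def __init__(self, namenode: NameNode) -> None:
        self.namenode = namenode
        self.leases: Dict[str, float] = {}  # path -> expiry time

    def cache(self, path: str, now: float) -> None:
        self.leases[path] = self.namenode.acquire(self, path, now)

    def revoke(self, path: str) -> None:
        """Called by the name-node; the cached copy must be re-fetched."""
        self.leases.pop(path, None)

    def is_valid(self, path: str, now: float) -> bool:
        expiry = self.leases.get(path)
        return expiry is not None and now < expiry

    def reestablish(self, now: float) -> None:
        """After a name-node reboot, re-register unexpired callbacks."""
        for path, expiry in list(self.leases.items()):
            if now < expiry:
                # Keep the original expiry; only the registration is restored.
                self.namenode.acquire(self, path, now)
            else:
                del self.leases[path]
```

In this sketch the client checks `is_valid` before using a cached file and
talks to the name-node only when the callback was revoked or has expired.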
> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
> Key: MAPREDUCE-1901
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Joydeep Sen Sarma
> Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars
> and files (because they are using a framework like Hive/Pig) - the same jars
> keep getting uploaded and downloaded repeatedly. The overhead of this
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely, large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not
> submit the same files over and over again. Identifying and caching execution
> resources by a content signature (md5/sha) would be a good alternative to
> have available.