[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898956#action_12898956
 ] 

M. C. Srivas commented on MAPREDUCE-1901:
-----------------------------------------

Content-addressable storage is one way to solve this problem, but it seems like an 
extremely heavy-weight approach:
   1. more processing to do whenever a file is added to the file-system
   2. reliability issues getting the signature to match the contents across 
failures/re-replication/etc
   3. a repository of signatures in HDFS is yet another single point of 
failure, and yet another database that needs to be maintained (recovery code to 
handle "no-data-corruption" on a reboot, scaling it as more files are added, 
backup/restore, HA, etc)

Looks like there are a variety of simpler approaches possible, a few of which come 
to mind immediately, and are listed below in increasing order of complexity.

  1. use distcp or something similar to copy the files onto local disk whenever 
there is a new version of Hive released, and set pathnames to that. That is, 
different versions of a set of files are kept in different directories, and 
pathnames are used to distinguish them. For example, we do not do an md5 check 
of "/bin/ls" every time we need to run it. We set our PATH appropriately. 
If there is a different version of "ls" we prefer to use, say, in 
"/my/local/bin", then we get that by putting /my/local/bin ahead of other 
directories in our PATH.

  2. instead of implementing a bulk "getSignatures" call to replace several 
"get_mtime" calls, why not implement a bulk "get_mtime" instead?

  3. use a model like AFS with callbacks to implement an on-disk cache that 
survives reboots (Dhruba knows AFS very well). In other words, the client 
acquires a callback from the name-node for each file it has cached, and HDFS 
guarantees it will notify the client when the file is deleted or changed (at 
which point, the callback is revoked and the client must re-fetch the file). 
The callback lasts for, say, 1 week, and can be persisted on disk.  On a 
name-node reboot, the client is responsible for re-establishing the callbacks 
it already has (akin to a block-report). The client can also choose to return 
callbacks, in order to keep the memory requirements on the name-node to a 
minimum.  No repository of signatures is needed.
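
To make the callback idea in (3) concrete, here is a minimal in-memory sketch 
in Python. It is only an illustration under stated assumptions, not HDFS or AFS 
code: NameNodeStub, CachingClient and all of their methods are hypothetical 
names, the "on-disk" cache is a plain dict, and notification is a direct method 
call rather than an RPC. It also ignores changes that happen between a 
name-node reboot and the client re-establishing its callbacks.

```python
import time

WEEK_SECONDS = 7 * 24 * 3600  # the "say, 1 week" callback lifetime


class NameNodeStub:
    """Hypothetical stand-in for the name-node side of the callback protocol."""

    def __init__(self):
        self.files = {}      # path -> current contents
        self.callbacks = {}  # path -> {client: expiry time}

    def write(self, path, data):
        """Create or change a file, then revoke every callback held on it."""
        self.files[path] = data
        for client in list(self.callbacks.pop(path, {})):
            client.revoke(path)

    def fetch(self, path, client, expiry):
        """Return the contents and record a callback for this client."""
        self.callbacks.setdefault(path, {})[client] = expiry
        return self.files[path]

    def reestablish(self, path, client, expiry):
        """Re-register a callback the client already holds (after a reboot)."""
        self.callbacks.setdefault(path, {})[client] = expiry

    def release(self, path, client):
        """The client returns a callback it no longer needs."""
        self.callbacks.get(path, {}).pop(client, None)

    def reboot(self):
        """Callbacks are kept in memory only, so a reboot forgets them."""
        self.callbacks.clear()


class CachingClient:
    """Hypothetical client whose cache entries are guarded by callbacks."""

    def __init__(self, namenode, lease_seconds=WEEK_SECONDS):
        self.namenode = namenode
        self.lease_seconds = lease_seconds
        self.cache = {}   # path -> (data, expiry); stands in for the disk cache
        self.fetches = 0  # number of round trips to the name-node

    def read(self, path, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(path)
        if entry is not None and entry[1] > now:
            return entry[0]  # callback still valid: serve from the cache
        expiry = now + self.lease_seconds
        data = self.namenode.fetch(path, self, expiry)
        self.cache[path] = (data, expiry)
        self.fetches += 1
        return data

    def revoke(self, path):
        """Called by the name-node when the file is changed or deleted."""
        self.cache.pop(path, None)

    def reestablish_callbacks(self, now=None):
        """After a name-node reboot, re-register unexpired callbacks."""
        now = time.time() if now is None else now
        for path, (_, expiry) in self.cache.items():
            if expiry > now:
                self.namenode.reestablish(path, self, expiry)

    def return_callback(self, path):
        """Drop a cached file and free the name-node's memory for it."""
        self.cache.pop(path, None)
        self.namenode.release(path, self)
```

In this sketch, repeated read() calls for an unchanged jar cost one fetch in 
total, and a write() on the name-node side forces exactly one re-fetch. No 
signature is computed or stored anywhere.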


> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources 
> (jars/files/archives) to a new location in HDFS. Map-reduce nodes involved in 
> executing this job would then download these resources into local disk.
> In an environment where most of the users are using a standard set of jars 
> and files (because they are using a framework like Hive/Pig) - the same jars 
> keep getting uploaded and downloaded repeatedly. The overhead of this 
> protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and conversely - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in 
> part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not 
> submit the same files over and again. Identifying and caching execution 
> resources by a content signature (md5/sha) would be a good alternative to 
> have available.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
