[ https://issues.apache.org/jira/browse/SAMZA-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Riccomini updated SAMZA-5:
--------------------------------

    Affects Version/s: 0.6.0
    
> Better support YARN 2.X
> -----------------------
>
>                 Key: SAMZA-5
>                 URL: https://issues.apache.org/jira/browse/SAMZA-5
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Chris Riccomini
>
> Something that's been bothering me recently is how we can support multiple 
> YARN 2.X clusters, all on different versions.
> At LinkedIn, we run integration and production YARN clusters for Samza on a 
> specific version of YARN, but have a test cluster running a different 
> (usually newer) version. For example, we might run integration and production 
> clusters on 2.0.5-alpha, but the test cluster on 2.1.0.
> The problem we have right now is that it's somewhat difficult to build a 
> binary that can run on all of these clusters. Samza packages are deployed 
> as a .tgz file with bin and lib directories. The lib directory contains 
> all jars (including Hadoop) that the job needs to run. The bin directory has 
> a run-class.sh script in it, which adds ../lib/*.jar to the classpath.
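> For reference, here's a minimal sketch of how run-class.sh builds that 
> classpath today (a sketch, not the literal script; Java's classpath 
> wildcard expands lib/* to all the jars in that directory):
>
>   base_dir=$(dirname "$0")/..
>   # every jar under lib/, including the packaged Hadoop/YARN jars,
>   # ends up on the classpath
>   exec java -cp "$base_dir/lib/*" "$@"
>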
> To complicate matters, YARN's 2.X versions are all backwards incompatible at 
> the protocol level: a job built against YARN 2.0.4-alpha can't run on a 
> 2.0.5-alpha cluster, and vice-versa. At the API level (for the APIs Samza 
> uses), 2.0.3-alpha, 
> 2.0.4-alpha, and 2.0.5-alpha are all compatible (i.e. you can compile and run 
> Samza with any one of these versions). I haven't tested 2.1.0-beta yet, but 
> I assume it's API compatible with 2.0.3-alpha and onward (this should be 
> verified, though). As a result, the incompatibility appears to be purely a 
> runtime issue: YARN clusters simply reject RPC calls from protobuf RPC 
> versions that aren't the same as the version the cluster is running 
> (2.0.5-alpha clusters reject all non-2.0.5-alpha RPC calls).
> Given this set of facts, right now you have to cross-build a Samza tarball 
> for each version of Hadoop that you're running on 
> (my-job-yarn-2.0.3-alpha.tgz, my-job-yarn-2.0.4-alpha.tgz, etc.). This is 
> kind of annoying and somewhat breaks reproducibility. Generally, testing on 
> one binary and running in production on another is kind of a bad idea, even 
> if they're theoretically exactly the same, minus the YARN jars.
> I was thinking a better solution might be to:
> 1. Change the YarnConfig/YarnJob/ClientHelper to allow arbitrary resources 
> with a config scheme like this: yarn.resource.<resource-name>.path=...
> 2. Change run-class.sh to add both ../lib/*.jar and ../.../hadoop-libs/*.jar 
> to the classpath (rough sketch below).
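> A rough sketch of what these two changes might look like (the resource name 
> "hadoop-libs" and all paths here are illustrative, not final):
>
>   # 1. attach the cluster-appropriate Hadoop jars as a named resource
>   --config yarn.resource.hadoop-libs.path=hdfs://.../hadoop-libs-2.0.5-alpha.tgz
>
>   # 2. in run-class.sh, put the job's jars and the un-tar'ed hadoop-libs
>   #    resource on the classpath (assuming it lands next to lib/)
>   base_dir=$(dirname "$0")/..
>   exec java -cp "$base_dir/lib/*:$base_dir/hadoop-libs/*" "$@"
>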
> This seems like a good change to me because:
> 1. It allows jobs to attach arbitrary resources in addition to hadoop-libs.
> 2. It allows a single binary to be run with multiple versions of Hadoop (vs. 
> one binary per Hadoop/YARN version).
> 3. If a non-manual (automated or CI-based) deployment mechanism runs 
> run-job.sh, it can auto-inject the appropriate version of YARN (using 
> --config yarn.resource.hadoop-libs.path=...) based on the YARN cluster it's 
> executing the job on.
> 4. If run-job.sh is being run by a user, they can use either the --config 
> approach (described in number 3) or a per-environment properties approach 
> (test.properties has a path to YARN 2.1.0-beta, and prod.properties has a 
> path to YARN 2.0.5-alpha); see the sketch after this list.
> 5. It means that rolling out a new version of YARN doesn't require a re-build 
> of every job package that runs on the cluster. You just need to update the 
> config path to point to the new Hadoop libs.
> 6. This would theoretically allow us to switch from a JSON/environment 
> variable config-passing approach to a distributed cache config-passing 
> approach (config is a file on HDFS, or HTTP, or wherever). Right now, config 
> is resolved at job-execution time (run-job.sh/JobRunner) into key/value 
> pairs and is passed to the AM/containers via an environment variable as a 
> JSON blob. This is kind of hacky, has size limits, and is probably not the 
> safest thing to do. A better (but perhaps less convenient) approach might be 
> to require config to be a file that gets attached and un-tar'ed along with 
> hadoop-libs and the job package. I haven't fully thought the file-based 
> config feature through, but the proposed change would allow us to support 
> it.
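> As a usage sketch for numbers 3 and 4 (paths here are made up for 
> illustration):
>
>   # number 3: deployment tooling injects the right Hadoop tarball per cluster
>   bin/run-job.sh \
>     --config yarn.resource.hadoop-libs.path=hdfs://.../hadoop-libs-2.0.5-alpha.tgz
>
>   # number 4: or carry it in per-environment config files instead, e.g.
>   #   test.properties: yarn.resource.hadoop-libs.path=<path to 2.1.0-beta libs>
>   #   prod.properties: yarn.resource.hadoop-libs.path=<path to 2.0.5-alpha libs>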

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
