[ https://issues.apache.org/jira/browse/SAMZA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056771#comment-14056771 ]
Chris Riccomini commented on SAMZA-307:
---------------------------------------

bq. call some Java code to upload the assembly (all the jars Samza needs, already compiled) and the user's job jar (which changes frequently) to HDFS

Isn't this what [hadoop fs -put|http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#put] is for? In my mind, the process is:

# Build your job package.
# Run `hadoop fs -put` to put your package into HDFS. You supply the directory.
# Run run-job.sh --yarn.package.path=<the directory you just gave to hadoop fs -put>

bq. Will the job package in HDFS get cleaned up after a job is shut down?

I don't think so. If we follow the flow I describe above, it's up to the user to clean up after themselves. There's no really clean way of doing this that I know of. If the NodeManager were to do it, then you'd have to re-put your binary after every successful execution, which isn't ideal.

bq. is it possible for Samza to run if we upload two separate packages, one is Samza system jars and one is user's jar, to one folder of HDFS?

Yes, it is possible. You can add multiple resources to a YARN job. We just add one right now (the job tgz). It is possible to add a second resource (the YARN runtime); a sketch of the YARN API calls involved appears at the end of this comment. It is also possible to have multiple runtimes available for the user to pick from. As Martin suggests, this DOES lead to the potential for runtime errors, since a user might pick the wrong runtime. It might also lead to issues where Samza JARs in the user's job package get intermingled with different versions of the same JARs from the runtime package. In such a case, no dependency resolution occurs, and you'll probably get method-not-found errors (NoSuchMethodError and the like). The positive thing about splitting the dependencies is that it reduces the job package size dramatically, since no one has to bundle all of Samza, Kafka, ZK, and YARN in their job package; the runtime is bundled once per Samza version.

Half-baked idea: support both. If we allow users to attach arbitrary resources to their Samza job (discussed in SAMZA-5), and tweak run-class.sh to support arbitrary classpaths in some way, then there's nothing stopping a user from attaching an HDFS URI pointing at a YARN runtime. If they don't, they'd just have to make sure everything they need to run their job is located in their job's tarball (as it is currently). This approach would also allow attaching other resources that might be in HDFS, which is useful for moving around static data sets, etc.

bq. However, if I understand correctly, Samza job can still not run on a secured YARN cluster, right?

Correct. No effort has been made to run Samza on a secure YARN grid, because we're blocked on long-running service issues in secure YARN (e.g. security tokens expire after a week).
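To make the "second resource" idea above concrete, here is a minimal sketch, assuming the Hadoop 2.x YARN client API. The HDFS paths and the resource names (__runtime, __package) are made up for illustration and are not what Samza does today:

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class TwoResourceSketch {
  // Describe a package that already sits in HDFS as a YARN LocalResource.
  static LocalResource resource(FileSystem fs, Path path) throws Exception {
    FileStatus status = fs.getFileStatus(path);
    LocalResource res = Records.newRecord(LocalResource.class);
    res.setResource(ConverterUtils.getYarnUrlFromPath(path));
    res.setSize(status.getLen());
    res.setTimestamp(status.getModificationTime());
    res.setType(LocalResourceType.ARCHIVE); // the NodeManager unpacks tgz archives for us
    res.setVisibility(LocalResourceVisibility.APPLICATION);
    return res;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical locations: a shared Samza runtime plus a per-job package.
    Map<String, LocalResource> resources = new HashMap<String, LocalResource>();
    resources.put("__runtime", resource(fs, new Path("hdfs:///samza/runtime/samza-runtime-0.7.0.tgz")));
    resources.put("__package", resource(fs, new Path("hdfs:///samza/jobs/my-job-1.0.tgz")));

    // The map is then handed to the container launch context, e.g.:
    //   ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    //   ctx.setLocalResources(resources);
  }
}
{code}

The NodeManager localizes each map entry into the container's working directory, so run-class.sh could build its classpath from both the unpacked runtime and the unpacked job package.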
> Simplify YARN deploy procedure
> ------------------------------
>
>            Key: SAMZA-307
>            URL: https://issues.apache.org/jira/browse/SAMZA-307
>        Project: Samza
>     Issue Type: Improvement
>       Reporter: Yan Fang
>
> Currently, we have two ways of deploying a Samza job to a YARN cluster, from [HDFS|https://samza.incubator.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html] and [HTTP|https://samza.incubator.apache.org/learn/tutorials/0.7.0/run-in-multi-node-yarn.html], but neither of them works out of the box. Users have to go through the tutorial, add dependencies, recompile, put the job package into HDFS or onto an HTTP server, and then finally run. This feels a little cumbersome. We may be able to provide a simpler way to deploy the job.
>
> When users have YARN and HDFS in the same cluster (such as CDH5), we can provide a job-submit script which does:
> 1. take the cluster configuration
> 2. call some Java code to upload the assembly (all the jars Samza needs, already compiled) and the user's job jar (which changes frequently) to HDFS (a rough sketch of this step appears below)
> 3. run the job as usual.
> Therefore, users only need to run one command line *instead of*:
> 1. going step by step through the tutorial for their first job
> 2. assembling all the code and uploading it to HDFS manually every time they make changes to their job.
> (Yes, I learned this from [Spark's YARN deploy|http://spark.apache.org/docs/latest/running-on-yarn.html] and [their code|https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala].)
> When users only have YARN, I think they have no choice but to start the HTTP server as in the tutorial.
> What do you think? Does the simplification make sense? Or would Samza have difficulties (issues) if we did the deploy this way? Thank you.
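To make step 2 of the proposed script concrete, here is a minimal sketch of the upload helper, assuming the Hadoop FileSystem API; the class name, argument layout, and HDFS paths are hypothetical:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper a job-submit script could call before invoking run-job.sh.
public class JobPackageUploader {
  public static void main(String[] args) throws Exception {
    // args[0]: local path of the pre-built Samza assembly tgz (rarely changes)
    // args[1]: local path of the user's job jar/tgz (changes frequently)
    // args[2]: target HDFS directory, e.g. hdfs:///samza/jobs/my-job
    Configuration conf = new Configuration(); // picks up the cluster's core-site.xml/hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path target = new Path(args[2]);
    fs.mkdirs(target);

    // Overwrite any previous copies so re-submitting a changed job "just works".
    fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, new Path(args[0]), target);
    fs.copyFromLocalFile(false, true, new Path(args[1]), target);
  }
}
{code}

The script would then set yarn.package.path to the uploaded location and invoke run-job.sh as in step 3.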