[ 
https://issues.apache.org/jira/browse/SAMZA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056771#comment-14056771
 ] 

Chris Riccomini commented on SAMZA-307:
---------------------------------------

bq. call some Java code to upload the assembly (all the jars Samza needs, 
already compiled) and the user's job jar (which changes frequently) to HDFS

Isn't this what [hdfs dfs 
-put|http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#put] is for?

In my mind, the process is:

# Build your job package.
# Run `hdfs dfs -put` to put your package into HDFS. You supply the directory.
# Run run-job.sh with yarn.package.path=<the HDFS path you just gave to 
`hdfs dfs -put`> (see the sketch below).
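
For concreteness, a rough sketch of those steps as shell commands. The package 
name, HDFS directory, and NameNode address are placeholders, and I'm assuming 
yarn.package.path is set in the job's properties file (as in the current 
tutorials):

{code}
# 1. Build the job package (example: a Maven-built .tar.gz)
mvn clean package

# 2. Put the package into an HDFS directory of your choosing
hdfs dfs -mkdir -p /apps/samza/my-job
hdfs dfs -put target/my-job-0.1.0-dist.tar.gz /apps/samza/my-job/

# 3. Run the job; yarn.package.path in my-job.properties points at the file
#    from step 2, e.g.:
#    yarn.package.path=hdfs://namenode:8020/apps/samza/my-job/my-job-0.1.0-dist.tar.gz
bin/run-job.sh \
  --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
  --config-path=file://$PWD/config/my-job.properties
{code}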

bq. Will the job package in HDFS get cleaned up after a job is shut down?

I don't think so. If we follow the flow I describe above, it's up to the user 
to clean up after themselves. There's no really clean way of doing this that I 
know of. If the NM were to do it, you'd have to re-put your binary after every 
successful execution, which isn't ideal.
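
If you do want to reclaim the space once a job is permanently retired, it's a 
single manual command (the path is a placeholder):

{code}
hdfs dfs -rm -r /apps/samza/my-job
{code}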

bq. is it possible for Samza to run if we upload two separate packages to one 
HDFS folder, one with the Samza system jars and one with the user's jar?

Yes, it is possible. You can add multiple resources to a YARN job. We just add 
one right now (the job tgz). It is possible to add a second resource (the YARN 
runtime). It is also possible to have multiple runtimes available for the user 
to pick from. As Martin suggests, this DOES lead to the potential for runtime 
errors, since a user might pick the wrong runtime. It also might lead to issues 
where there are some Samza JARs in the user's job package that get intermingled 
with different versions of the same JAR from the runtime package. In such a 
case, no dependency resolution occurs, and you'll probably get method-not-found 
junk, etc.

The positive thing about splitting the dependencies is that it reduces the job 
package size dramatically, since each job no longer bundles all of Samza, Kafka, 
ZK, and YARN in its package; the runtime is bundled just once per Samza version.
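
As a sketch of what that split might look like in HDFS (directory and file 
names are made up for illustration; today Samza localizes only the single job 
tgz):

{code}
# The shared runtime is uploaded once per Samza version...
hdfs dfs -put samza-runtime-0.7.0.tgz /apps/samza/runtime/

# ...and each job then only ships its own, much smaller, package
hdfs dfs -put my-job-0.1.0.tar.gz /apps/samza/jobs/my-job/
hdfs dfs -put other-job-2.3.0.tar.gz /apps/samza/jobs/other-job/
{code}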

Half-baked idea: support both. If we allow users to attach arbitrary resources 
to their Samza job (discussed in SAMZA-5), and tweak run-class.sh to support 
arbitrary classpaths in some way, then there's nothing stopping a user from 
attaching an HDFS URI pointing at a YARN runtime. If they don't, then they'd 
just have to make sure everything they need to run their job is located in 
their job's tarball (as it is currently). This approach would also allow for 
attaching other resources that might be in HDFS, which is useful for moving 
around static data sets, etc.
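
To make the half-baked idea a bit more concrete, the run-class.sh tweak could be 
as small as honoring an extra classpath variable. This is purely hypothetical; 
ADDITIONAL_CLASSPATH is an invented name:

{code}
# Hypothetical excerpt from run-class.sh: build the classpath from the job
# package's lib/ dir as today, then append anything the user attached
# (e.g. a localized runtime package), if they set ADDITIONAL_CLASSPATH.
base_dir=$(cd "$(dirname "$0")/.." && pwd)
CLASSPATH="$base_dir/lib/*"
if [ -n "$ADDITIONAL_CLASSPATH" ]; then
  CLASSPATH="$CLASSPATH:$ADDITIONAL_CLASSPATH"
fi
exec java -cp "$CLASSPATH" "$@"
{code}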

bq. However, if I understand correctly, a Samza job still can't run on a 
secured YARN cluster, right?

Correct. No effort has been made to run Samza in a secure YARN grid because 
we're blocked on long-running service issues in secure YARN (e.g. security 
tokens expire after a week).

> Simplify YARN deploy procedure 
> -------------------------------
>
>                 Key: SAMZA-307
>                 URL: https://issues.apache.org/jira/browse/SAMZA-307
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Yan Fang
>
> Currently, we have two ways of deploying a Samza job to a YARN cluster, from 
> [HDFS|https://samza.incubator.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html]
>  and [HTTP|https://samza.incubator.apache.org/learn/tutorials/0.7.0/run-in-multi-node-yarn.html],
>  but neither of them works out of the box. Users have to go through the tutorial, 
> add dependencies, recompile, put the job package on HDFS or an HTTP server, and 
> then finally run. I feel it is a little cumbersome sometimes. We may be able to 
> provide a simpler way to deploy the job.
> When users have YARN and HDFS in the same cluster (such as CDH5), we can 
> provide a job-submit script which does:
> 1. take the cluster configuration
> 2. call some Java code to upload the assembly (all the jars Samza needs, 
> already compiled) and the user's job jar (which changes frequently) to HDFS
> 3. run the job as usual. 
> Therefore, users only need to run a single command *instead of*:
> 1. going step by step from the tutorial during their first job
> 2. assembling all code and uploading to HDFS manually every time they make 
> changes to their job. 
> (Yes, I learnt it from [Spark's Yarn 
> deploy|http://spark.apache.org/docs/latest/running-on-yarn.html] and [their 
> code|https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala]
>  ) 
> When users only have YARN, I think they have no choice but to start the HTTP 
> server as in the tutorial. 
> What do you think? Does the simplification make sense? Or will Samza have 
> some difficulties (issues) if we deploy this way? Thank you.
>  



