I like the profile idea, Dan, mostly because while I believe we should do
our best to make adoption easier, we should still default to the actual use
case, where such pipelines will run on clusters.

On Fri, Jul 8, 2016 at 1:53 AM Dan Halperin <[email protected]>
wrote:

> Thanks Amit, that does clear things up!
>
> On Thu, Jul 7, 2016 at 3:30 PM, Amit Sela <[email protected]> wrote:
>
> > I don't think the Spark runner is special; it's just the way it has been
> > until now, and that's why I brought up the subject here.
> >
> > The main issue is that currently, if a user wants to write a Beam app
> > using the Spark runner, they'll have to provide the Spark dependencies
> > themselves, or they'll get a ClassNotFoundException (which is exactly the
> > case for beam-examples). This of course happens because the Spark runner
> > has a provided (non-transitive) dependency on Spark.
> >
>
> Having provided dependencies and making the user include them in their pom
> is pretty normal, I think. We already require users to provide an slf4j
> logger and Hamcrest+JUnit (if they use PAssert).
>     (We include all of these in the examples pom.xml:
> https://github.com/apache/incubator-beam/blob/master/examples/java/pom.xml#L286
> )
>
> I don't see any problem with a user who wants to use the Spark runner
> adding these provided deps to their pom (i.e., putting them as runtime
> deps, as in the examples pom.xml).
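>
> Roughly something like this in the user's pom (an untested sketch; the
> exact artifact ids and version are placeholders and depend on the Spark
> build being targeted):
>
>     <!-- Spark deps the Spark runner declares as provided; the user adds
>          them explicitly, here with runtime scope for local execution. -->
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-core_2.10</artifactId>
>       <version>1.6.2</version>
>       <scope>runtime</scope>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-streaming_2.10</artifactId>
>       <version>1.6.2</version>
>       <scope>runtime</scope>
>     </dependency>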
>
>
> > The Flink runner avoids this issue by having a compile dependency on
> > Flink, thus being transitive.
> >
> > By having the cluster provide them I mean that the Spark installation is
> > aware of the binaries pre-deployed on the cluster and adds them to the
> > classpath of the app submitted for execution on the cluster - this is
> > common (AFAIK) for Spark and Spark on YARN, and vendors provide similar
> > binaries, for example: spark-1.6_hadoop-2.4.0_hdp.xxx.jar (Hortonworks).
> >
>
> Makes sense. So a user submitting to a cluster would submit a jar and
> command-line
> options, and the cluster itself would add the provided deps.
>
>
> > Putting aside our (Beam) issues, the current artifact "beam-runners-spark"
> > is more suitable for running on clusters with pre-deployed binaries than
> > for a quick standalone execution with a single dependency that takes care
> > of everything (Spark related),
>
>
> Great!
>
>
> > but is more cumbersome for users trying to get
> > going for the first time, which is not good!
> >
>
> We should decide which experience we're trying to optimize for (I'd lean
> cluster), but I think we should update the examples pom.xml to support both.
>
> * With cluster mode as the default, we would add a profile for 'local' mode
>   (-PsparkIncluded or something) that overrides the provided deps to be
>   runtime deps instead (rough sketch below).
>
> * We can include switching on the profile for local mode in the "getting
>   started" instructions.
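>
> A minimal sketch of what that profile could look like in the examples
> pom.xml (untested; the scope-property trick and the Spark artifact/version
> are just illustrative):
>
>     <properties>
>       <!-- Default to cluster mode: Spark is provided by the cluster. -->
>       <spark.scope>provided</spark.scope>
>     </properties>
>
>     <dependencies>
>       <dependency>
>         <groupId>org.apache.spark</groupId>
>         <artifactId>spark-core_2.10</artifactId>
>         <version>1.6.2</version>
>         <scope>${spark.scope}</scope>
>       </dependency>
>     </dependencies>
>
>     <profiles>
>       <profile>
>         <!-- -PsparkIncluded: pull Spark onto the runtime classpath for
>              local execution. -->
>         <id>sparkIncluded</id>
>         <properties>
>           <spark.scope>runtime</spark.scope>
>         </properties>
>       </profile>
>     </profiles>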
>
> Dan
>
> > I guess Flink uses a compile dependency for the same reason Spark uses
> > provided - because it fits them - what about other runners?
> >
> > Hope this clarifies some of the questions here.
> >
> > Thanks,
> > Amit
> >
> > On Fri, Jul 8, 2016 at 12:52 AM Dan Halperin <[email protected]>
> > wrote:
> >
> > > Hey folks,
> > >
> > > In general, we should optimize for running on clusters rather than
> > > running locally. Examples is a runner-independent module, with
> > > non-compile-time deps on runners. Most runners are currently listed as
> > > runtime deps -- it sounds like that works for most cases, but might not
> > > be the best fit for Spark.
> > >
> > > Q: What does "dependencies being provided by the cluster" mean? I'm a
> > > little naive here, but how would a user submit a pipeline to a Spark
> > > cluster without actually depending on Spark in mvn? Is it not by running
> > > the main method of an example, as with all the other runners?
> > >
> > > I'd like to understand the above better, but suppose that to optimize
> > > for Spark-on-a-cluster we should default to provided deps in the
> > > examples. That would be fine -- but couldn't we just make a profile for
> > > local Spark that overrides the deps from provided to runtime?
> > >
> > > To summarize, I think we do not need new artifacts, but we could use a
> > > profile for local testing if absolutely necessary.
> > >
> > > Thanks,
> > > Dan
> > >
> > > On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <[email protected]>
> > > wrote:
> > >
> > > > Good discussion subject, Amit.
> > > >
> > > > I'll let the whole Beam distribution subject continue in BEAM-320;
> > > > however, there is a not-yet-discussed aspect of the Spark runner: the
> > > > Maven behavior.
> > > >
> > > > When you import the Beam Spark runner as a dependency, you are obliged
> > > > to provide your Spark dependencies by hand too. With the other runners,
> > > > once you import the runner everything just works, e.g. the
> > > > google-cloud-dataflow-runner and the flink-runner. I understand the
> > > > arguments for the current setup (the ones you mention), but I think it
> > > > is more user friendly to be consistent with the other runners and have
> > > > something that just works as the default (and this would solve the
> > > > examples issue as a consequence). Anyway, I think in the Spark case we
> > > > need both: a 'spark-included' flavor and the current one, which is
> > > > really useful when including the runner as a Spark library dependency
> > > > (like Jesse did in his video) or as a spark-package.
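> > > >
> > > > To make the two flavors concrete, a user's pom could look roughly like
> > > > this (a sketch only; 'beam-runners-spark-included' is a hypothetical
> > > > artifact id, and ${beam.version} is just a placeholder property):
> > > >
> > > >     <!-- Flavor 1: all-included, Spark bundled transitively. -->
> > > >     <dependency>
> > > >       <groupId>org.apache.beam</groupId>
> > > >       <artifactId>beam-runners-spark-included</artifactId>
> > > >       <version>${beam.version}</version>
> > > >     </dependency>
> > > >
> > > >     <!-- Flavor 2 (current): runner only, Spark supplied by the user
> > > >          or by the cluster at runtime. -->
> > > >     <dependency>
> > > >       <groupId>org.apache.beam</groupId>
> > > >       <artifactId>beam-runners-spark</artifactId>
> > > >       <version>${beam.version}</version>
> > > >     </dependency>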
> > > >
> > > > Actually, both the all-included and the runner-only flavors make sense
> > > > for Flink too, but this is a different discussion ;)
> > > >
> > > > What do you think about this? What do the others think?
> > > >
> > > > Ismaël
> > > >
> > > >
> > > > On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <[email protected]>
> > > > wrote:
> > > >
> > > > > No problem and good idea to discuss in the Jira.
> > > > >
> > > > > Actually, I started experimenting a bit with Beam distributions on a
> > > > > branch (which I can share with anyone interested).
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > >
> > > > > On 07/07/2016 10:12 PM, Amit Sela wrote:
> > > > >
> > > > >> Thanks JB, I've missed that one.
> > > > >>
> > > > >> I suggest we continue this in the ticket comments.
> > > > >>
> > > > >> Thanks,
> > > > >> Amit
> > > > >>
> > > > >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <[email protected]>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Amit,
> > > > >>>
> > > > >>> I think your proposal is related to:
> > > > >>>
> > > > >>> https://issues.apache.org/jira/browse/BEAM-320
> > > > >>>
> > > > >>> As described in the Jira, what I'm planning to provide (in
> > > > >>> dedicated Maven modules) is a Beam distribution including:
> > > > >>> - an uber jar to wrap the dependencies (rough sketch below)
> > > > >>> - the underlying runtime backends
> > > > >>> - etc
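> > > > >>>
> > > > >>> For the uber jar part, one possible shape for such a module (just
> > > > >>> a sketch of the usual maven-shade-plugin setup; the actual
> > > > >>> distribution module may well differ) would be:
> > > > >>>
> > > > >>>     <build>
> > > > >>>       <plugins>
> > > > >>>         <plugin>
> > > > >>>           <groupId>org.apache.maven.plugins</groupId>
> > > > >>>           <artifactId>maven-shade-plugin</artifactId>
> > > > >>>           <executions>
> > > > >>>             <execution>
> > > > >>>               <phase>package</phase>
> > > > >>>               <goals>
> > > > >>>                 <goal>shade</goal>
> > > > >>>               </goals>
> > > > >>>               <configuration>
> > > > >>>                 <transformers>
> > > > >>>                   <!-- Merge META-INF/services files so service
> > > > >>>                        loaders keep working in the uber jar. -->
> > > > >>>                   <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
> > > > >>>                 </transformers>
> > > > >>>               </configuration>
> > > > >>>             </execution>
> > > > >>>           </executions>
> > > > >>>         </plugin>
> > > > >>>       </plugins>
> > > > >>>     </build>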
> > > > >>>
> > > > >>> Regards
> > > > >>> JB
> > > > >>>
> > > > >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> > > > >>>
> > > > >>>> Hi everyone,
> > > > >>>>
> > > > >>>> Lately I've encountered a number of issues concerning the fact
> > > > >>>> that the Spark runner does not package Spark along with it,
> > > > >>>> forcing people to do this on their own.
> > > > >>>> In addition, this seems to get in the way of having beam-examples
> > > > >>>> executed against the Spark runner, again because it would have to
> > > > >>>> add the Spark dependencies.
> > > > >>>>
> > > > >>>> When running on a cluster (which I guess was the original goal
> > > > >>>> here), it is recommended to have Spark provided by the cluster -
> > > > >>>> this makes sense for Spark clusters and more so for Spark + YARN
> > > > >>>> clusters, where you might have your Spark built against a specific
> > > > >>>> Hadoop version or using a vendor distribution.
> > > > >>>>
> > > > >>>> In order to make the runner more accessible to new adopters, I
> > > > >>>> suggest we consider releasing a "spark-included" artifact as well.
> > > > >>>>
> > > > >>>> Thoughts?
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Amit
> > > > >>>>
> > > > >>>>
> > > > >>> --
> > > > >>> Jean-Baptiste Onofré
> > > > >>> [email protected]
> > > > >>> http://blog.nanthrax.net
> > > > >>> Talend - http://www.talend.com
> > > > >>>
> > > > >>>
> > > > >>
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > [email protected]
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>
