hey folks,

In general, we should optimize for running on clusters rather than running
locally.

Examples is a runner-independent module, with non-compile-time deps on the
runners. Most runners are currently listed as runtime deps -- it sounds like
that works for most cases, but might not be the best fit for Spark.
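
For reference, roughly what that setup looks like in Maven terms: the examples
module wires a runner in with runtime scope (so the examples code never
compiles against it), while Spark itself would be expected from the cluster via
provided scope. The group/artifact IDs below are illustrative rather than
copied from the actual poms, and versions are assumed to come from
dependencyManagement:

  <!-- runner available at run time only -->
  <dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-spark</artifactId>
    <scope>runtime</scope>
  </dependency>

  <!-- Spark supplied by the cluster at submit time -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <scope>provided</scope>
  </dependency>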
Q: What does "dependencies being provided by the cluster" mean? I'm a little
naive here, but how would a user submit a pipeline to a Spark cluster without
actually depending on Spark in mvn? Is it not by running the main method in an
example, like in all the other runners?

I'd like to understand the above better, but suppose that to optimize for
Spark-on-a-cluster we should default to provided deps in the examples. That
would be fine -- but couldn't we just make a profile for local Spark that
overrides the deps from provided to runtime? (A sketch of such a profile
appears after the quoted thread below.)

To summarize, I think we do not need new artifacts, but we could use a profile
for local testing if absolutely necessary.

Thanks,
Dan

On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <[email protected]> wrote:

> Good discussion subject Amit,
>
> I'll let the broader Beam distribution discussion continue in BEAM-320;
> however, there is a not-yet-discussed aspect of the Spark runner: the
> Maven behavior.
>
> When you import the Beam Spark runner as a dependency you are obliged to
> provide your Spark dependencies by hand too, whereas with the other
> runners everything just works once you import the runner, e.g.
> google-cloud-dataflow-runner and flink-runner. I understand the arguments
> for the current setup (the ones you mention), but I think it is more
> user-friendly to be consistent with the other runners and have something
> that just works as the default (and solve the examples issue as a
> consequence). Anyway, I think in the Spark case we need both: a
> 'spark-included' flavor and the current one, which is really useful for
> including the runner as a Spark library dependency (like Jesse did in his
> video) or as a spark-package.
>
> Actually both the all-included and the runner-only artifacts make sense
> for Flink too, but that is a different discussion ;)
>
> What do you think about this? What do the others think?
>
> Ismaël
>
>
> On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > No problem, and good idea to discuss in the Jira.
> >
> > Actually, I started to experiment a bit with Beam distributions on a
> > branch (which I can share with people interested).
> >
> > Regards
> > JB
> >
> >
> > On 07/07/2016 10:12 PM, Amit Sela wrote:
> >
> >> Thanks JB, I'd missed that one.
> >>
> >> I suggest we continue this in the ticket comments.
> >>
> >> Thanks,
> >> Amit
> >>
> >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <[email protected]>
> >> wrote:
> >>
> >>> Hi Amit,
> >>>
> >>> I think your proposal is related to:
> >>>
> >>> https://issues.apache.org/jira/browse/BEAM-320
> >>>
> >>> As described in the Jira, what I'm planning to provide (in dedicated
> >>> Maven modules) is a Beam distribution including:
> >>> - an uber jar to wrap the dependencies
> >>> - the underlying runtime backends
> >>> - etc.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> >>>
> >>>> Hi everyone,
> >>>>
> >>>> Lately I've encountered a number of issues concerning the fact that
> >>>> the Spark runner does not package Spark along with it, forcing people
> >>>> to do this on their own.
> >>>> In addition, this seems to get in the way of having beam-examples
> >>>> executed against the Spark runner, again because it would have to add
> >>>> Spark dependencies.
> >>>>
> >>>> When running on a cluster (which I guess was the original goal here),
> >>>> it is recommended to have Spark provided by the cluster - this makes
> >>>> sense for Spark clusters, and more so for Spark + YARN clusters where
> >>>> you might have your Spark built against a specific Hadoop version or
> >>>> using a vendor distribution.
> >>>>
> >>>> In order to make the runner more accessible to new adopters, I suggest
> >>>> we consider releasing a "spark-included" artifact as well.
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> Thanks,
> >>>> Amit
> >>>>
> >>>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> [email protected]
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>>
> >>>
> >>
> > --
> > Jean-Baptiste Onofré
> > [email protected]
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
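
A minimal sketch of the local-Spark profile suggested above, assuming the
Spark dependencies are declared with provided scope by default and that
versions are managed elsewhere in the pom (the profile id and the
Scala-suffixed artifact IDs are illustrative, not taken from the actual
Beam poms):

  <profiles>
    <profile>
      <!-- activate with `mvn -Plocal-spark ...` to put Spark on the runtime
           classpath for local runs, instead of expecting the cluster to
           provide it -->
      <id>local-spark</id>
      <dependencies>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.10</artifactId>
          <scope>runtime</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_2.10</artifactId>
          <scope>runtime</scope>
        </dependency>
      </dependencies>
    </profile>
  </profiles>

With something like this the default build keeps Spark provided for cluster
submission, while `-Plocal-spark` re-scopes it to runtime for local testing.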

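As for the "spark-included" flavor / uber jar that Amit and JB mention, one
way it could be produced is with the maven-shade-plugin in a dedicated
module. This is only a sketch of the general approach (the attached
classifier name is made up), not the actual distribution modules JB is
experimenting with:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <!-- merge META-INF/services entries so registrations survive shading -->
          <transformers>
            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
          </transformers>
          <!-- publish the bundle alongside the plain jar under a separate classifier -->
          <shadedArtifactAttached>true</shadedArtifactAttached>
          <shadedClassifierName>spark-included</shadedClassifierName>
        </configuration>
      </execution>
    </executions>
  </plugin>

The Spark dependencies would then be declared with compile scope in that
module so they end up inside the bundled jar.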