Good discussion subject Amit. I'll let the whole Beam distribution subject continue in BEAM-320; however, there is a not-yet-discussed aspect of the Spark runner: the Maven behavior.
When you import the Beam Spark runner as a dependency you are obliged to provide your Spark dependencies by hand too; with the other runners, once you import the runner everything just works, e.g. google-cloud-dataflow-runner and flink-runner. I understand the arguments for the current setup (the ones you mention), but I think it is more user-friendly to be consistent with the other runners and have something that just works as the default (and to solve the examples issue as a consequence).

Anyway, I think in the Spark case we need both: a 'spark-included' flavor and the current one, which is really useful for including the runner as a Spark library dependency (like Jesse did in his video) or as a spark-package. Actually, both the all-included and the runner-only artifacts make sense for Flink too, but that is a different discussion ;) (I have put a rough pom.xml sketch of the two flavors below the quoted thread.)

What do you think about this? What do the others think?

Ismaël

On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <[email protected]> wrote:

> No problem and good idea to discuss in the Jira.
>
> Actually, I started to experiment a bit with Beam distributions on a
> branch (that I can share with people interested).
>
> Regards
> JB
>
> On 07/07/2016 10:12 PM, Amit Sela wrote:
>
>> Thanks JB, I've missed that one.
>>
>> I suggest we continue this in the ticket comments.
>>
>> Thanks,
>> Amit
>>
>> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>
>>> Hi Amit,
>>>
>>> I think your proposal is related to:
>>>
>>> https://issues.apache.org/jira/browse/BEAM-320
>>>
>>> As described in the Jira, what I'm planning to provide (in dedicated
>>> Maven modules) is a Beam distribution including:
>>> - an uber jar to wrap the dependencies
>>> - the underlying runtime backends
>>> - etc
>>>
>>> Regards
>>> JB
>>>
>>> On 07/07/2016 07:49 PM, Amit Sela wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Lately I've encountered a number of issues concerning the fact that
>>>> the Spark runner does not package Spark along with it, forcing people
>>>> to do this on their own.
>>>> In addition, this seems to get in the way of having beam-examples
>>>> executed against the Spark runner, again because it would have to add
>>>> Spark dependencies.
>>>>
>>>> When running on a cluster (which I guess was the original goal here),
>>>> it is recommended to have Spark provided by the cluster - this makes
>>>> sense for Spark clusters and more so for Spark + YARN clusters where
>>>> you might have your Spark built against a specific Hadoop version or
>>>> using a vendor distribution.
>>>>
>>>> In order to make the runner more accessible to new adopters, I suggest
>>>> considering the release of a "spark-included" artifact as well.
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks,
>>>> Amit
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> [email protected]
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
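
P.S. To make the comparison concrete, here is a rough sketch of what the two flavors could look like in a user's pom.xml. The Spark coordinates and the 'spark-included' artifact id below are only illustrative for the sake of the discussion, not actual released coordinates:

    <!-- Current setup: the runner alone, with Spark added by hand and
         typically 'provided' by the cluster. Artifact ids and versions
         are illustrative. -->
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-spark</artifactId>
      <version>${beam.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>

    <!-- Hypothetical 'spark-included' flavor: a single artifact that
         pulls in Spark transitively, so the examples and new adopters
         just work out of the box. The artifact id is made up here. -->
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-spark-included</artifactId>
      <version>${beam.version}</version>
    </dependency>

The first block is what users have to write today; the second is the kind of one-liner the other runners effectively give you.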
