Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Sat, Apr 16, 2016 at 5:38 PM, Evan Chan wrote:

> Hi folks,
>
> Sorry to join the discussion late. I had a look at the design doc earlier in
> this thread, and it does not mention what types of projects are the targets
> of this new "Spark Extras" ASF umbrella.
>
> Is the desire to have a maintained set of Spark-related projects that keep
> pace with the main Spark development schedule? Is it just for streaming
> connectors? What about data sources and other important projects in the
> Spark ecosystem?

The proposal draft below has some more details on the types of projects, but in summary, "Spark Extras" would be a good place for any of the components you mentioned.

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing

> I'm worried that this would relegate spark-packages to third-tier status,

Owen answered a similar question about spark-packages earlier in this thread: while "Spark Extras" would be a place in Apache for collaborating on the development of these extensions, they could still be published to spark-packages, as the existing streaming connectors are today.

> and the promotion of a select set of committers, and the project itself, to
> top-level ASF status (a la Arrow) would create a further split in the
> community.

As for the select set of committers: we have invited all Spark committers to be committers on the project, and I have updated the project proposal with the existing set of active Spark committers (those who have committed in the last year).

> -Evan
>
> On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran wrote:
> >
> > On 15/04/2016, 17:41, "Mattmann, Chris A (3980)"
> > <chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> >> Yeah, in support of this statement, I think that my primary interest in
> >> this Spark Extras and the good work by Luciano here is that any time we
> >> take bits out of a code base and “move it to GitHub” I see a bad
> >> precedent being set.
> >>
> >> Creating this project at the ASF creates a synergy between *Apache
> >> Spark*, which is *at the ASF*.
> >>
> >> We welcome comments, and as Luciano said, this is meant to invite and be
> >> open to those in the Apache Spark PMC to join and help.
> >>
> >> Cheers,
> >> Chris
> >
> > As one of the people named, here's my rationale:
> >
> > Throwing stuff onto GitHub creates that world of branches, and it's no
> > longer something that can be managed through the ASF, where "managed"
> > means governance, participation, and a release process that includes
> > auditing dependencies, code sign-off, etc.
> >
> > As an example, there's a mutant Hive JAR which Spark uses; that's
> > something which has evolved between my repo and Patrick Wendell's. Now
> > that Josh Rosen has taken on the bold task of "trying to move Spark and
> > Twill to Kryo 3", he's going to own that code, and the reference branch
> > will move somewhere else.
> >
> > In contrast, if there were an ASF location for this, it'd be something
> > anyone with commit rights could maintain and publish.
> >
> > (Actually, I've just realised life is hard here, as the Hive JAR is a
> > fork of ASF Hive; really the Spark branch should be a separate branch in
> > Hive's own repo. But the concept is the same: those bits of the codebase
> > which are core parts of the Spark project should really live in or near
> > it.)
> >
> > If everyone on the Spark commit list gets write access to this extras
> > repo, moving things is straightforward. Release-wise, things could/should
> > be in sync.
> >
> > If there's a risk, it's the eternal problem of the contrib/ dir: stuff
> > ends up there that never gets maintained. I don't see that being any
> > worse than if things were thrown to the wind of a thousand GitHub repos:
> > at least now there'd be a central issue-tracking location.

--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Using local-cluster mode for testing Spark-related projects
Hey folks,

I'd like to use local-cluster mode in my Spark-related projects to test Spark functionality in an automated way against a simulated local cluster. The idea is to test multi-process behaviour much more easily than by setting up a real cluster. However, getting this up and running in a separate project (I'm using Scala 2.10 and ScalaTest) is nontrivial. Does anyone have suggestions for getting it working?

This is what I've observed so far (I'm testing against 1.5.1, but suspect this applies equally to 1.6.x):

- One needs a real Spark distro and must point to it using SPARK_HOME.
- SPARK_SCALA_VERSION needs to be set.
- One needs to manually inject jar paths, otherwise dependencies are missing; for example, build an assembly jar of all your deps. Java class directory hierarchies don't seem to work with setJars(...).

How do Spark's internal scripts make it possible to run local-cluster mode and set up all the class paths correctly? And is it possible to mimic this setup for external Spark projects?

thanks,
Evan

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
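[Editor's note: a minimal sketch of the setup described above, assuming a Spark 1.5.x distribution is installed with SPARK_HOME and SPARK_SCALA_VERSION exported, and that an assembly jar of the project's test dependencies has been built. The object name, app name, and jar path are hypothetical.]

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of a driver exercising local-cluster mode. Requires a real Spark
// distribution on the machine:
//   export SPARK_HOME=/path/to/spark
//   export SPARK_SCALA_VERSION=2.10
object LocalClusterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      // local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
      .setMaster("local-cluster[2,1,1024]")
      .setAppName("local-cluster-test")
      // Executors run in separate JVMs, so test code must be shipped as
      // jars; a plain class directory does not work here. The path below
      // is a hypothetical assembly jar of the project and its deps.
      .setJars(Seq("target/scala-2.10/my-tests-assembly.jar"))

    val sc = new SparkContext(conf)
    try {
      // A multi-partition job that touches both worker processes.
      val sum = sc.parallelize(1 to 100, 4).map(_ * 2).reduce(_ + _)
      println(s"sum = $sum")
    } finally {
      sc.stop()
    }
  }
}
```

In a ScalaTest suite the same SparkConf would be built in beforeAll and the context stopped in afterAll; the key constraints are the `local-cluster[...]` master URL and shipping dependencies via setJars.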
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hi folks,

Sorry to join the discussion late. I had a look at the design doc earlier in this thread, and it does not mention what types of projects are the targets of this new "Spark Extras" ASF umbrella.

Is the desire to have a maintained set of Spark-related projects that keep pace with the main Spark development schedule? Is it just for streaming connectors? What about data sources and other important projects in the Spark ecosystem?

I'm worried that this would relegate spark-packages to third-tier status, and that the promotion of a select set of committers, and the project itself, to top-level ASF status (a la Arrow) would create a further split in the community.

-Evan

On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran wrote:
>
> On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" wrote:
>
>> Yeah, in support of this statement, I think that my primary interest in
>> this Spark Extras and the good work by Luciano here is that any time we
>> take bits out of a code base and “move it to GitHub” I see a bad
>> precedent being set.
>>
>> Creating this project at the ASF creates a synergy between *Apache
>> Spark*, which is *at the ASF*.
>>
>> We welcome comments, and as Luciano said, this is meant to invite and be
>> open to those in the Apache Spark PMC to join and help.
>>
>> Cheers,
>> Chris
>
> As one of the people named, here's my rationale:
>
> Throwing stuff onto GitHub creates that world of branches, and it's no
> longer something that can be managed through the ASF, where "managed"
> means governance, participation, and a release process that includes
> auditing dependencies, code sign-off, etc.
>
> As an example, there's a mutant Hive JAR which Spark uses; that's
> something which has evolved between my repo and Patrick Wendell's. Now
> that Josh Rosen has taken on the bold task of "trying to move Spark and
> Twill to Kryo 3", he's going to own that code, and the reference branch
> will move somewhere else.
> In contrast, if there were an ASF location for this, it'd be something
> anyone with commit rights could maintain and publish.
>
> (Actually, I've just realised life is hard here, as the Hive JAR is a
> fork of ASF Hive; really the Spark branch should be a separate branch in
> Hive's own repo. But the concept is the same: those bits of the codebase
> which are core parts of the Spark project should really live in or near
> it.)
>
> If everyone on the Spark commit list gets write access to this extras
> repo, moving things is straightforward. Release-wise, things could/should
> be in sync.
>
> If there's a risk, it's the eternal problem of the contrib/ dir: stuff
> ends up there that never gets maintained. I don't see that being any
> worse than if things were thrown to the wind of a thousand GitHub repos:
> at least now there'd be a central issue-tracking location.
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" wrote:

> Yeah, in support of this statement, I think that my primary interest in
> this Spark Extras and the good work by Luciano here is that any time we
> take bits out of a code base and “move it to GitHub” I see a bad precedent
> being set.
>
> Creating this project at the ASF creates a synergy between *Apache Spark*,
> which is *at the ASF*.
>
> We welcome comments, and as Luciano said, this is meant to invite and be
> open to those in the Apache Spark PMC to join and help.
>
> Cheers,
> Chris

As one of the people named, here's my rationale:

Throwing stuff onto GitHub creates that world of branches, and it's no longer something that can be managed through the ASF, where "managed" means governance, participation, and a release process that includes auditing dependencies, code sign-off, etc.

As an example, there's a mutant Hive JAR which Spark uses; that's something which has evolved between my repo and Patrick Wendell's. Now that Josh Rosen has taken on the bold task of "trying to move Spark and Twill to Kryo 3", he's going to own that code, and the reference branch will move somewhere else.

In contrast, if there were an ASF location for this, it'd be something anyone with commit rights could maintain and publish.

(Actually, I've just realised life is hard here, as the Hive JAR is a fork of ASF Hive; really the Spark branch should be a separate branch in Hive's own repo. But the concept is the same: those bits of the codebase which are core parts of the Spark project should really live in or near it.)

If everyone on the Spark commit list gets write access to this extras repo, moving things is straightforward. Release-wise, things could/should be in sync.

If there's a risk, it's the eternal problem of the contrib/ dir: stuff ends up there that never gets maintained. I don't see that being any worse than if things were thrown to the wind of a thousand GitHub repos: at least now there'd be a central issue-tracking location.