Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-16 Thread Luciano Resende
On Sat, Apr 16, 2016 at 5:38 PM, Evan Chan  wrote:

> Hi folks,
>
> Sorry to join the discussion late.  I had a look at the design doc
> earlier in this thread, and it didn't mention what types of projects
> are the targets of this new "Spark Extras" ASF umbrella.
>
> Is the desire to have a maintained set of Spark-related projects that
> keep pace with the main Spark development schedule?  Is it just for
> streaming connectors?  What about data sources and other important
> projects in the Spark ecosystem?
>

The proposal draft below has more details on the types of projects in
scope, but in summary, "Spark-Extras" would be a good place for any of the
components you mentioned.

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing


>
> I'm worried that this would relegate spark-packages to third-tier
> status,


Owen answered a similar question about spark-packages earlier in this
thread. While "Spark-Extras" would be a place at Apache for collaborating
on the development of these extensions, they could still be published to
spark-packages, just as the existing streaming connectors are today.


> and the promotion of a select set of committers, and the
> project itself, to top-level ASF status (à la Arrow) would create a
> further split in the community.
>
>
As for the select set of committers: we have invited all Spark committers
to be committers on the project, and I have updated the project proposal
with the existing set of active Spark committers (those who have committed
in the last year).


>
> -Evan
>
> On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran 
> wrote:
> >
> >
> >
> >
> >
> > On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> >>Yeah, in support of this statement: I think my primary interest in
> >>this Spark Extras effort and the good work by Luciano here is that
> >>anytime we take bits out of a code base and “move them to GitHub”, I
> >>see a bad precedent being set.
> >>
> >>Creating this project at the ASF creates synergy with *Apache Spark*,
> >>which is *at the ASF*.
> >>
> >>We welcome comments and, as Luciano said, this is meant as an open
> >>invitation to those on the Apache Spark PMC to join and help.
> >>
> >>Cheers,
> >>Chris
> >
> > As one of the people named, here's my rationale:
> >
> > Throwing stuff into GitHub creates that world of branches, and it's no
> longer something that can be managed through the ASF, where "managed"
> means: governance, participation, and a release process that includes
> auditing dependencies, code sign-off, etc.
> >
> >
> > As an example, there's a mutant Hive JAR which Spark uses; it has
> evolved between my repo and Patrick Wendell's. Now that Josh Rosen has
> taken on the bold task of "trying to move spark and twill to Kryo 3",
> he's going to own that code, and the reference branch will move somewhere
> else.
> >
> > In contrast, if there were an ASF location for this, then it'd be
> something anyone with commit rights could maintain and publish.
> >
> > (Actually, I've just realised life is hard here, as the hive JAR is a
> fork of ASF Hive; really, the Spark branch should be a separate branch in
> Hive's own repo ... But the concept is the same: those bits of the
> codebase which are core parts of the Spark project should really live in
> or near it.)
> >
> >
> > If everyone on the Spark commit list gets write access to this extras
> repo, moving things is straightforward. Release-wise, things could/should
> be in sync.
> >
> > If there's a risk, it's the eternal problem of the contrib/ dir: stuff
> ends up there that never gets maintained. I don't see that being any
> worse than if things were thrown to the wind of a thousand GitHub repos:
> at least now there'd be a central issue-tracking location.
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Using local-cluster mode for testing Spark-related projects

2016-04-16 Thread Evan Chan
Hey folks,

I'd like to use local-cluster mode in my Spark-related projects to
test Spark functionality in an automated way against a simulated local
cluster.  The idea is to test multi-process behavior much more easily
than by setting up a real cluster.  However, getting this up and
running in a separate project (I'm using Scala 2.10 and ScalaTest) is
nontrivial.  Does anyone have any suggestions for getting it working?

This is what I've observed so far (I'm testing against 1.5.1, but
suspect this would apply equally to 1.6.x):

- One needs to have a real Spark distro and point to it using SPARK_HOME
- SPARK_SCALA_VERSION needs to be set
- One needs to manually inject jar paths, otherwise dependencies are
missing; for example, build an assembly jar of all your deps.  Plain
Java class directory hierarchies don't seem to work with setJars(...).

How do Spark's internal scripts make it possible to run
local-cluster mode and set up all the classpaths correctly?  And is
it possible to mimic this setup for external Spark projects?
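
For concreteness, here's a minimal sketch of the kind of setup I've been
trying (the assembly-jar path and the worker sizing numbers are just
placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumes the environment is already prepared, e.g.:
    //   export SPARK_HOME=/opt/spark-1.5.1-bin-hadoop2.6
    //   export SPARK_SCALA_VERSION=2.10
    val conf = new SparkConf()
      .setMaster("local-cluster[2,1,1024]") // 2 workers, 1 core, 1024 MB each
      .setAppName("local-cluster-smoke-test")
      // Executors run in separate JVMs, so dependencies have to be shipped
      // as jars; a plain class directory doesn't seem to work here.
      .setJars(Seq("target/scala-2.10/my-project-assembly.jar")) // placeholder

    val sc = new SparkContext(conf)
    try {
      // Sanity check: run a small job across the simulated cluster
      val sum = sc.parallelize(1 to 100, 4).map(_ * 2).reduce(_ + _)
      assert(sum == 10100)
    } finally {
      sc.stop()
    }

In a ScalaTest suite this would presumably live in a beforeAll/afterAll
pair.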

thanks,
Evan

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-16 Thread Evan Chan
Hi folks,

Sorry to join the discussion late.  I had a look at the design doc
earlier in this thread, and it didn't mention what types of projects
are the targets of this new "Spark Extras" ASF umbrella.

Is the desire to have a maintained set of Spark-related projects that
keep pace with the main Spark development schedule?  Is it just for
streaming connectors?  What about data sources and other important
projects in the Spark ecosystem?

I'm worried that this would relegate spark-packages to third-tier
status, and that the promotion of a select set of committers, and the
project itself, to top-level ASF status (à la Arrow) would create a
further split in the community.

-Evan

On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran  wrote:
>
>
>
>
>
> On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" 
>  wrote:
>
>>Yeah, in support of this statement: I think my primary interest in
>>this Spark Extras effort and the good work by Luciano here is that
>>anytime we take bits out of a code base and “move them to GitHub”, I see
>>a bad precedent being set.
>>
>>Creating this project at the ASF creates synergy with *Apache Spark*,
>>which is *at the ASF*.
>>
>>We welcome comments and, as Luciano said, this is meant as an open
>>invitation to those on the Apache Spark PMC to join and help.
>>
>>Cheers,
>>Chris
>
> As one of the people named, here's my rationale:
>
> Throwing stuff into GitHub creates that world of branches, and it's no
> longer something that can be managed through the ASF, where "managed"
> means: governance, participation, and a release process that includes
> auditing dependencies, code sign-off, etc.
>
>
> As an example, there's a mutant Hive JAR which Spark uses; it has evolved
> between my repo and Patrick Wendell's. Now that Josh Rosen has taken on
> the bold task of "trying to move spark and twill to Kryo 3", he's going
> to own that code, and the reference branch will move somewhere else.
>
> In contrast, if there were an ASF location for this, then it'd be
> something anyone with commit rights could maintain and publish.
>
> (Actually, I've just realised life is hard here, as the hive JAR is a
> fork of ASF Hive; really, the Spark branch should be a separate branch in
> Hive's own repo ... But the concept is the same: those bits of the
> codebase which are core parts of the Spark project should really live in
> or near it.)
>
>
> If everyone on the Spark commit list gets write access to this extras
> repo, moving things is straightforward. Release-wise, things could/should
> be in sync.
>
> If there's a risk, it's the eternal problem of the contrib/ dir: stuff
> ends up there that never gets maintained. I don't see that being any
> worse than if things were thrown to the wind of a thousand GitHub repos:
> at least now there'd be a central issue-tracking location.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-16 Thread Steve Loughran





On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" 
 wrote:

>Yeah, in support of this statement: I think my primary interest in
>this Spark Extras effort and the good work by Luciano here is that
>anytime we take bits out of a code base and “move them to GitHub”, I see
>a bad precedent being set.
>
>Creating this project at the ASF creates synergy with *Apache Spark*,
>which is *at the ASF*.
>
>We welcome comments and, as Luciano said, this is meant as an open
>invitation to those on the Apache Spark PMC to join and help.
>
>Cheers,
>Chris

As one of the people named, here's my rationale:

Throwing stuff into GitHub creates that world of branches, and it's no longer
something that can be managed through the ASF, where "managed" means:
governance, participation, and a release process that includes auditing
dependencies, code sign-off, etc.


As an example, there's a mutant Hive JAR which Spark uses; it has evolved
between my repo and Patrick Wendell's. Now that Josh Rosen has taken on the
bold task of "trying to move spark and twill to Kryo 3", he's going to own
that code, and the reference branch will move somewhere else.

In contrast, if there were an ASF location for this, then it'd be something
anyone with commit rights could maintain and publish.

(Actually, I've just realised life is hard here, as the hive JAR is a fork
of ASF Hive; really, the Spark branch should be a separate branch in Hive's
own repo ... But the concept is the same: those bits of the codebase which
are core parts of the Spark project should really live in or near it.)


If everyone on the Spark commit list gets write access to this extras repo,
moving things is straightforward. Release-wise, things could/should be in
sync.

If there's a risk, it's the eternal problem of the contrib/ dir: stuff ends
up there that never gets maintained. I don't see that being any worse than
if things were thrown to the wind of a thousand GitHub repos: at least now
there'd be a central issue-tracking location.