Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Just want to provide a quick update that we have submitted the "Spark Extras" proposal for review by the Apache board (see the link below for the contents):

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing

Note that we are on a quest for a project name that does not have "Spark" in it, and we will provide an update here when we find a suitable one. Suggestions are welcome (please send them directly to my inbox to avoid flooding the mailing list).

Thanks

On Sun, Apr 17, 2016 at 9:16 AM, Luciano Resende wrote:

> [...]
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Evan,

As long as you meet the criteria we discussed on this thread, you are welcome to join. Having said that, I have already seen other contributors who are very active on some of the connectors but are not Apache committers yet; I wanted to be fair, and also to avoid using the project as an avenue to bring new committers to Apache.

On Sun, Apr 17, 2016 at 10:07 PM, Evan Chan wrote:

> Hi Luciano,
>
> I see that you are inviting all the Spark committers to this new project. What about the chief maintainers of important Spark ecosystem projects, which are not on the Spark PMC?
>
> For example, I am the chief maintainer of the Spark Job Server, which is one of the most active projects in the larger Spark ecosystem. Would projects like this be part of your vision? If so, it would be a good step of faith to reach out to us that maintain the active ecosystem projects. (I’m not saying you should put me in :) but rather suggesting that if this is your aim, it would be good to reach out beyond just the Spark PMC members.
>
> thanks,
> Evan
>
> On Apr 17, 2016, at 9:16 AM, Luciano Resende wrote:
>
> [...]
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Sat, Apr 16, 2016 at 11:12 PM, Reynold Xin wrote:

> First, really thank you for leading the discussion.
>
> I am concerned that it'd hurt Spark more than it helps. As many others have pointed out, this unnecessarily creates a new tier of connectors or 3rd-party libraries appearing to be endorsed by the Spark PMC or the ASF. We can alleviate this concern by not having "Spark" in the name, and the project proposal and documentation should label clearly that this is not affiliated with Spark.

I really thought we could use the Spark name (e.g. similar to spark-packages), as this project is really aligned with and dedicated to curating extensions to Apache Spark; that's why we were inviting Spark PMC members to join the new project PMC, so that Apache Spark has the necessary oversight and influence on the project direction. I understand folks have concerns with the name, so we will start looking into name alternatives unless there is some way I can address the community's concerns around this.

> Also Luciano - assuming you are interested in creating a project like this and finding a home for the connectors that were removed, I find it surprising that few of the initially proposed PMC members have actually contributed much to the connectors, and people that have contributed a lot were left out. I am sure that is just an oversight.

Reynold, thanks for your concern; we are not leaving anyone out. We used the following criteria to identify the initial PMC/committer list, as described in the first e-mail on this thread:

- Spark committers and Apache Members can request to participate as PMC members
- All active Spark committers (committed in the last year) will have write access to the project (committer access)
- Other committers can request to become committers.
- Non-committers would be added based on meritocracy after the start of the project.
Based on these criteria, all people who have expressed interest in joining the project PMC have been added to it, but I don't feel comfortable adding names to it at will. I have updated the list of committers, and currently we have the following on the draft proposal:

Initial PMC

- Luciano Resende (lresende AT apache DOT org) (Apache Member)
- Chris Mattmann (mattmann AT apache DOT org) (Apache Member, Apache board member)
- Steve Loughran (stevel AT apache DOT org) (Apache Member)
- Jean-Baptiste Onofré (jbonofre AT apache DOT org) (Apache Member)
- Marcelo Masiero Vanzin (vanzin AT apache DOT org) (Apache Spark committer)
- Sean R. Owen (srowen AT apache DOT org) (Apache Member and Spark PMC)
- Mridul Muralidharan (mridulm80 AT apache DOT org) (Apache Spark PMC)

Initial Committers (write access for active Spark committers that have committed in the last year)

- Andy Konwinski (andrew AT apache DOT org) (Apache Spark)
- Andrew Or (andrewor14 AT apache DOT org) (Apache Spark)
- Ankur Dave (ankurdave AT apache DOT org) (Apache Spark)
- Davies Liu (davies AT apache DOT org) (Apache Spark)
- DB Tsai (dbtsai AT apache DOT org) (Apache Spark)
- Haoyuan Li (haoyuan AT apache DOT org) (Apache Spark)
- Ram Sriharsha (harsha AT apache DOT org) (Apache Spark)
- Herman van Hövell (hvanhovell AT apache DOT org) (Apache Spark)
- Imran Rashid (irashid AT apache DOT org) (Apache Spark)
- Joseph Kurata Bradley (jkbradley AT apache DOT org) (Apache Spark)
- Josh Rosen (joshrosen AT apache DOT org) (Apache Spark)
- Kay Ousterhout (kayousterhout AT apache DOT org) (Apache Spark)
- Cheng Lian (lian AT apache DOT org) (Apache Spark)
- Mark Hamstra (markhamstra AT apache DOT org) (Apache Spark)
- Michael Armbrust (marmbrus AT apache DOT org) (Apache Spark)
- Matei Alexandru Zaharia (matei AT apache DOT org) (Apache Spark)
- Xiangrui Meng (meng AT apache DOT org) (Apache Spark)
- Prashant Sharma (prashant AT apache DOT org) (Apache Spark)
- Patrick Wendell (pwendell AT apache DOT org) (Apache Spark)
- Reynold Xin (rxin AT apache DOT org) (Apache Spark)
- Sanford Ryza (sandy AT apache DOT org) (Apache Spark)
- Kousuke Saruta (sarutak AT apache DOT org) (Apache Spark)
- Shivaram Venkataraman (shivaram AT apache DOT org) (Apache Spark)
- Tathagata Das (tdas AT apache DOT org) (Apache Spark)
- Thomas Graves (tgraves AT apache DOT org) (Apache Spark)
- Wenchen Fan (wenchen AT apache DOT org) (Apache Spark)
- Yin Huai (yhuai AT apache DOT org) (Apache Spark)
- Shixiong Zhu (zsxwing AT apache DOT org) (Apache Spark)

BTW, it would be really good to have you on the PMC as well, along with any others who volunteer based on the criteria above. May I add you as a PMC member to the new project proposal?

--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
First, really thank you for leading the discussion.

I am concerned that it'd hurt Spark more than it helps. As many others have pointed out, this unnecessarily creates a new tier of connectors or 3rd-party libraries appearing to be endorsed by the Spark PMC or the ASF. We can alleviate this concern by not having "Spark" in the name, and the project proposal and documentation should label clearly that this is not affiliated with Spark.

Also Luciano - assuming you are interested in creating a project like this and finding a home for the connectors that were removed, I find it surprising that few of the initially proposed PMC members have actually contributed much to the connectors, and people that have contributed a lot were left out. I am sure that is just an oversight.

On Sat, Apr 16, 2016 at 10:42 PM, Luciano Resende wrote:

> [...]
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Sat, Apr 16, 2016 at 5:38 PM, Evan Chan wrote:

> Hi folks,
>
> Sorry to join the discussion late. I had a look at the design doc earlier in this thread, and it was not mentioned what types of projects are the targets of this new "spark extras" ASF umbrella.
>
> Is the desire to have a maintained set of spark-related projects that keep pace with the main Spark development schedule? Is it just for streaming connectors? What about data sources, and other important projects in the Spark ecosystem?

The proposal draft below has some more details on what types of projects, but in summary, "Spark-Extras" would be a good place for any of the components you mentioned.

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing

> I'm worried that this would relegate spark-packages to third tier status,

Owen answered a similar question about spark-packages earlier on this thread, but while "Spark-Extras" would be a place at Apache for collaboration on the development of these extensions, they might still be published to spark-packages as the existing streaming connectors are today.

> and the promotion of a select set of committers, and the project itself, to top level ASF status (a la Arrow) would create a further split in the community.

As for the select set of committers, we have invited all Spark committers to be committers on the project, and I have updated the project proposal with the existing set of active Spark committers (that have committed in the last year).

> -Evan
>
> On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran wrote:
>
> [...]
--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hi folks,

Sorry to join the discussion late. I had a look at the design doc earlier in this thread, and it was not mentioned what types of projects are the targets of this new "spark extras" ASF umbrella.

Is the desire to have a maintained set of spark-related projects that keep pace with the main Spark development schedule? Is it just for streaming connectors? What about data sources, and other important projects in the Spark ecosystem?

I'm worried that this would relegate spark-packages to third-tier status, and the promotion of a select set of committers, and the project itself, to top-level ASF status (a la Arrow) would create a further split in the community.

-Evan

On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran wrote:

> [...]
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" wrote:

> Yeah in support of this statement I think that my primary interest in this Spark Extras and the good work by Luciano here is that anytime we take bits out of a code base and “move it to GitHub” I see a bad precedent being set.
>
> Creating this project at the ASF creates a synergy between *Apache Spark* which is *at the ASF*.
>
> We welcome comments and as Luciano said, this is meant to invite and be open to those in the Apache Spark PMC to join and help.
>
> Cheers,
> Chris

As one of the people named, here's my rationale:

Throwing stuff into github creates that world of branches, and it's no longer something that can be managed through the ASF, where "managed" means governance, participation, and a release process that includes auditing dependencies, code sign-off, etc.

As an example, there's a mutant Hive JAR which Spark uses; it's something which has evolved between my repo and Patrick Wendell's, and now that Josh Rosen has taken on the bold task of "trying to move spark and twill to Kryo 3", he's going to own that code, and the reference branch will move somewhere else.

In contrast, if there was an ASF location for this, then it'd be something anyone with commit rights could maintain and publish.

(Actually, I've just realised life is hard here as the Hive JAR is a fork of ASF Hive —really the Spark branch should be a separate branch in Hive's own repo ... But the concept is the same: those bits of the codebase which are core parts of the Spark project should really live in or near it.)

If everyone on the Spark commit list gets write access to this extras repo, moving things is straightforward. Release-wise, things could/should be in sync.

If there's a risk, it's the eternal problem of the contrib/ dir: stuff ends up there that never gets maintained. I don't see that being any worse than if things were thrown to the wind of a thousand github repos: at least now there'd be a central issue-tracking location.
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Friday, April 15, 2016, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:

> Yeah in support of this statement I think that my primary interest in this Spark Extras and the good work by Luciano here is that anytime we take bits out of a code base and “move it to GitHub” I see a bad precedent being set.

Can't agree more!

> Creating this project at the ASF creates a synergy between *Apache Spark* which is *at the ASF*.

In addition, this will give all the "goodness" of being an Apache project from a user/consumer point of view, compared to a general github project.

> We welcome comments and as Luciano said, this is meant to invite and be open to those in the Apache Spark PMC to join and help.

This would definitely be something worthwhile to explore. +1

Regards,
Mridul

> [...]
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
100% agree with Sean & Reynold's comments on this. Adding this as a TLP would just cause more confusion as to "official" endorsement. On Fri, Apr 15, 2016 at 11:50 AM, Sean Owen wrote: > On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende wrote: >> I know the name might be confusing, but I also think that the projects have >> a very big synergy, more like sibling projects, where "Spark Extras" extends >> the Spark community and develop/maintain components for, and pretty much >> only for, Apache Spark. Based on your comment above, if making the project >> "Spark-Extras" a more acceptable name, I believe this is ok as well. > > This also grants special status to a third-party project. It's not > clear this should be *the* official unofficial third-party Spark > project over some other one. If something's to be blessed, it should > be in the Spark project. > > And why isn't it in the Spark project? the argument was that these > bits were not used and pretty de minimis as code. It's not up to me or > anyone else to tell you code X isn't useful to you. But arguing X > should be a TLP asserts it is substantial and of broad interest, since > there's non-zero effort for volunteers to deal with it. I am not sure > I've heard anyone argue that -- or did I miss it? because removing > bits of unused code happens all the time and isn't a bad precedent or > even unusual. > > It doesn't actually enable any more cooperation than is already > possible with any other project (like Kafka, Mesos, etc). You can run > the same governance model anywhere you like. I realize literally being > operated under the ASF banner is something different. > > What I hear here is a proposal to make an unofficial official Spark > project as a TLP, that begins with these fairly inconsequential > extras. I question the value of that on its face. Example: what goes > into this project? deleted Spark code only? 
or is this a glorified > "contrib" folder with a lower and somehow different bar determined by > different people? > > And at that stage... is it really helping to give that special status? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Yeah, in support of this statement: my primary interest in Spark Extras, and the good work by Luciano here, is that anytime we take bits out of a code base and “move them to GitHub” I see a bad precedent being set. Creating this project at the ASF creates synergy with *Apache Spark*, which is also *at the ASF*. We welcome comments, and as Luciano said, this is meant to invite and be open to those on the Apache Spark PMC to join and help. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 4/15/16, 9:39 AM, "Luciano Resende" wrote: > > >On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger > wrote: > >Given that not all of the connectors were removed, I think this >creates a weird / confusing three tier system > >1. connectors in the official project's spark/extras or spark/external >2. connectors in "Spark Extras" >3. connectors in some random organization's github > > >Agree Cody, and I think this is one of the goals of "Spark Extras", centralize >the development of these connectors under one central place at Apache, and >that's why one of our asks is to invite the Spark PMC to continue developing >the remaining connectors > that stayed in Spark proper, in "Spark Extras". We will also discuss some > process policies on enabling lowering the bar to allow proposal of these > other github extensions to be part of "Spark Extras" while also considering a > way to move code to a maintenance > mode location. > > >-- >Luciano Resende >http://twitter.com/lresende1975 >http://lresende.blogspot.com/
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hey Reynold, Thanks. Getting to the heart of this, I think this project would be successful if the Apache Spark PMC decided to participate and there was some overlap. As much as I think it would be great to stand up another project, the goal here from Luciano and crew (myself included) is to suggest that it’s just as easy to start an Apache Incubator project to manage “extra” pieces of Apache Spark code outside of the release cycle, and to address the other stated reasons for moving this code out of the code base. This isn’t a competing effort to some code on GitHub that was moved out of Apache source control from Apache Spark - it’s meant to be an enabler, to suggest that code could be managed here just as easily (see the difference?). Let me know what you think. Thanks, Reynold. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 4/15/16, 9:47 AM, "Reynold Xin" wrote: > > > >Anybody is free and welcomed to create another ASF project, but I don't think >"Spark extras" is a good name. It unnecessarily creates another tier of code >that ASF is "endorsing". >On Friday, April 15, 2016, Mattmann, Chris A (3980) > wrote: > >Yeah in support of this statement I think that my primary interest in >this Spark Extras and the good work by Luciano here is that anytime we >take bits out of a code base and “move it to GitHub” I see a bad precedent >being set. > >Creating this project at the ASF creates a synergy between *Apache Spark* >which is *at the ASF*.
> >We welcome comments and as Luciano said, this is meant to invite and be >open to those in the Apache Spark PMC to join and help. > >Cheers, >Chris > >++ >Chris Mattmann, Ph.D. >Chief Architect >Instrument Software and Science Data Systems Section (398) >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA >Office: 168-519, Mailstop: 168-527 >Email: >chris.a.mattm...@nasa.gov >WWW: http://sunset.usc.edu/~mattmann/ >++ >Director, Information Retrieval and Data Science Group (IRDS) >Adjunct Associate Professor, Computer Science Department >University of Southern California, Los Angeles, CA 90089 USA >WWW: http://irds.usc.edu/ >++ > > > > > > > > > > >On 4/15/16, 9:39 AM, "Luciano Resende" > >wrote: > >> >> >>On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger >>> wrote: >> >>Given that not all of the connectors were removed, I think this >>creates a weird / confusing three tier system >> >>1. connectors in the official project's spark/extras or spark/external >>2. connectors in "Spark Extras" >>3. connectors in some random organization's github >> >> >> >> >> >> >> >>Agree Cody, and I think this is one of the goals of "Spark Extras", >>centralize the development of these connectors under one central place at >>Apache, and that's why one of our asks is to invite the Spark PMC to continue >>developing the remaining connectors >> that stayed in Spark proper, in "Spark Extras". We will also discuss some >> process policies on enabling lowering the bar to allow proposal of these >> other github extensions to be part of "Spark Extras" while also considering >> a way to move code to a maintenance >> mode location. >> >> >> >> >>-- >>Luciano Resende >>http://twitter.com/lresende1975 >>http://lresende.blogspot.com/ >> >> >> >> > > >
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Yeah, so it’s the *Apache Spark* project. Just to clarify. Not once did you say Apache Spark below. On 4/15/16, 9:50 AM, "Sean Owen" wrote: >On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende wrote: >> I know the name might be confusing, but I also think that the projects have >> a very big synergy, more like sibling projects, where "Spark Extras" extends >> the Spark community and develop/maintain components for, and pretty much >> only for, Apache Spark. Based on your comment above, if making the project >> "Spark-Extras" a more acceptable name, I believe this is ok as well. > >This also grants special status to a third-party project. It's not >clear this should be *the* official unofficial third-party Spark >project over some other one. If something's to be blessed, it should >be in the Spark project. > >And why isn't it in the Spark project? the argument was that these >bits were not used and pretty de minimis as code. It's not up to me or >anyone else to tell you code X isn't useful to you. But arguing X >should be a TLP asserts it is substantial and of broad interest, since >there's non-zero effort for volunteers to deal with it. I am not sure >I've heard anyone argue that -- or did I miss it? because removing >bits of unused code happens all the time and isn't a bad precedent or >even unusual. > >It doesn't actually enable any more cooperation than is already >possible with any other project (like Kafka, Mesos, etc). You can run >the same governance model anywhere you like. I realize literally being >operated under the ASF banner is something different. > >What I hear here is a proposal to make an unofficial official Spark >project as a TLP, that begins with these fairly inconsequential >extras. I question the value of that on its face. Example: what goes >into this project? deleted Spark code only? or is this a glorified >"contrib" folder with a lower and somehow different bar determined by >different people? > >And at that stage... 
is it really helping to give that special status?
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
+1 Regards JB On 04/15/2016 06:41 PM, Mattmann, Chris A (3980) wrote: Yeah in support of this statement I think that my primary interest in this Spark Extras and the good work by Luciano here is that anytime we take bits out of a code base and “move it to GitHub” I see a bad precedent being set. Creating this project at the ASF creates a synergy between *Apache Spark* which is *at the ASF*. We welcome comments and as Luciano said, this is meant to invite and be open to those in the Apache Spark PMC to join and help. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ On 4/15/16, 9:39 AM, "Luciano Resende" wrote: On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger wrote: Given that not all of the connectors were removed, I think this creates a weird / confusing three tier system 1. connectors in the official project's spark/extras or spark/external 2. connectors in "Spark Extras" 3. connectors in some random organization's github Agree Cody, and I think this is one of the goals of "Spark Extras", centralize the development of these connectors under one central place at Apache, and that's why one of our asks is to invite the Spark PMC to continue developing the remaining connectors that stayed in Spark proper, in "Spark Extras". We will also discuss some process policies on enabling lowering the bar to allow proposal of these other github extensions to be part of "Spark Extras" while also considering a way to move code to a maintenance mode location. 
-- Luciano Resende http://twitter.com/lresende1975 http://lresende.blogspot.com/ -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende wrote: > I know the name might be confusing, but I also think that the projects have > a very big synergy, more like sibling projects, where "Spark Extras" extends > the Spark community and develop/maintain components for, and pretty much > only for, Apache Spark. Based on your comment above, if making the project > "Spark-Extras" a more acceptable name, I believe this is ok as well. This also grants special status to a third-party project. It's not clear this should be *the* official unofficial third-party Spark project over some other one. If something's to be blessed, it should be in the Spark project. And why isn't it in the Spark project? The argument was that these bits were not used and pretty de minimis as code. It's not up to me or anyone else to tell you code X isn't useful to you. But arguing X should be a TLP asserts it is substantial and of broad interest, since there's non-zero effort for volunteers to deal with it. I am not sure I've heard anyone argue that -- or did I miss it? Because removing bits of unused code happens all the time and isn't a bad precedent or even unusual. It doesn't actually enable any more cooperation than is already possible with any other project (like Kafka, Mesos, etc). You can run the same governance model anywhere you like. I realize literally being operated under the ASF banner is something different. What I hear here is a proposal to make an unofficial official Spark project as a TLP, that begins with these fairly inconsequential extras. I question the value of that on its face. Example: what goes into this project? Deleted Spark code only? Or is this a glorified "contrib" folder with a lower and somehow different bar determined by different people? And at that stage... is it really helping to give that special status?
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Anybody is free and welcomed to create another ASF project, but I don't think "Spark extras" is a good name. It unnecessarily creates another tier of code that ASF is "endorsing". On Friday, April 15, 2016, Mattmann, Chris A (3980) < chris.a.mattm...@jpl.nasa.gov> wrote: > Yeah in support of this statement I think that my primary interest in > this Spark Extras and the good work by Luciano here is that anytime we > take bits out of a code base and “move it to GitHub” I see a bad precedent > being set. > > Creating this project at the ASF creates a synergy between *Apache Spark* > which is *at the ASF*. > > We welcome comments and as Luciano said, this is meant to invite and be > open to those in the Apache Spark PMC to join and help. > > Cheers, > Chris > > ++ > Chris Mattmann, Ph.D. > Chief Architect > Instrument Software and Science Data Systems Section (398) > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA > Office: 168-519, Mailstop: 168-527 > Email: chris.a.mattm...@nasa.gov > WWW: http://sunset.usc.edu/~mattmann/ > ++ > Director, Information Retrieval and Data Science Group (IRDS) > Adjunct Associate Professor, Computer Science Department > University of Southern California, Los Angeles, CA 90089 USA > WWW: http://irds.usc.edu/ > ++ > > > > > > > > > > > On 4/15/16, 9:39 AM, "Luciano Resende" > wrote: > > > > > > >On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger > >> wrote: > > > >Given that not all of the connectors were removed, I think this > >creates a weird / confusing three tier system > > > >1. connectors in the official project's spark/extras or spark/external > >2. connectors in "Spark Extras" > >3. 
connectors in some random organization's github > > > > > > > > > > > > > > > >Agree Cody, and I think this is one of the goals of "Spark Extras", > centralize the development of these connectors under one central place at > Apache, and that's why one of our asks is to invite the Spark PMC to > continue developing the remaining connectors > > that stayed in Spark proper, in "Spark Extras". We will also discuss > some process policies on enabling lowering the bar to allow proposal of > these other github extensions to be part of "Spark Extras" while also > considering a way to move code to a maintenance > > mode location. > > > > > > > > > >-- > >Luciano Resende > >http://twitter.com/lresende1975 > >http://lresende.blogspot.com/ > > > > > > > > >
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
I think this is meant to be understood as a community site, and as a directory listing pointers to third-party projects. It's not a project of its own, not part of Spark itself, and has no special status. At least, I think that's how it should be presented, and it pretty much seems to come across that way. On Fri, Apr 15, 2016 at 5:33 PM, Chris Fregly wrote: > and how does this all relate to the existing 1-and-a-half-class citizen > known as spark-packages.org? > > support for this citizen is buried deep in the Spark source (which was > always a bit odd, in my opinion): > > https://github.com/apache/spark/search?utf8=%E2%9C%93&q=spark-packages
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger wrote: > Given that not all of the connectors were removed, I think this > creates a weird / confusing three tier system > > 1. connectors in the official project's spark/extras or spark/external > 2. connectors in "Spark Extras" > 3. connectors in some random organization's github > > Agree, Cody, and I think this is one of the goals of "Spark Extras": to centralize the development of these connectors in one place at Apache. That's why one of our asks is to invite the Spark PMC to continue developing, in "Spark Extras", the remaining connectors that stayed in Spark proper. We will also discuss process policies for lowering the bar to allow these other GitHub extensions to be proposed as part of "Spark Extras", while also considering a way to move code to a maintenance-mode location. -- Luciano Resende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Fri, Apr 15, 2016 at 9:18 AM, Sean Owen wrote: > Why would this need to be an ASF project of its own? I don't think > it's possible to have a yet another separate "Spark Extras" TLP (?) > > There is already a project to manage these bits of code on Github. How > about all of the interested parties manage the code there, under the > same process, under the same license, etc? > This whole discussion started when some of the connectors were moved from Apache to GitHub, which makes a statement that the "Spark governance" of the bits is something highly valued by the community, consumers, and other companies that are consuming open source code. Being an Apache project also allows it to use the shared Apache infrastructure to run the project. > > I'm not against calling it Spark Extras myself but I wonder if that > needlessly confuses the situation. They aren't part of the Spark TLP > on purpose, so trying to give it some special middle-ground status > might just be confusing. The thing that comes to mind immediately is > "Connectors for Apache Spark", spark-connectors, etc. > > I know the name might be confusing, but I also think that the projects have a very big synergy, more like sibling projects, where "Spark Extras" extends the Spark community and develops/maintains components for, and pretty much only for, Apache Spark. Based on your comment above, if "Spark-Extras" makes for a more acceptable project name, I believe that is ok as well. I also understand that the Spark PMC might have concerns with branding, and that's why we are inviting all members of the Spark PMC to join and help oversee and manage the project.
> > On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende > wrote: > > After some collaboration with other community members, we have created a > > initial draft for Spark Extras which is available for review at > > > > > https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing > > > > We would like to invite other community members to participate in the > > project, particularly the Spark Committers and PMC (feel free to express > > interest and I will update the proposal). Another option here is just to > > give ALL Spark committers write access to "Spark Extras". > > > > > > We also have couple asks from the Spark PMC : > > > > - Permission to use "Spark Extras" as the project name. We already > checked > > this with Apache Brand Management, and the recommendation was to discuss > and > > reach consensus with the Spark PMC. > > > > - We would also want to check with the Spark PMC that, in case of > > successfully creation of "Spark Extras", if the PMC would be willing to > > continue the development of the remaining connectors that stayed in Spark > > 2.0 codebase in the "Spark Extras" project. > > > > > > Thanks in advance, and we welcome any feedback around this proposal > before > > we present to the Apache Board for consideration. > > > > > > > > On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende > > wrote: > >> > >> I believe some of this has been resolved in the context of some parts > that > >> had interest in one extra connector, but we still have a few removed, > and as > >> you mentioned, we still don't have a simple way or willingness to > manage and > >> be current on new packages like kafka. And based on the fact that this > >> thread is still alive, I believe that other community members might have > >> other concerns as well. 
> >> > >> After some thought, I believe having a separate project (what was > >> mentioned here as Spark Extras) to handle Spark Connectors and Spark > add-ons > >> in general could be very beneficial to Spark and the overall Spark > >> community, which would have a central place in Apache to collaborate > around > >> related Spark components. > >> > >> Some of the benefits on this approach > >> > >> - Enables maintaining the connectors inside Apache, following the Apache > >> governance and release rules, while allowing Spark proper to focus on > the > >> core runtime. > >> - Provides more flexibility in controlling the direction (currency) of > the > >> existing connectors (e.g. willing to find a solution and maintain > multiple > >> versions of same connectors like kafka 0.8x and 0.9x) > >> - Becomes a home for other types of Spark related connectors helping > >> expanding the community around Spark (e.g. Zeppelin see most of it's > current > >> contribution around new/enhanced connectors) > >> > >> What are some requirements for Spark Extras to be successful: > >> > >> - Be up to date with Spark Trunk APIs (based on daily CIs against > >> SNAPSHOT) > >> - Adhere to Spark release cycles (have a very little window compared to > >> Spark release) > >> - Be more open and flexible to the set of connectors it will accept and > >> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we > >> have today) > >> > >> Where to start Spark Extras > >> > >> Depending on the inter
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Given that not all of the connectors were removed, I think this creates a weird / confusing three tier system 1. connectors in the official project's spark/extras or spark/external 2. connectors in "Spark Extras" 3. connectors in some random organization's github On Fri, Apr 15, 2016 at 11:18 AM, Sean Owen wrote: > Why would this need to be an ASF project of its own? I don't think > it's possible to have a yet another separate "Spark Extras" TLP (?) > > There is already a project to manage these bits of code on Github. How > about all of the interested parties manage the code there, under the > same process, under the same license, etc? > > I'm not against calling it Spark Extras myself but I wonder if that > needlessly confuses the situation. They aren't part of the Spark TLP > on purpose, so trying to give it some special middle-ground status > might just be confusing. The thing that comes to mind immediately is > "Connectors for Apache Spark", spark-connectors, etc. > > > On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende wrote: >> After some collaboration with other community members, we have created a >> initial draft for Spark Extras which is available for review at >> >> https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing >> >> We would like to invite other community members to participate in the >> project, particularly the Spark Committers and PMC (feel free to express >> interest and I will update the proposal). Another option here is just to >> give ALL Spark committers write access to "Spark Extras". >> >> >> We also have couple asks from the Spark PMC : >> >> - Permission to use "Spark Extras" as the project name. We already checked >> this with Apache Brand Management, and the recommendation was to discuss and >> reach consensus with the Spark PMC. 
>> >> - We would also want to check with the Spark PMC that, in case of >> successfully creation of "Spark Extras", if the PMC would be willing to >> continue the development of the remaining connectors that stayed in Spark >> 2.0 codebase in the "Spark Extras" project. >> >> >> Thanks in advance, and we welcome any feedback around this proposal before >> we present to the Apache Board for consideration. >> >> >> >> On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende >> wrote: >>> >>> I believe some of this has been resolved in the context of some parts that >>> had interest in one extra connector, but we still have a few removed, and as >>> you mentioned, we still don't have a simple way or willingness to manage and >>> be current on new packages like kafka. And based on the fact that this >>> thread is still alive, I believe that other community members might have >>> other concerns as well. >>> >>> After some thought, I believe having a separate project (what was >>> mentioned here as Spark Extras) to handle Spark Connectors and Spark add-ons >>> in general could be very beneficial to Spark and the overall Spark >>> community, which would have a central place in Apache to collaborate around >>> related Spark components. >>> >>> Some of the benefits on this approach >>> >>> - Enables maintaining the connectors inside Apache, following the Apache >>> governance and release rules, while allowing Spark proper to focus on the >>> core runtime. >>> - Provides more flexibility in controlling the direction (currency) of the >>> existing connectors (e.g. willing to find a solution and maintain multiple >>> versions of same connectors like kafka 0.8x and 0.9x) >>> - Becomes a home for other types of Spark related connectors helping >>> expanding the community around Spark (e.g. 
Zeppelin see most of it's current >>> contribution around new/enhanced connectors) >>> >>> What are some requirements for Spark Extras to be successful: >>> >>> - Be up to date with Spark Trunk APIs (based on daily CIs against >>> SNAPSHOT) >>> - Adhere to Spark release cycles (have a very little window compared to >>> Spark release) >>> - Be more open and flexible to the set of connectors it will accept and >>> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we >>> have today) >>> >>> Where to start Spark Extras >>> >>> Depending on the interest here, we could follow the steps of (Apache >>> Arrow) and start this directly as a TLP, or start as an incubator project. I >>> would consider the first option first. >>> >>> Who would participate >>> >>> Have thought about this for a bit, and if we go to the direction of TLP, I >>> would say Spark Committers and Apache Members can request to participate as >>> PMC members, while other committers can request to become committers. Non >>> committers would be added based on meritocracy after the start of the >>> project. >>> >>> Project Name >>> >>> It would be ideal if we could have a project name that shows close ties to >>> Spark (e.g. Spark Extras or Spark Connectors) but we will need permission >>> and support from whoever is going to evaluate the project proposal (e
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
and how does this all relate to the existing 1-and-a-half-class citizen known as spark-packages.org? support for this citizen is buried deep in the Spark source (which was always a bit odd, in my opinion): https://github.com/apache/spark/search?utf8=%E2%9C%93&q=spark-packages On Fri, Apr 15, 2016 at 12:18 PM, Sean Owen wrote: > Why would this need to be an ASF project of its own? I don't think > it's possible to have a yet another separate "Spark Extras" TLP (?) > > There is already a project to manage these bits of code on Github. How > about all of the interested parties manage the code there, under the > same process, under the same license, etc? > > I'm not against calling it Spark Extras myself but I wonder if that > needlessly confuses the situation. They aren't part of the Spark TLP > on purpose, so trying to give it some special middle-ground status > might just be confusing. The thing that comes to mind immediately is > "Connectors for Apache Spark", spark-connectors, etc. > > > On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende > wrote: > > After some collaboration with other community members, we have created a > > initial draft for Spark Extras which is available for review at > > > > > https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing > > > > We would like to invite other community members to participate in the > > project, particularly the Spark Committers and PMC (feel free to express > > interest and I will update the proposal). Another option here is just to > > give ALL Spark committers write access to "Spark Extras". > > > > > > We also have couple asks from the Spark PMC : > > > > - Permission to use "Spark Extras" as the project name. We already > checked > > this with Apache Brand Management, and the recommendation was to discuss > and > > reach consensus with the Spark PMC. 
> > > > - We would also want to check with the Spark PMC that, in case of > > successfully creation of "Spark Extras", if the PMC would be willing to > > continue the development of the remaining connectors that stayed in Spark > > 2.0 codebase in the "Spark Extras" project. > > > > > > Thanks in advance, and we welcome any feedback around this proposal > before > > we present to the Apache Board for consideration. > > > > > > > > On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende > > wrote: > >> > >> I believe some of this has been resolved in the context of some parts > that > >> had interest in one extra connector, but we still have a few removed, > and as > >> you mentioned, we still don't have a simple way or willingness to > manage and > >> be current on new packages like kafka. And based on the fact that this > >> thread is still alive, I believe that other community members might have > >> other concerns as well. > >> > >> After some thought, I believe having a separate project (what was > >> mentioned here as Spark Extras) to handle Spark Connectors and Spark > add-ons > >> in general could be very beneficial to Spark and the overall Spark > >> community, which would have a central place in Apache to collaborate > around > >> related Spark components. > >> > >> Some of the benefits on this approach > >> > >> - Enables maintaining the connectors inside Apache, following the Apache > >> governance and release rules, while allowing Spark proper to focus on > the > >> core runtime. > >> - Provides more flexibility in controlling the direction (currency) of > the > >> existing connectors (e.g. willing to find a solution and maintain > multiple > >> versions of same connectors like kafka 0.8x and 0.9x) > >> - Becomes a home for other types of Spark related connectors helping > >> expanding the community around Spark (e.g. 
Zeppelin see most of it's > current > >> contribution around new/enhanced connectors) > >> > >> What are some requirements for Spark Extras to be successful: > >> > >> - Be up to date with Spark Trunk APIs (based on daily CIs against > >> SNAPSHOT) > >> - Adhere to Spark release cycles (have a very little window compared to > >> Spark release) > >> - Be more open and flexible to the set of connectors it will accept and > >> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we > >> have today) > >> > >> Where to start Spark Extras > >> > >> Depending on the interest here, we could follow the steps of (Apache > >> Arrow) and start this directly as a TLP, or start as an incubator > project. I > >> would consider the first option first. > >> > >> Who would participate > >> > >> Have thought about this for a bit, and if we go to the direction of > TLP, I > >> would say Spark Committers and Apache Members can request to > participate as > >> PMC members, while other committers can request to become committers. > Non > >> committers would be added based on meritocracy after the start of the > >> project. > >> > >> Project Name > >> > >> It would be ideal if we could have a project name that shows close ties > to > >> Spark (e.g.
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Why would this need to be an ASF project of its own? I don't think it's possible to have yet another separate "Spark Extras" TLP (?) There is already a project to manage these bits of code on Github. How about all of the interested parties manage the code there, under the same process, under the same license, etc.? I'm not against calling it Spark Extras myself, but I wonder if that needlessly confuses the situation. They aren't part of the Spark TLP on purpose, so trying to give it some special middle-ground status might just be confusing. The thing that comes to mind immediately is "Connectors for Apache Spark", spark-connectors, etc.
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
After some collaboration with other community members, we have created an initial draft for Spark Extras, which is available for review at https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing We would like to invite other community members to participate in the project, particularly the Spark committers and PMC (feel free to express interest and I will update the proposal). Another option here is just to give ALL Spark committers write access to "Spark Extras". We also have a couple of asks for the Spark PMC: - Permission to use "Spark Extras" as the project name. We already checked this with Apache Brand Management, and the recommendation was to discuss and reach consensus with the Spark PMC. - We would also like to check whether, in the event of the successful creation of "Spark Extras", the PMC would be willing to continue the development of the remaining connectors that stayed in the Spark 2.0 codebase in the "Spark Extras" project. Thanks in advance, and we welcome any feedback on this proposal before we present it to the Apache Board for consideration. -- Luciano Resende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: SPARK-13843 and future of streaming backends
Are you talking about the group/identifier name, or the contained classes? Because there are plenty of org.apache.* classes distributed via Maven with non-Apache group/identifiers. On Fri, Mar 25, 2016 at 6:54 PM, David Nalley wrote: > >> As far as group / artifact name compatibility, at least in the case of >> Kafka we need different artifact names anyway, and people are going to >> have to make changes to their build files for spark 2.0 anyway. As >> far as keeping the actual classes in org.apache.spark to not break >> code despite the group name being different, I don't know whether that >> would be enforced by maven central, just looked at as poor taste, or >> ASF suing for trademark violation :) > > > Sonatype has strict instructions to only permit org.apache.* to originate > from repository.apache.org. Exceptions to that must be approved by VP, > Infrastructure. > -- > Sent via Pony Mail for dev@spark.apache.org. > View this email online at: > https://pony-poc.apache.org/list.html?dev@spark.apache.org > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org
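[Editor's note: the distinction being discussed can be sketched in build-tool terms. The restriction David describes applies to the Maven groupId an artifact is published under, not to the package names of the classes inside the jar. The coordinates below are invented purely for illustration and are not real published artifacts; trademark and policy questions are a separate matter.]

```scala
// build.sbt fragment -- a sketch with hypothetical coordinates.
// The groupId here ("org.spark-extras-project") is made up; only groupIds
// under org.apache are restricted to releases from repository.apache.org.
libraryDependencies += "org.spark-extras-project" %% "spark-streaming-mqtt" % "2.0.0"

// A jar published this way could, in principle, still contain classes in
// the old package, so existing user code would keep compiling unchanged:
// import org.apache.spark.streaming.mqtt.MQTTUtils
```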
Re: SPARK-13843 and future of streaming backends
On Saturday, March 26, 2016, Sean Owen wrote: > This has been resolved; see the JIRA and related PRs but also > > http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-13843-Next-steps-td16783.html > > This change happened subsequent to the current thread (thanks Marcelo) and could just as well have gone unnoticed until the release vote. > This is not a scenario where a [VOTE] needs to take place, and code > changes don't proceed through PMC votes. From the project perspective, > code was deleted/retired for lack of interest, and this is controlled > by the normal lazy consensus protocol which wasn't vetoed. I have not seen Apache-owned artifacts moved out of its governance without discussion - this was not refactoring or cleanup (as was suggested disingenuously) but a migration of submodules/functionality (though from Reynold's clarification, it looks like it was for good enough reasons). A vote might or might not have been required, but a discussion must happen - at least going forward, it will help us not to miss things (the artifact and project namespace, license, ownership, release cycle, version compatibility, etc. of the sub-project could be of interest to users and developers). Regards Mridul > The subsequent discussion was in part about whether other modules > should go, or whether one should come back, which it did. The latter > suggests that change could have been left open for some discussion > longer. Ideally, you would have commented before the initial change > happened, but it sounds like several people would have liked more > time. I don't think I'd call that "improper conduct" though, no. It > was reversed via the same normal code management process. > > The rest of the question concerned what becomes of the code that was > removed. It was revived outside the project for anyone who cares to > continue collaborating. There seemed to be no disagreement about that, > mostly because the code in question was of minimal interest. PMC > doesn't need to rule on anything.
There may still be some loose ends > there like namespace changes. I'll add to the other thread about this. > > > > On Sat, Mar 26, 2016 at 1:17 PM, Jacek Laskowski > wrote: > > Hi, > > > > Although I'm not that much experienced member of ASF, I share your > > concerns. I haven't looked at the issue from this point of view, but > > after having read the thread I think PMC should've signed off the > > migration of ASF-owned code to a non-ASF repo. At least a vote is > > required (and this discussion is a sign that the process has not been > > conducted properly as people have concerns, me including). > > > > Thanks Mridul! > > > > Pozdrawiam, > > Jacek Laskowski > > > > https://medium.com/@jaceklaskowski/ > > Mastering Apache Spark http://bit.ly/mastering-apache-spark > > Follow me at https://twitter.com/jaceklaskowski > > > > > > On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan > wrote: > >> I am not referring to code edits - but to migrating submodules and > >> code currently in Apache Spark to 'outside' of it. > >> If I understand correctly, assets from Apache Spark are being moved > >> out of it into thirdparty external repositories - not owned by Apache. > >> > >> At a minimum, dev@ discussion (like this one) should be initiated. > >> As PMC is responsible for the project assets (including code), signoff > >> is required for it IMO. > >> > >> More experienced Apache members might be opine better in case I got it > wrong ! > >> > >> > >> Regards, > >> Mridul > >> > >> > >> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger > wrote: > >>> Why would a PMC vote be necessary on every code deletion? > >>> > >>> There was a Jira and pull request discussion about the submodules that > >>> have been removed so far. 
> >>> > >>> https://issues.apache.org/jira/browse/SPARK-13843 > >>> > >>> There's another ongoing one about Kafka specifically > >>> > >>> https://issues.apache.org/jira/browse/SPARK-13877 > >>> > >>> > >>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan > wrote: > > I was not aware of a discussion in Dev list about this - agree with > most of > the observations. > In addition, I did not see PMC signoff on moving (sub-)modules out. > > Regards > Mridul > > > > On Thursday, March 17, 2016, Marcelo Vanzin > wrote: > > > > Hello all, > > > > Recently a lot of the streaming backends were moved to a separate > > project on github and removed from the main Spark repo. > > > > While I think the idea is great, I'm a little worried about the > > execution. Some concerns were already raised on the bug mentioned > > above, but I'd like to have a more explicit discussion about this so > > things don't fall through the cracks. > > > > Mainly I have three concerns. > > > > i. Ownership > > > > That code used to be run by the ASF, but now it's hosted in a github > > repo owned not by the ASF. That sounds a lit
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hi Luciano, I didn't mean Spark proper, but more something like you proposed. Regards JB
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
On Sat, Mar 26, 2016 at 10:20 AM, Jean-Baptiste Onofré wrote: > Hi Luciano, > > If we take the "pure" technical vision, there's pros and cons of having > spark-extra (or whatever the name we give) still as an Apache project: > > Pro: > - Governance & Quality Insurance: we follow the Apache rules, meaning > that a release has to be staged and voted by the PMC. It's a form of > governance of the project and quality (as the releases are reviewed). > - Software origin: users know where the connector comes from, and they > have the guarantee in term of licensing, etc. > - IP/ICLA: We know the committers of this project, and we know they agree > with the ICL agreement. > > Cons: > - Third licenses support. As an Apache project, the "connectors" will be > allowed to use only Apache or Category B licensed dependencies. For > instance, if I would like to create a Spark connector for couchbase, I > can't do it at Apache. > Yes, this is not solving the incompatible license problems > - Release cycle. As an Apache project, it means we have to follow the > rules, meaning that the release cycle can appear strict and long due to the > staging and vote process. For me, it's a huge benefit but some can see as > too strict ;) > IMHO, This is the small price we pay for all the good stuff you mentioned in pro > > Maybe, we can imagine both, as we have in ServiceMix or Camel: > - all modules/connectors matching the Apache rule (especially in term of > licensing) should be in the Apache Spark-Modules (or Spark-Extensions, or > whatever). It's like the ServiceMix Bundles. > If you are talking here about Spark proper, then we are currently seeing that this is going to be hard. If there was a way to have more flexibility to host these directly into Spark proper, I would never be creating this thread as we would have all the pros you mentioned hosting them directly into Spark. 
> - all modules/connectors that can't fit into the Apache rule (due to > licensing issue) can go into GitHub Spark-Extra (or Spark-Package). It's > like the ServiceMix Extra or Camel Extra on github. > > We could look into this, but it might be a "Spark Extra discussion" on how we can help foster a community around the non-compatible licensed connectors. > My $0.01. > > Regards > JB
Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
Hi Luciano, If we take the "pure" technical vision, there are pros and cons to having spark-extra (or whatever name we give it) still as an Apache project: Pro: - Governance & quality assurance: we follow the Apache rules, meaning that a release has to be staged and voted on by the PMC. It's a form of governance of the project and of quality (as the releases are reviewed). - Software origin: users know where the connector comes from, and they have guarantees in terms of licensing, etc. - IP/ICLA: We know the committers of this project, and we know they have agreed to the ICLA. Cons: - Third-party license support. As an Apache project, the "connectors" will be allowed to use only Apache or Category B licensed dependencies. For instance, if I would like to create a Spark connector for Couchbase, I can't do it at Apache. - Release cycle. As an Apache project, it means we have to follow the rules, meaning that the release cycle can appear strict and long due to the staging and vote process. For me, it's a huge benefit, but some can see it as too strict ;) Maybe we can imagine both, as we have in ServiceMix or Camel: - all modules/connectors matching the Apache rules (especially in terms of licensing) should be in the Apache Spark-Modules (or Spark-Extensions, or whatever). It's like the ServiceMix Bundles. - all modules/connectors that can't fit the Apache rules (due to licensing issues) can go into GitHub Spark-Extra (or Spark-Package). It's like the ServiceMix Extra or Camel Extra on GitHub. My $0.01. Regards JB
Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends
I believe some of this has been resolved in the context of some parties that had interest in one extra connector, but we still have a few removed, and as you mentioned, we still don't have a simple way or the willingness to manage and stay current on new packages like kafka. And based on the fact that this thread is still alive, I believe that other community members might have other concerns as well. After some thought, I believe having a separate project (what was mentioned here as Spark Extras) to handle Spark connectors and Spark add-ons in general could be very beneficial to Spark and the overall Spark community, which would have a central place in Apache to collaborate around related Spark components. Some of the benefits of this approach: - Enables maintaining the connectors inside Apache, following the Apache governance and release rules, while allowing Spark proper to focus on the core runtime. - Provides more flexibility in controlling the direction (currency) of the existing connectors (e.g. being willing to find a solution and maintain multiple versions of the same connector, like kafka 0.8.x and 0.9.x) - Becomes a home for other types of Spark-related connectors, helping expand the community around Spark (e.g. Zeppelin sees most of its current contributions around new/enhanced connectors) Some requirements for Spark Extras to be successful: - Be up to date with Spark trunk APIs (based on daily CIs against SNAPSHOT) - Adhere to Spark release cycles (with only a very small window after each Spark release) - Be more open and flexible about the set of connectors it will accept and maintain (e.g. also handle multiple versions, like the kafka 0.9 issue we have today) Where to start Spark Extras: Depending on the interest here, we could follow the steps of Apache Arrow and start this directly as a TLP, or start as an incubator project. I would consider the first option first. Who would participate: Having thought about this for a bit, if we go in the direction of a TLP, I would say Spark committers and Apache members can request to participate as PMC members, while other committers can request to become committers. Non-committers would be added based on meritocracy after the start of the project. Project name: It would be ideal if we could have a project name that shows close ties to Spark (e.g. Spark Extras or Spark Connectors), but we will need permission and support from whoever is going to evaluate the project proposal (e.g. the Apache Board). Thoughts? Does anyone have any big disagreement or objection to moving in this direction? Otherwise, who would be interested in joining the project, so I can start working on some concrete proposal?
Re: SPARK-13843 and future of streaming backends
This has been resolved; see the JIRA and related PRs but also http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-13843-Next-steps-td16783.html This is not a scenario where a [VOTE] needs to take place, and code changes don't proceed through PMC votes. From the project perspective, code was deleted/retired for lack of interest, and this is controlled by the normal lazy consensus protocol which wasn't vetoed. The subsequent discussion was in part about whether other modules should go, or whether one should come back, which it did. The latter suggests that change could have been left open for some discussion longer. Ideally, you would have commented before the initial change happened, but it sounds like several people would have liked more time. I don't think I'd call that "improper conduct" though, no. It was reversed via the same normal code management process. The rest of the question concerned what becomes of the code that was removed. It was revived outside the project for anyone who cares to continue collaborating. There seemed to be no disagreement about that, mostly because the code in question was of minimal interest. PMC doesn't need to rule on anything. There may still be some loose ends there like namespace changes. I'll add to the other thread about this. On Sat, Mar 26, 2016 at 1:17 PM, Jacek Laskowski wrote: > Hi, > > Although I'm not that much experienced member of ASF, I share your > concerns. I haven't looked at the issue from this point of view, but > after having read the thread I think PMC should've signed off the > migration of ASF-owned code to a non-ASF repo. At least a vote is > required (and this discussion is a sign that the process has not been > conducted properly as people have concerns, me including). > > Thanks Mridul! 
> > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan wrote: >> I am not referring to code edits - but to migrating submodules and >> code currently in Apache Spark to 'outside' of it. >> If I understand correctly, assets from Apache Spark are being moved >> out of it into thirdparty external repositories - not owned by Apache. >> >> At a minimum, dev@ discussion (like this one) should be initiated. >> As PMC is responsible for the project assets (including code), signoff >> is required for it IMO. >> >> More experienced Apache members might be opine better in case I got it wrong >> ! >> >> >> Regards, >> Mridul >> >> >> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: >>> Why would a PMC vote be necessary on every code deletion? >>> >>> There was a Jira and pull request discussion about the submodules that >>> have been removed so far. >>> >>> https://issues.apache.org/jira/browse/SPARK-13843 >>> >>> There's another ongoing one about Kafka specifically >>> >>> https://issues.apache.org/jira/browse/SPARK-13877 >>> >>> >>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan >>> wrote: I was not aware of a discussion in Dev list about this - agree with most of the observations. In addition, I did not see PMC signoff on moving (sub-)modules out. Regards Mridul On Thursday, March 17, 2016, Marcelo Vanzin wrote: > > Hello all, > > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. > > While I think the idea is great, I'm a little worried about the > execution. Some concerns were already raised on the bug mentioned > above, but I'd like to have a more explicit discussion about this so > things don't fall through the cracks. > > Mainly I have three concerns. > > i. 
Ownership > > That code used to be run by the ASF, but now it's hosted in a github > repo owned not by the ASF. That sounds a little sub-optimal, if not > problematic. > > ii. Governance > > Similar to the above; who has commit access to the above repos? Will > all the Spark committers, present and future, have commit access to > all of those repos? Are they still going to be considered part of > Spark and have release management done through the Spark community? > > > For both of the questions above, why are they not turned into > sub-projects of Spark and hosted on the ASF repos? I believe there is > a mechanism to do that, without the need to keep the code in the main > Spark repo, right? > > iii. Usability > > This is another thing I don't see discussed. For Scala-based code > things don't change much, I guess, if the artifact names don't change > (another reason to keep things in the ASF?), but what about python? > How are pyspark users expected to get that code going forward, since > it's not in Spa
Re: SPARK-13843 and future of streaming backends
Hi, Although I'm not that much experienced member of ASF, I share your concerns. I haven't looked at the issue from this point of view, but after having read the thread I think PMC should've signed off the migration of ASF-owned code to a non-ASF repo. At least a vote is required (and this discussion is a sign that the process has not been conducted properly as people have concerns, me including). Thanks Mridul! Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan wrote: > I am not referring to code edits - but to migrating submodules and > code currently in Apache Spark to 'outside' of it. > If I understand correctly, assets from Apache Spark are being moved > out of it into thirdparty external repositories - not owned by Apache. > > At a minimum, dev@ discussion (like this one) should be initiated. > As PMC is responsible for the project assets (including code), signoff > is required for it IMO. > > More experienced Apache members might be opine better in case I got it wrong ! > > > Regards, > Mridul > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: >> Why would a PMC vote be necessary on every code deletion? >> >> There was a Jira and pull request discussion about the submodules that >> have been removed so far. >> >> https://issues.apache.org/jira/browse/SPARK-13843 >> >> There's another ongoing one about Kafka specifically >> >> https://issues.apache.org/jira/browse/SPARK-13877 >> >> >> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan >> wrote: >>> >>> I was not aware of a discussion in Dev list about this - agree with most of >>> the observations. >>> In addition, I did not see PMC signoff on moving (sub-)modules out. 
>>> >>> Regards >>> Mridul >>> >>> >>> >>> On Thursday, March 17, 2016, Marcelo Vanzin wrote: Hello all, Recently a lot of the streaming backends were moved to a separate project on github and removed from the main Spark repo. While I think the idea is great, I'm a little worried about the execution. Some concerns were already raised on the bug mentioned above, but I'd like to have a more explicit discussion about this so things don't fall through the cracks. Mainly I have three concerns. i. Ownership That code used to be run by the ASF, but now it's hosted in a github repo owned not by the ASF. That sounds a little sub-optimal, if not problematic. ii. Governance Similar to the above; who has commit access to the above repos? Will all the Spark committers, present and future, have commit access to all of those repos? Are they still going to be considered part of Spark and have release management done through the Spark community? For both of the questions above, why are they not turned into sub-projects of Spark and hosted on the ASF repos? I believe there is a mechanism to do that, without the need to keep the code in the main Spark repo, right? iii. Usability This is another thing I don't see discussed. For Scala-based code things don't change much, I guess, if the artifact names don't change (another reason to keep things in the ASF?), but what about python? How are pyspark users expected to get that code going forward, since it's not in Spark's pyspark.zip anymore? Is there an easy way of keeping these things within the ASF Spark project? I think that would be better for everybody. -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org >>> > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
> As far as group / artifact name compatibility, at least in the case of > Kafka we need different artifact names anyway, and people are going to > have to make changes to their build files for spark 2.0 anyway. As > far as keeping the actual classes in org.apache.spark to not break > code despite the group name being different, I don't know whether that > would be enforced by maven central, just looked at as poor taste, or > ASF suing for trademark violation :) Sonatype has strict instructions to only permit org.apache.* to originate from repository.apache.org. Exceptions to that must be approved by the VP, Infrastructure.
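To make that policy concrete, here is a small illustrative check; the function and its name are hypothetical shorthand for the rule as described in this thread, not any real Sonatype or ASF tooling:

```python
# Hypothetical sketch of the publishing rule described above: artifacts whose
# Maven groupId falls under org.apache.* may only reach Maven Central by
# mirroring from repository.apache.org, not by direct upload.

def may_publish_directly(coordinate: str) -> bool:
    """True if a group:artifact:version coordinate may bypass the ASF repo."""
    group_id = coordinate.split(":", 1)[0]
    return not group_id.startswith("org.apache.")

print(may_publish_directly("org.apache.spark:spark-streaming-kafka_2.10:1.6.1"))  # False
print(may_publish_directly("org.spark-project:spark-streaming-mqtt_2.10:1.6.1"))  # True
```

This is why moving the code out of the ASF while keeping the org.apache.spark groupId is a non-starter: the external project could not publish under that group at all.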
Re: SPARK-13843 and future of streaming backends
Hi Reynold, thanks for the info. On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote: > If one really feels strongly that we should go through all the overhead to > setup an ASF subproject for these modules that won't work with the new > structured streaming, and want to spearhead to setup separate repos > (preferably one subproject per connector), CI, separate JIRA, governance, > READMEs, voting, we can discuss that. Until then, I'd keep the github option > open because IMHO it is what works the best for end users (including > discoverability, issue tracking, release publishing, ...). For those of us who are not exactly familiar with the inner workings of administrating ASF projects, would you mind explaining in more detail what this overhead is? From my naive point of view, when I say "sub project" I assume that it's as simple as having a separate git repo for it, tied to the same parent project. Everything else - JIRA, committers, bylaws, etc - remains the same. And since the projects we're talking about are very small, CI should be very simple (Travis?) and, assuming sporadic releases, things overall should not be that expensive to maintain. -- Marcelo
Re: SPARK-13843 and future of streaming backends
i. An ASF project can clearly decide that some of its code is no longer worth maintaining and delete it. This isn't really any different. It's still apache licensed so ultimately whoever wants the code can get it. ii. I think part of the rationale is to not tie release management to Spark, so it can proceed on a schedule that makes sense. I'm fine with helping out with release management for the Kafka subproject, for instance. I agree that practical governance questions need to be worked out. iii. How is this any different from how python users get access to any other third party Spark package? On Thu, Mar 17, 2016 at 1:14 PM, Marcelo Vanzin wrote: > Hello all, > > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. > > While I think the idea is great, I'm a little worried about the > execution. Some concerns were already raised on the bug mentioned > above, but I'd like to have a more explicit discussion about this so > things don't fall through the cracks. > > Mainly I have three concerns. > > i. Ownership > > That code used to be run by the ASF, but now it's hosted in a github > repo owned not by the ASF. That sounds a little sub-optimal, if not > problematic. > > ii. Governance > > Similar to the above; who has commit access to the above repos? Will > all the Spark committers, present and future, have commit access to > all of those repos? Are they still going to be considered part of > Spark and have release management done through the Spark community? > > > For both of the questions above, why are they not turned into > sub-projects of Spark and hosted on the ASF repos? I believe there is > a mechanism to do that, without the need to keep the code in the main > Spark repo, right? > > iii. Usability > > This is another thing I don't see discussed. 
For Scala-based code > things don't change much, I guess, if the artifact names don't change > (another reason to keep things in the ASF?), but what about python? > How are pyspark users expected to get that code going forward, since > it's not in Spark's pyspark.zip anymore? > > > Is there an easy way of keeping these things within the ASF Spark > project? I think that would be better for everybody. > > -- > Marcelo
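On point (iii), for context: third-party Spark packages are normally pulled in at submit time with the --packages flag. A sketch, with an illustrative coordinate and script name (not a statement of where the removed backends actually ended up):

```shell
# Compose a submit command that resolves a connector by Maven coordinate.
# Coordinate and script name are examples only.
PACKAGES="org.apache.spark:spark-streaming-kafka_2.10:1.6.1"
CMD="spark-submit --packages $PACKAGES my_streaming_job.py"
echo "$CMD"   # in real use you would run the command itself
```

The same flag works for the pyspark shell (pyspark --packages ...), which is roughly how Python users would obtain code that is no longer shipped inside pyspark.zip.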
Re: SPARK-13843 and future of streaming backends
Spark has hit one of the eternal problems of OSS projects, one hit by: ant, maven, hadoop, ... anything with a plugin model. Take in the plugin: you're in control, but also down for maintenance. Leave out the plugin: other people can maintain it, be more agile, etc. But you've lost control, and you can't even manage the links. Here I think maven suffered the most by keeping stuff in codehaus; migrating off there is still hard - not only did they lose the links: they lost the JIRA. Maven's relationship with codehaus was very tightly coupled, lots of committers on both; I don't know how that relationship was handled at a higher level. On 17 Mar 2016, at 20:51, Hari Shreedharan <hshreedha...@cloudera.com> wrote: I have worked with various ASF projects for 4+ years now. Sure, ASF projects can delete code as they feel fit. But this is the first time I have really seen code being "moved out" of a project without discussion. I am sure you can do this without violating ASF policy, but the explanation for that would be convoluted (someone decided to make a copy and then the ASF project deleted it?). +1 for discussion. Dev changes should go to the dev list; PMC for process in general. Don't think the ASF will overlook stuff like that. Might want to raise this issue on the next board report. FWIW, it may be better to just see if you can have committers to work on these projects: recruit the people and say "please, only work in this area - for now". That gets developers on your team, which is generally considered a metric of health in a project. Or, as Cody Koeninger suggests, having a spark-extras project in the ASF with a focus on extras with their own support channel. Also, moving the code out would break compatibility. AFAIK, there is no way to push org.apache.* artifacts directly to maven central. That happens via mirroring from the ASF maven repos.
Even if you could somehow directly push the artifacts to mvn, you really can push to org.apache.* groups only if you are part of the repo and acting as an agent of that project (which in this case would be Apache Spark). Once you move the code out, even a committer/PMC member would not be representing the ASF when pushing the code. I am not sure if there is a way to fix this issue. This topic has cropped up in the general context of third-party repos publishing artifacts with org.apache names but vendor-specific suffixes (e.g. org.apache.hadoop/hadoop-common.5.3-cdh.jar). Some people were pretty unhappy about this, but the conclusion reached was "maven doesn't let you do anything else and still let downstream people use it". Furthermore, as all ASF releases are nominally the source releases *not the binaries*, you can look at the POMs and say "we've released source code designed to publish artifacts to repos - this is 'use as intended'". People are also free to cut their own full project distributions, etc, etc. For example, I stick up the binaries of Windows builds independent of the ASF releases; these were originally just those from HDP on Windows installs; now I check out the commit of the specific ASF release on a Windows 2012 VM, do the build, and copy the binaries. Free for all to use. But I do suspect that the ASF legal protections get a bit blurred here. These aren't ASF binaries, but binaries built directly from unmodified ASF releases. In contrast to sticking stuff into a github repo, the moved artifacts cannot be published as org.apache artifacts on maven central. That's non-negotiable as far as the ASF is concerned. The process for releasing ASF artifacts there goes downstream of the ASF public release process: you stage the artifacts, they are part of the vote process, and everything with org.apache goes through it.
That said: there is nothing to stop a set of shell org.apache artifacts being written which do nothing but contain transitive dependencies on artifacts in different groups, such as org.spark-project. The shells would be released by the ASF; they pull in the new stuff. And, therefore, it'd be possible to build a spark-assembly with the files. (I'm ignoring a loop in the build DAG here; playing with git submodules would let someone eliminate this by adding the removed libraries under a modified project.) I think there might be some issues related to package names; you could make a case for having public APIs with the original names - they're the API, after all, and that's exactly what Apache Harmony did with the java.* packages. Thanks, Hari On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan <mri...@gmail.com> wrote: I am not referring to code edits - but to migrating submodules and code currently in Apache Spark to 'outside' of it. If I understand correctly, assets from Apache Spark are being moved out of it into thirdparty external repositories - not owned by Apache. At a minimum, dev@ discussion (like this one) should be initiated. As PMC is responsible for the project assets (in
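For what it's worth, the "shell artifact" idea above could look roughly like the following POM fragment; every coordinate here is hypothetical, chosen only to illustrate the shape:

```xml
<!-- Hypothetical ASF-released shell artifact: it contains no code of its
     own, only a transitive dependency on the connector now published
     outside the ASF under a different groupId. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-mqtt-shell_2.11</artifactId>
  <version>2.0.0</version>
  <packaging>pom</packaging>
  <dependencies>
    <dependency>
      <!-- the real code, released under a non-ASF group -->
      <groupId>org.spark-project</groupId>
      <artifactId>spark-streaming-mqtt_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
  </dependencies>
</project>
```

Because the shell itself would be voted on and released by the ASF, it could legitimately carry the org.apache groupId while the actual implementation lives elsewhere.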
Re: SPARK-13843 and future of streaming backends
I have worked with various ASF projects for 4+ years now. Sure, ASF projects can delete code as they feel fit. But this is the first time I have really seen code being "moved out" of a project without discussion. I am sure you can do this without violating ASF policy, but the explanation for that would be convoluted (someone decided to make a copy and then the ASF project deleted it?). Also, moving the code out would break compatibility. AFAIK, there is no way to push org.apache.* artifacts directly to maven central. That happens via mirroring from the ASF maven repos. Even if you could somehow directly push the artifacts to mvn, you really can push to org.apache.* groups only if you are part of the repo and acting as an agent of that project (which in this case would be Apache Spark). Once you move the code out, even a committer/PMC member would not be representing the ASF when pushing the code. I am not sure if there is a way to fix this issue. Thanks, Hari On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan wrote: > I am not referring to code edits - but to migrating submodules and > code currently in Apache Spark to 'outside' of it. > If I understand correctly, assets from Apache Spark are being moved > out of it into thirdparty external repositories - not owned by Apache. > > At a minimum, dev@ discussion (like this one) should be initiated. > As PMC is responsible for the project assets (including code), signoff > is required for it IMO. > > More experienced Apache members might be opine better in case I got it > wrong ! > > > Regards, > Mridul > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger > wrote: > > Why would a PMC vote be necessary on every code deletion? > > > > There was a Jira and pull request discussion about the submodules that > > have been removed so far.
> > > > https://issues.apache.org/jira/browse/SPARK-13843 > > > > There's another ongoing one about Kafka specifically > > > > https://issues.apache.org/jira/browse/SPARK-13877 > > > > > > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan > wrote: > >> > >> I was not aware of a discussion in Dev list about this - agree with > most of > >> the observations. > >> In addition, I did not see PMC signoff on moving (sub-)modules out. > >> > >> Regards > >> Mridul > >> > >> > >> > >> On Thursday, March 17, 2016, Marcelo Vanzin > wrote: > >>> > >>> Hello all, > >>> > >>> Recently a lot of the streaming backends were moved to a separate > >>> project on github and removed from the main Spark repo. > >>> > >>> While I think the idea is great, I'm a little worried about the > >>> execution. Some concerns were already raised on the bug mentioned > >>> above, but I'd like to have a more explicit discussion about this so > >>> things don't fall through the cracks. > >>> > >>> Mainly I have three concerns. > >>> > >>> i. Ownership > >>> > >>> That code used to be run by the ASF, but now it's hosted in a github > >>> repo owned not by the ASF. That sounds a little sub-optimal, if not > >>> problematic. > >>> > >>> ii. Governance > >>> > >>> Similar to the above; who has commit access to the above repos? Will > >>> all the Spark committers, present and future, have commit access to > >>> all of those repos? Are they still going to be considered part of > >>> Spark and have release management done through the Spark community? > >>> > >>> > >>> For both of the questions above, why are they not turned into > >>> sub-projects of Spark and hosted on the ASF repos? I believe there is > >>> a mechanism to do that, without the need to keep the code in the main > >>> Spark repo, right? > >>> > >>> iii. Usability > >>> > >>> This is another thing I don't see discussed. 
For Scala-based code > >>> things don't change much, I guess, if the artifact names don't change > >>> (another reason to keep things in the ASF?), but what about python? > >>> How are pyspark users expected to get that code going forward, since > >>> it's not in Spark's pyspark.zip anymore? > >>> > >>> > >>> Is there an easy way of keeping these things within the ASF Spark > >>> project? I think that would be better for everybody. > >>> > >>> -- > >>> Marcelo
Re: SPARK-13843 and future of streaming backends
On Thu, Mar 17, 2016 at 2:55 PM, Cody Koeninger wrote: > Why would a PMC vote be necessary on every code deletion? > Certainly PMC votes are not necessary on *every* code deletion. I don't think there is a very clear rule on when such discussion is warranted, just a soft expectation that committers understand which changes require more discussion before getting merged. I believe the only formal requirement for a PMC vote is when there is a release. But I think as a community we'd much rather deal with these issues ahead of time, rather than having contentious discussions around releases because some are strongly opposed to changes that have already been merged. I'm all for the idea of removing these modules in general (for all of the reasons already mentioned), but it seems that there are important questions about how the new packages get distributed and how they are managed that merit further discussion. I'm somewhat torn on the question of sub-project vs. independent, and how it's governed. I think Steve has summarized the tradeoffs very well. I do want to emphasize, though, that if they are entirely external from the ASF, the artifact ids and the package names must change at the very least.
Re: SPARK-13843 and future of streaming backends
There's a difference between "without discussion" and "without as much discussion as I would have liked to have a chance to notice it". There are plenty of PRs that got merged before I noticed them that I would rather have not gotten merged. As far as group / artifact name compatibility, at least in the case of Kafka we need different artifact names anyway, and people are going to have to make changes to their build files for spark 2.0 anyway. As far as keeping the actual classes in org.apache.spark to not break code despite the group name being different, I don't know whether that would be enforced by maven central, just looked at as poor taste, or ASF suing for trademark violation :) For people who would rather the problem be solved with official asf subprojects, which committers are volunteering to help do that work? Reynold already said he doesn't want to mess with that overhead. I'm fine with continuing to help work on the Kafka integration wherever it ends up, I'd just like the color of the bikeshed to get decided so we can build a decent bike... On Thu, Mar 17, 2016 at 3:51 PM, Hari Shreedharan wrote: > I have worked with various ASF projects for 4+ years now. Sure, ASF projects > can delete code as they feel fit. But this is the first time I have really > seen code being "moved out" of a project without discussion. I am sure you > can do this without violating ASF policy, but the explanation for that would > be convoluted (someone decided to make a copy and then the ASF project > deleted it?). > > Also, moving the code out would break compatibility. AFAIK, there is no way > to push org.apache.* artifacts directly to maven central. That happens via > mirroring from the ASF maven repos. Even if it you could somehow directly > push the artifacts to mvn, you really can push to org.apache.* groups only > if you are part of the repo and acting as an agent of that project (which in > this case would be Apache Spark). 
Once you move the code out, even a > committer/PMC member would not be representing the ASF when pushing the > code. I am not sure if there is a way to fix this issue. > > > Thanks, > Hari > > On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan > wrote: >> >> I am not referring to code edits - but to migrating submodules and >> code currently in Apache Spark to 'outside' of it. >> If I understand correctly, assets from Apache Spark are being moved >> out of it into thirdparty external repositories - not owned by Apache. >> >> At a minimum, dev@ discussion (like this one) should be initiated. >> As PMC is responsible for the project assets (including code), signoff >> is required for it IMO. >> >> More experienced Apache members might be opine better in case I got it >> wrong ! >> >> >> Regards, >> Mridul >> >> >> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger >> wrote: >> > Why would a PMC vote be necessary on every code deletion? >> > >> > There was a Jira and pull request discussion about the submodules that >> > have been removed so far. >> > >> > https://issues.apache.org/jira/browse/SPARK-13843 >> > >> > There's another ongoing one about Kafka specifically >> > >> > https://issues.apache.org/jira/browse/SPARK-13877 >> > >> > >> > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan >> > wrote: >> >> >> >> I was not aware of a discussion in Dev list about this - agree with >> >> most of >> >> the observations. >> >> In addition, I did not see PMC signoff on moving (sub-)modules out. >> >> >> >> Regards >> >> Mridul >> >> >> >> >> >> >> >> On Thursday, March 17, 2016, Marcelo Vanzin >> >> wrote: >> >>> >> >>> Hello all, >> >>> >> >>> Recently a lot of the streaming backends were moved to a separate >> >>> project on github and removed from the main Spark repo. >> >>> >> >>> While I think the idea is great, I'm a little worried about the >> >>> execution. 
Some concerns were already raised on the bug mentioned >> >>> above, but I'd like to have a more explicit discussion about this so >> >>> things don't fall through the cracks. >> >>> >> >>> Mainly I have three concerns. >> >>> >> >>> i. Ownership >> >>> >> >>> That code used to be run by the ASF, but now it's hosted in a github >> >>> repo owned not by the ASF. That sounds a little sub-optimal, if not >> >>> problematic. >> >>> >> >>> ii. Governance >> >>> >> >>> Similar to the above; who has commit access to the above repos? Will >> >>> all the Spark committers, present and future, have commit access to >> >>> all of those repos? Are they still going to be considered part of >> >>> Spark and have release management done through the Spark community? >> >>> >> >>> >> >>> For both of the questions above, why are they not turned into >> >>> sub-projects of Spark and hosted on the ASF repos? I believe there is >> >>> a mechanism to do that, without the need to keep the code in the main >> >>> Spark repo, right? >> >>> >> >>> iii. Usability >> >>> >> >>> This is another thing I don't see discussed. For Scala-based code >> >>> thi
Re: SPARK-13843 and future of streaming backends
On Thu, Mar 17, 2016 at 12:01 PM, Cody Koeninger wrote: > i. An ASF project can clearly decide that some of its code is no > longer worth maintaining and delete it. This isn't really any > different. It's still apache licensed so ultimately whoever wants the > code can get it. Absolutely. But I don't remember this being discussed either way. Was the intention, as you mention later, just to decouple the release of those components from the main Spark release, or to completely disown that code? If the latter, is the ASF ok with it still retaining the current package and artifact names? Changing those would break backwards compatibility. Which is why I believe that keeping them as a sub-project, even if their release cadence is much slower, would be a better solution for both developers and users. > ii. I think part of the rationale is to not tie release management to > Spark, so it can proceed on a schedule that makes sense. I'm fine > with helping out with release management for the Kafka subproject, for > instance. I agree that practical governance questions need to be > worked out. > > iii. How is this any different from how python users get access to > any other third party Spark package? True, but that requires the modules to be published somewhere, not just to live as a bunch of .py files in a github repo. Basically, I'm worried that there's work to be done to keep those modules working in this new environment - how to build, test, and publish things, remove potential uses of internal Spark APIs, just to cite a couple of things. -- Marcelo
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 10:09 AM, Jean-Baptiste Onofré wrote:
> a project can have multiple repos: it's what we have in ServiceMix, in
> Karaf.
> For the *-extra on github, if the code has been in the ASF, the PMC members
> have to vote to move the code on *-extra.

That's good to know. To me that sounds like the best solution.

I've heard that top-level projects have some requirements with regard to having active development, and these components probably will not see that much activity. And top-level does sound like too much bureaucracy for this.

--
Marcelo
SPARK-13843 and future of streaming backends
Hello all,

Recently a lot of the streaming backends were moved to a separate project on github and removed from the main Spark repo.

While I think the idea is great, I'm a little worried about the execution. Some concerns were already raised on the bug mentioned above, but I'd like to have a more explicit discussion about this so things don't fall through the cracks.

Mainly I have three concerns.

i. Ownership

That code used to be run by the ASF, but now it's hosted in a github repo owned not by the ASF. That sounds a little sub-optimal, if not problematic.

ii. Governance

Similar to the above; who has commit access to the above repos? Will all the Spark committers, present and future, have commit access to all of those repos? Are they still going to be considered part of Spark and have release management done through the Spark community?

For both of the questions above, why are they not turned into sub-projects of Spark and hosted on the ASF repos? I believe there is a mechanism to do that, without the need to keep the code in the main Spark repo, right?

iii. Usability

This is another thing I don't see discussed. For Scala-based code things don't change much, I guess, if the artifact names don't change (another reason to keep things in the ASF?), but what about python? How are pyspark users expected to get that code going forward, since it's not in Spark's pyspark.zip anymore?

Is there an easy way of keeping these things within the ASF Spark project? I think that would be better for everybody.

--
Marcelo
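[Editor's note: Marcelo's usability question has a concrete shape for JVM users: an externally hosted connector would presumably be pulled in at submit time as a regular Maven dependency rather than shipping inside Spark. The coordinates below are purely hypothetical -- the new home and group id had not been decided at this point in the thread -- so this is a sketch of the mechanism, not the actual published artifacts.]

```shell
# Hypothetical sketch: pulling an externally published streaming connector
# at submit time via Ivy/Maven resolution, instead of relying on it being
# bundled with Spark. The group id, artifact name, and version here are
# illustrative placeholders only.
./bin/spark-submit \
  --packages org.example.spark-extras:spark-streaming-flume_2.11:1.0.0 \
  my_streaming_app.py
```

Note that this only covers the JVM artifact side; as Marcelo points out, any pure-Python pieces formerly shipped in pyspark.zip would additionally need to be published somewhere installable (e.g. PyPI) for pyspark users.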
Re: SPARK-13843 and future of streaming backends
I was not aware of a discussion in Dev list about this - agree with most of the observations. In addition, I did not see PMC signoff on moving (sub-)modules out. Regards Mridul On Thursday, March 17, 2016, Marcelo Vanzin wrote: > Hello all, > > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. > > While I think the idea is great, I'm a little worried about the > execution. Some concerns were already raised on the bug mentioned > above, but I'd like to have a more explicit discussion about this so > things don't fall through the cracks. > > Mainly I have three concerns. > > i. Ownership > > That code used to be run by the ASF, but now it's hosted in a github > repo owned not by the ASF. That sounds a little sub-optimal, if not > problematic. > > ii. Governance > > Similar to the above; who has commit access to the above repos? Will > all the Spark committers, present and future, have commit access to > all of those repos? Are they still going to be considered part of > Spark and have release management done through the Spark community? > > > For both of the questions above, why are they not turned into > sub-projects of Spark and hosted on the ASF repos? I believe there is > a mechanism to do that, without the need to keep the code in the main > Spark repo, right? > > iii. Usability > > This is another thing I don't see discussed. For Scala-based code > things don't change much, I guess, if the artifact names don't change > (another reason to keep things in the ASF?), but what about python? > How are pyspark users expected to get that code going forward, since > it's not in Spark's pyspark.zip anymore? > > > Is there an easy way of keeping these things within the ASF Spark > project? I think that would be better for everybody. > > -- > Marcelo > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: SPARK-13843 and future of streaming backends
> On Mar 19, 2016, at 8:32 AM, Steve Loughran wrote: > > >> On 18 Mar 2016, at 17:07, Marcelo Vanzin wrote: >> >> Hi Steve, thanks for the write up. >> >> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran >> wrote: >>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs >>> to go through incubation. While normally its the incubator PMC which >>> sponsors/oversees the incubating project, it doesn't have to be the case: >>> the spark project can do it. >>> >>> Also Apache Arrow managed to make it straight to toplevel without that >>> process. Given that the spark extras are already ASF source files, you >>> could try the same thing, add all the existing committers, then look for >>> volunteers to keep things. >> >> Am I to understand from your reply that it's not possible for a single >> project to have multiple repos? >> > > > I don't know. there's generally a 1 project -> 1x issue, 1x JIRA. > > but: hadoop core has 3x JIRA, 1x repo, and one set of write permissions to > that repo, with the special exception of branches (encryption, ipv6) that > have their own committers. > > oh, and I know that hadoop site is on SVN, as are other projects, just to > integrate with asf site publishing, so you can certainly have 1x git + 1 x svn > > ASF won't normally let you have 1 repo with different bits of the tree having > different access rights, so you couldn't open up spark-extras to people with > less permissions/rights than others. > > A separate repo will, separate issue tracking helps you isolate stuff Multiple repositories per project are certainly allowed without incurring the overhead of a subproject; Cordova and CouchDB are two projects that have taken this approach: https://github.com/apache?utf8=✓&query=cordova- https://github.com/apache?utf8=✓&query=couchdb- I believe Cordova also generates independent release artifacts in different cycles (e.g. cordova-ios releases independently from cordova-android). 
If the goal is to enable a divergent set of committers to spark-extras then an independent project makes sense. If you’re just looking to streamline the main repo and decouple some of these other streaming “backends” from the normal release cycle then there are low impact ways to accomplish this inside a single Apache Spark project. Cheers, Adam - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Marcelo Vanzin wrote earlier: > Recently a lot of the streaming backends were moved to a separate > project on github and removed from the main Spark repo. Question: why was the code removed from the Spark repo? What's the harm in keeping it available here? The ASF is perfectly happy if anyone wants to fork our code - that's one of the core tenets of the Apache license. You just can't take the name or trademarks, so you may need to change some package names or the like. So it's fine if some people want to work on the code outside the project. But it's puzzling as to why the Spark PMC shouldn't keep the code in the project as well, even if it might not have the same release cycles or whatnot. - Shane - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Hi Marcelo,

a project can have multiple repos: it's what we have in ServiceMix, in Karaf.

For the *-extra on github, if the code has been in the ASF, the PMC members have to vote to move the code on *-extra.

Regards
JB

On 03/18/2016 06:07 PM, Marcelo Vanzin wrote:
> Hi Steve, thanks for the write up.
>
> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran wrote:
>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs
>> to go through incubation. While normally its the incubator PMC which
>> sponsors/oversees the incubating project, it doesn't have to be the case:
>> the spark project can do it.
>>
>> Also Apache Arrow managed to make it straight to toplevel without that
>> process. Given that the spark extras are already ASF source files, you
>> could try the same thing, add all the existing committers, then look for
>> volunteers to keep things.
>
> Am I to understand from your reply that it's not possible for a single
> project to have multiple repos?

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: SPARK-13843 and future of streaming backends
> On 18 Mar 2016, at 22:24, Marcelo Vanzin wrote:
>
> On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann wrote:
>> So, my comment here is that any code *cannot* be removed from an Apache
>> project if there is a VETO issued which so far I haven't seen, though maybe
>> Marcelo can clarify that.
>
> No, my intention was not to veto the change. I'm actually for the
> removal of components if the community thinks they don't add much to
> the project. (I'm also not sure I can even veto things, not being a
> PMC member.)
>
> I mainly wanted to know what was the path forward for those components
> because, with Cloudera's hat on, we care about one of them (streaming
> integration with flume), and we'd prefer if that code remained under
> the ASF umbrella in some way.

I'd be supportive of a spark-extras project; it'd actually be a place to keep stuff I've worked on:
- the yarn ATS 1/1.5 integration
- that mutant hive JAR which has the consistent kryo dependency and different shadings
- ... etc

There's also the fact that the twitter streaming is a common example to play with, and flume is popular in places too.

If you want to set up a new incubator project with a goal of graduating fast, I'd help. As a key metric of getting out of the incubator is active development, you just need to "recruit" contributors and keep them engaged.
Re: SPARK-13843 and future of streaming backends
> On 18 Mar 2016, at 17:07, Marcelo Vanzin wrote: > > Hi Steve, thanks for the write up. > > On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran > wrote: >> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs >> to go through incubation. While normally its the incubator PMC which >> sponsors/oversees the incubating project, it doesn't have to be the case: >> the spark project can do it. >> >> Also Apache Arrow managed to make it straight to toplevel without that >> process. Given that the spark extras are already ASF source files, you could >> try the same thing, add all the existing committers, then look for >> volunteers to keep things. > > Am I to understand from your reply that it's not possible for a single > project to have multiple repos? > I don't know. there's generally a 1 project -> 1x issue, 1x JIRA. but: hadoop core has 3x JIRA, 1x repo, and one set of write permissions to that repo, with the special exception of branches (encryption, ipv6) that have their own committers. oh, and I know that hadoop site is on SVN, as are other projects, just to integrate with asf site publishing, so you can certainly have 1x git + 1 x svn ASF won't normally let you have 1 repo with different bits of the tree having different access rights, so you couldn't open up spark-extras to people with less permissions/rights than others. A separate repo will, separate issue tracking helps you isolate stuff - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
> On 17 Mar 2016, at 21:33, Marcelo Vanzin wrote:
>
> Hi Reynold, thanks for the info.
>
> On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote:
>> If one really feels strongly that we should go through all the overhead to
>> setup an ASF subproject for these modules that won't work with the new
>> structured streaming, and want to spearhead to setup separate repos
>> (preferably one subproject per connector), CI, separate JIRA, governance,
>> READMEs, voting, we can discuss that. Until then, I'd keep the github option
>> open because IMHO it is what works the best for end users (including
>> discoverability, issue tracking, release publishing, ...).
>
> For those of us who are not exactly familiar with the inner workings
> of administrating ASF projects, would you mind explaining in more
> detail what this overhead is?
>
> From my naive point of view, when I say "sub project" I assume that
> it's as simple as having a separate git repo for it, tied to the same
> parent project. Everything else - JIRA, committers, bylaws, etc -
> remains the same. And since the projects we're talking about are very
> small, CI should be very simple (Travis?) and, assuming sporadic
> releases, things overall should not be that expensive to maintain.

If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs to go through incubation. While normally it's the incubator PMC which sponsors/oversees the incubating project, it doesn't have to be the case: the spark project can do it.

Also Apache Arrow managed to make it straight to toplevel without that process. Given that the spark extras are already ASF source files, you could try the same thing, add all the existing committers, then look for volunteers to keep things.

You'd get:
- a JIRA entry of your own, easy to reassign bugs from SPARK to SPARK-EXTRAS
- a bit of git
- the ability to set up builds on ASF Jenkins. Regression testing against spark nightlies would be invaluable here.
- the ability to stage and publish through ASF Nexus

-Steve
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 7:58 AM, Cody Koeninger wrote:
>> Or, as Cody Koeninger suggests, having a spark-extras project in the ASF
>> with a focus on extras with their own support channel.
>
> To be clear, I didn't suggest that and don't think that's the best
> solution. I said to the people who want things done that way, which
> committer is going to step up and do that organizational work?

I am currently not a committer, but if we are willing to go in the direction of having another project as spark-extras, I can help drive the bureaucratic work to make this a reality.

--
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
Re: SPARK-13843 and future of streaming backends
Code can be removed from an ASF project. That code can live on elsewhere (in accordance with the license). It can't be presented as part of the official ASF project; it's like any other 3rd party project. The package name certainly must change from org.apache.spark. I don't know of a protocol, but common sense dictates a good-faith effort to offer equivalent access to the code (e.g. interested committers should probably be repo owners too).

This differs from "any other code deletion" in that there's an intent to keep working on the code, but outside the project. More discussion -- like this one -- would have been useful beforehand, but nothing's undoable.

Backwards compatibility is not a good reason for things, because we're talking about Spark 2.x, and we're already talking about distributing the code differently.

Is the reason for this change decoupling releases? or changing governance? Seems like the former, but we don't actually need the latter to achieve that. There's an argument for a new repo, but this is not an argument for moving X out of the project per se. I'm sure doing this in the ASF is more overhead, but if changing governance is a non-goal, there's no choice. Convenience can't trump that.

Kafka integration is clearly more important than the others. It seems to need to stay within the project. However this still leaves a packaging problem to solve, which might need a new repo. This is orthogonal.

Here's what I think:

1. Leave the moved modules outside the project entirely (why not Kinesis though? that one was not made clear)
2. Change package names and make sure it's clearly presented as external
3. Add any committers that want to be repo owners as owners
4. Keep Kafka within the project
5. Add some subproject within the current project as needed to accomplish distribution goals

On Thu, Mar 17, 2016 at 6:14 PM, Marcelo Vanzin wrote:
> Hello all,
>
> Recently a lot of the streaming backends were moved to a separate
> project on github and removed from the main Spark repo.
>
> While I think the idea is great, I'm a little worried about the
> execution. Some concerns were already raised on the bug mentioned
> above, but I'd like to have a more explicit discussion about this so
> things don't fall through the cracks.
>
> Mainly I have three concerns.
>
> i. Ownership
>
> That code used to be run by the ASF, but now it's hosted in a github
> repo owned not by the ASF. That sounds a little sub-optimal, if not
> problematic.
>
> ii. Governance
>
> Similar to the above; who has commit access to the above repos? Will
> all the Spark committers, present and future, have commit access to
> all of those repos? Are they still going to be considered part of
> Spark and have release management done through the Spark community?
>
> For both of the questions above, why are they not turned into
> sub-projects of Spark and hosted on the ASF repos? I believe there is
> a mechanism to do that, without the need to keep the code in the main
> Spark repo, right?
>
> iii. Usability
>
> This is another thing I don't see discussed. For Scala-based code
> things don't change much, I guess, if the artifact names don't change
> (another reason to keep things in the ASF?), but what about python?
> How are pyspark users expected to get that code going forward, since
> it's not in Spark's pyspark.zip anymore?
>
> Is there an easy way of keeping these things within the ASF Spark
> project? I think that would be better for everybody.
>
> --
> Marcelo
Re: SPARK-13843 and future of streaming backends
Note the non-kafka bug was filed right before the change was pushed. So there really wasn't any discussion before the decision was made to remove that code. I'm just trying to merge both discussions here in the list where it's a little bit more dynamic than bug updates that end up getting lost in the noise. On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: > Why would a PMC vote be necessary on every code deletion? > > There was a Jira and pull request discussion about the submodules that > have been removed so far. > > https://issues.apache.org/jira/browse/SPARK-13843 > > There's another ongoing one about Kafka specifically > > https://issues.apache.org/jira/browse/SPARK-13877 > > > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan wrote: >> >> I was not aware of a discussion in Dev list about this - agree with most of >> the observations. >> In addition, I did not see PMC signoff on moving (sub-)modules out. >> >> Regards >> Mridul >> >> >> >> On Thursday, March 17, 2016, Marcelo Vanzin wrote: >>> >>> Hello all, >>> >>> Recently a lot of the streaming backends were moved to a separate >>> project on github and removed from the main Spark repo. >>> >>> While I think the idea is great, I'm a little worried about the >>> execution. Some concerns were already raised on the bug mentioned >>> above, but I'd like to have a more explicit discussion about this so >>> things don't fall through the cracks. >>> >>> Mainly I have three concerns. >>> >>> i. Ownership >>> >>> That code used to be run by the ASF, but now it's hosted in a github >>> repo owned not by the ASF. That sounds a little sub-optimal, if not >>> problematic. >>> >>> ii. Governance >>> >>> Similar to the above; who has commit access to the above repos? Will >>> all the Spark committers, present and future, have commit access to >>> all of those repos? Are they still going to be considered part of >>> Spark and have release management done through the Spark community? 
>>> >>> >>> For both of the questions above, why are they not turned into >>> sub-projects of Spark and hosted on the ASF repos? I believe there is >>> a mechanism to do that, without the need to keep the code in the main >>> Spark repo, right? >>> >>> iii. Usability >>> >>> This is another thing I don't see discussed. For Scala-based code >>> things don't change much, I guess, if the artifact names don't change >>> (another reason to keep things in the ASF?), but what about python? >>> How are pyspark users expected to get that code going forward, since >>> it's not in Spark's pyspark.zip anymore? >>> >>> >>> Is there an easy way of keeping these things within the ASF Spark >>> project? I think that would be better for everybody. >>> >>> -- >>> Marcelo >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >>> >> -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Also, just wanted to point out something: On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote: > Thanks for initiating this discussion. I merged the pull request because it > was unblocking another major piece of work for Spark 2.0: not requiring > assembly jars While I do agree that's more important, the streaming assemblies weren't really blocking that work. The fact that there are still streaming assemblies in the build kinda proves that point. :-) I even filed a task to look at getting rid of the streaming assemblies (SPARK-13575; just the assemblies though, not the code) but while working on it found it would be more complicated than expected, and decided against it given that it didn't really affect work on the other assemblies. -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Anyone can fork apache licensed code. Committers can approve pull requests that delete code from asf repos. Because those two things happen near each other in time, it's somehow a process violation? I think the discussion would be better served by concentrating on how we're going to solve the problem and move forward. On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan wrote: > I am not referring to code edits - but to migrating submodules and > code currently in Apache Spark to 'outside' of it. > If I understand correctly, assets from Apache Spark are being moved > out of it into thirdparty external repositories - not owned by Apache. > > At a minimum, dev@ discussion (like this one) should be initiated. > As PMC is responsible for the project assets (including code), signoff > is required for it IMO. > > More experienced Apache members might be opine better in case I got it wrong ! > > > Regards, > Mridul > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: >> Why would a PMC vote be necessary on every code deletion? >> >> There was a Jira and pull request discussion about the submodules that >> have been removed so far. >> >> https://issues.apache.org/jira/browse/SPARK-13843 >> >> There's another ongoing one about Kafka specifically >> >> https://issues.apache.org/jira/browse/SPARK-13877 >> >> >> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan >> wrote: >>> >>> I was not aware of a discussion in Dev list about this - agree with most of >>> the observations. >>> In addition, I did not see PMC signoff on moving (sub-)modules out. >>> >>> Regards >>> Mridul >>> >>> >>> >>> On Thursday, March 17, 2016, Marcelo Vanzin wrote: Hello all, Recently a lot of the streaming backends were moved to a separate project on github and removed from the main Spark repo. While I think the idea is great, I'm a little worried about the execution. 
Some concerns were already raised on the bug mentioned above, but I'd like to have a more explicit discussion about this so things don't fall through the cracks. Mainly I have three concerns. i. Ownership That code used to be run by the ASF, but now it's hosted in a github repo owned not by the ASF. That sounds a little sub-optimal, if not problematic. ii. Governance Similar to the above; who has commit access to the above repos? Will all the Spark committers, present and future, have commit access to all of those repos? Are they still going to be considered part of Spark and have release management done through the Spark community? For both of the questions above, why are they not turned into sub-projects of Spark and hosted on the ASF repos? I believe there is a mechanism to do that, without the need to keep the code in the main Spark repo, right? iii. Usability This is another thing I don't see discussed. For Scala-based code things don't change much, I guess, if the artifact names don't change (another reason to keep things in the ASF?), but what about python? How are pyspark users expected to get that code going forward, since it's not in Spark's pyspark.zip anymore? Is there an easy way of keeping these things within the ASF Spark project? I think that would be better for everybody. -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org >>> - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: SPARK-13843 and future of streaming backends
Thanks for initiating this discussion.

I merged the pull request because it was unblocking another major piece of work for Spark 2.0: not requiring assembly jars, which is arguably a lot more important than sources that are less frequently used. I take full responsibility for that.

I think it's inaccurate to call them "backends" because it makes these things sound a lot more serious, when in reality they are a bunch of connectors to less frequently used streaming data sources (e.g. mqtt, flume). But that's not that important here.

Another important factor is that over time, with the development of structured streaming, we'd provide a new API for streaming sources that unifies the way to connect arbitrary sources, and as a result all of these sources need to be rewritten anyway. This is similar to the RDD -> DataFrame transition for data sources, which, although initially painful, in the long run provides a much better experience for end-users, because they only need to learn a single API for all sources and it becomes trivial to transition from one source to another without actually impacting business logic. So the truth is that in the long run, the existing connectors will be replaced by new ones, and they have been causing minor issues here and there in the code base.

Now issues like these are never black and white. By moving them out, we'd require users to at least change the maven coordinate in their build file (although things can still be made binary and source compatible). So I made the call and asked the contributor to keep Kafka and Kinesis in, because those are the most widely used (and could be more contentious), and move everything else out.

I have personally done enough data sources or 3rd party packages for Spark on github that I can set up a github repo with CI and maven publishing in just under an hour. I do not expect a lot of changes to these packages because the APIs have been fairly stable.
So the thing I was optimizing for was to minimize the time we need to spend on these packages, given the (expected) low activity and the shift in focus to structured streaming, and also to minimize the chance of breaking user apps, to provide the best user experience. A github repo seems the simplest choice to me.

I also made another decision to provide separate repos (and thus issue trackers) on github for these packages. The reason is that these connectors have very disjoint communities. For example, the community that cares about mqtt is likely very different from the community that cares about akka. It is much easier to track all of these.

Logistics wise -- things are still in flux. I think it'd make a lot of sense to give existing Spark committers (or at least the ones that have contributed to streaming) write access to the github repos. IMHO, it is not in any of the major Spark contributing organizations' strategic interest to "own" these projects, especially considering most of the activity will switch to structured streaming.

If one really feels strongly that we should go through all the overhead to setup an ASF subproject for these modules that won't work with the new structured streaming, and want to spearhead to setup separate repos (preferably one subproject per connector), CI, separate JIRA, governance, READMEs, voting, we can discuss that. Until then, I'd keep the github option open because IMHO it is what works the best for end users (including discoverability, issue tracking, release publishing, ...).

On Thu, Mar 17, 2016 at 1:50 PM, Cody Koeninger wrote:
> Anyone can fork apache licensed code. Committers can approve pull
> requests that delete code from asf repos. Because those two things
> happen near each other in time, it's somehow a process violation?
>
> I think the discussion would be better served by concentrating on how
> we're going to solve the problem and move forward.
> > On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan > wrote: > > I am not referring to code edits - but to migrating submodules and > > code currently in Apache Spark to 'outside' of it. > > If I understand correctly, assets from Apache Spark are being moved > > out of it into thirdparty external repositories - not owned by Apache. > > > > At a minimum, dev@ discussion (like this one) should be initiated. > > As PMC is responsible for the project assets (including code), signoff > > is required for it IMO. > > > > More experienced Apache members might be opine better in case I got it > wrong ! > > > > > > Regards, > > Mridul > > > > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger > wrote: > >> Why would a PMC vote be necessary on every code deletion? > >> > >> There was a Jira and pull request discussion about the submodules that > >> have been removed so far. > >> > >> https://issues.apache.org/jira/browse/SPARK-13843 > >> > >> There's another ongoing one about Kafka specifically > >> > >> https://issues.apache.org/jira/browse/SPARK-13877 > >> > >> > >> On Thu, Mar 17, 2016 at 2:49 PM, Mr
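[Editor's note: to make Reynold's point about migration cost concrete, for a JVM user the visible change would be roughly a one-line edit to the build file, swapping the maven coordinate. The "after" group id and version below are hypothetical placeholders, since at this point in the thread the new home for the connectors had not been settled; this is a sketch of the shape of the change, not the actual published coordinates.]

```scala
// build.sbt -- sketch of the coordinate change Reynold describes.
// "Before" is the connector as released with Spark 1.6; the "after"
// group id and version are hypothetical placeholders.

// Before: connector released as part of Apache Spark
libraryDependencies += "org.apache.spark" %% "spark-streaming-mqtt" % "1.6.1"

// After: same connector published from an external repo under a new group id
libraryDependencies += "org.example.spark-extras" %% "spark-streaming-mqtt" % "1.0.0"
```

If the package names inside the jar stay the same, user code would compile unchanged; only the resolution coordinate moves, which is the "binary and source compatible" part of Reynold's argument.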
Re: SPARK-13843 and future of streaming backends
I am not referring to code edits - but to migrating submodules and code currently in Apache Spark to 'outside' of it. If I understand correctly, assets from Apache Spark are being moved out of it into thirdparty external repositories - not owned by Apache. At a minimum, dev@ discussion (like this one) should be initiated. As the PMC is responsible for the project assets (including code), signoff is required for it IMO. More experienced Apache members might opine better in case I got it wrong! Regards, Mridul On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger wrote: > Why would a PMC vote be necessary on every code deletion? > > There was a Jira and pull request discussion about the submodules that > have been removed so far. > > https://issues.apache.org/jira/browse/SPARK-13843 > > There's another ongoing one about Kafka specifically > > https://issues.apache.org/jira/browse/SPARK-13877 > > > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan wrote: >> >> I was not aware of a discussion in Dev list about this - agree with most of >> the observations. >> In addition, I did not see PMC signoff on moving (sub-)modules out. >> >> Regards >> Mridul >> >> >> >> On Thursday, March 17, 2016, Marcelo Vanzin wrote: >>> >>> Hello all, >>> >>> Recently a lot of the streaming backends were moved to a separate >>> project on github and removed from the main Spark repo. >>> >>> While I think the idea is great, I'm a little worried about the >>> execution. Some concerns were already raised on the bug mentioned >>> above, but I'd like to have a more explicit discussion about this so >>> things don't fall through the cracks. >>> >>> Mainly I have three concerns. >>> >>> i. Ownership >>> >>> That code used to be run by the ASF, but now it's hosted in a github >>> repo owned not by the ASF. That sounds a little sub-optimal, if not >>> problematic. >>> >>> ii. Governance >>> >>> Similar to the above; who has commit access to the above repos?
Will >>> all the Spark committers, present and future, have commit access to >>> all of those repos? Are they still going to be considered part of >>> Spark and have release management done through the Spark community? >>> >>> >>> For both of the questions above, why are they not turned into >>> sub-projects of Spark and hosted on the ASF repos? I believe there is >>> a mechanism to do that, without the need to keep the code in the main >>> Spark repo, right? >>> >>> iii. Usability >>> >>> This is another thing I don't see discussed. For Scala-based code >>> things don't change much, I guess, if the artifact names don't change >>> (another reason to keep things in the ASF?), but what about python? >>> How are pyspark users expected to get that code going forward, since >>> it's not in Spark's pyspark.zip anymore? >>> >>> >>> Is there an easy way of keeping these things within the ASF Spark >>> project? I think that would be better for everybody. >>> >>> -- >>> Marcelo >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >>>
Re: SPARK-13843 and future of streaming backends
So, my comment here is that code *cannot* be removed from an Apache project if a VETO has been issued, which so far I haven't seen, though maybe Marcelo can clarify that. However, if a VETO was issued, then the code cannot be removed and must be put back. Anyone can fork anything; our license allows that. But the community itself must steward the code, and part of that is hearing everyone's voice within that community before acting. Cheers, Chris
Re: SPARK-13843 and future of streaming backends
If the intention is to actually decouple and give a life of its own to these connectors, I would have expected that they would still be hosted as different git repositories inside Apache, even though users will not really see much difference, as they would still be mirrored on GitHub. This makes it much easier on the legal departments of the upstream consumers and customers as well, because the code still follows the well-received and trusted Apache Governance and Apache Release Policies. As for implementation details, we can have multiple repositories if we see a lot of fragmented releases, or a single "connectors" repository, which on our side would make administration easier. On Thu, Mar 17, 2016 at 2:33 PM, Marcelo Vanzin wrote: > Hi Reynold, thanks for the info. > > On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin wrote: > > If one really feels strongly that we should go through all the overhead > to > > setup an ASF subproject for these modules that won't work with the new > > structured streaming, and want to spearhead to setup separate repos > > (preferably one subproject per connector), CI, separate JIRA, governance, > > READMEs, voting, we can discuss that. Until then, I'd keep the github > option > > open because IMHO it is what works the best for end users (including > > discoverability, issue tracking, release publishing, ...). > Agree that there might be a little overhead, but there are ways to minimize this, and I am sure there are volunteers willing to help in favor of having a more unifying project. Breaking things into multiple projects, and having to manage the matrix of supported versions, would be a far worse overhead. > > For those of us who are not exactly familiar with the inner workings > of administrating ASF projects, would you mind explaining in more > detail what this overhead is? > > From my naive point of view, when I say "sub project" I assume that > it's as simple as having a separate git repo for it, tied to the same > parent project.
Everything else - JIRA, committers, bylaws, etc - > remains the same. And since the projects we're talking about are very > small, CI should be very simple (Travis?) and, assuming sporadic > releases, things overall should not be that expensive to maintain. > > Subprojects, or even sending this back to the incubator as a "connectors" project, would be better than one public GitHub repo per package, in my opinion. Now, if this move is signaling to customers that the 1.x Streaming API is going away in favor of the new structured streaming APIs, then I guess this is a completely different discussion. -- Luciano Resende http://people.apache.org/~lresende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: SPARK-13843 and future of streaming backends
Hi Steve, thanks for the write-up. On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran wrote: > If you want a separate project, e.g. SPARK-EXTRAS, then it *generally* needs > to go through incubation. While normally it's the incubator PMC which > sponsors/oversees the incubating project, it doesn't have to be the case: the > spark project can do it. > > Also Apache Arrow managed to make it straight to toplevel without that > process. Given that the spark extras are already ASF source files, you could > try the same thing, add all the existing committers, then look for volunteers > to keep things. Am I to understand from your reply that it's not possible for a single project to have multiple repos? -- Marcelo
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 3:15 PM, Shane Curcuru wrote: > Question: why was the code removed from the Spark repo? What's the harm > in keeping it available here? Assuming the Spark PMC has no plans to release the code, why would we keep it in our codebase? It only makes the codebase harder to navigate, and it would be easy for someone to stumble on that code and expect it to be part of the release. It seems like a general code-hygiene practice, e.g. just like not leaving a giant commented-out block of old code. (But as you can see from the rest of the thread, I think the discussion on whether it should still be part of Apache Spark is ongoing...)
Re: SPARK-13843 and future of streaming backends
Hi Marcelo, I quickly discussed this with Reynold this morning. I share your concerns. I fully understand that it's painful for users to wait for a Spark release to include a fix in a streaming backend, as it's not really related. It makes sense to provide backends "outside" of the ASF, especially for legal issues: it's what we do at Camel with Camel-Extra. Don't you think it could be interesting to have another ASF git repo dedicated to streaming backends, where each backend can manage its release cycle following the ASF "rules" (staging, vote, ...)? Regards JB On 03/17/2016 07:14 PM, Marcelo Vanzin wrote: > [...] -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: SPARK-13843 and future of streaming backends
Hi Marcelo, Thanks for your reply. As a committer on the project, you *can* VETO code. For sure. Unfortunately you don’t have a binding vote on adding new PMC members/committers, and/or on releasing the software, but do have the ability to VETO. That said, if that’s not your intent, sorry for misreading your intent. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++ -Original Message- From: Marcelo Vanzin Date: Friday, March 18, 2016 at 3:24 PM To: jpluser Cc: "dev@spark.apache.org" Subject: Re: SPARK-13843 and future of streaming backends >On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann >wrote: >> So, my comment here is that any code *cannot* be removed from an Apache >> project if there is a VETO issued which so far I haven't seen, though >>maybe >> Marcelo can clarify that. > >No, my intention was not to veto the change. I'm actually for the >removal of components if the community thinks they don't add much to >the project. (I'm also not sure I can even veto things, not being a >PMC member.) > >I mainly wanted to know what was the path forward for those components >because, with Cloudera's hat on, we care about one of them (streaming >integration with flume), and we'd prefer if that code remained under >the ASF umbrella in some way. > >-- >Marcelo
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 10:07 AM, Marcelo Vanzin wrote: > Hi Steve, thanks for the write-up. > > On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran > wrote: > > If you want a separate project, e.g. SPARK-EXTRAS, then it *generally* > needs to go through incubation. While normally it's the incubator PMC which > sponsors/oversees the incubating project, it doesn't have to be the case: > the spark project can do it. > > > > Also Apache Arrow managed to make it straight to toplevel without that > process. Given that the spark extras are already ASF source files, you > could try the same thing, add all the existing committers, then look for > volunteers to keep things. > > Am I to understand from your reply that it's not possible for a single > project to have multiple repos? > > It can have multiple repos, but this still brings maintenance overhead onto the PMC, which was brought up previously on this thread, and it might not be the direction the PMC wants to take (but I might be mistaken). Another approach is to make this extras project just a subproject, with its own set of committers etc., which puts less burden on the Spark PMC. Anyway, my main issue here is not who manages it or how, but that it continues under Apache governance. -- Luciano Resende http://people.apache.org/~lresende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: SPARK-13843 and future of streaming backends
Why would a PMC vote be necessary on every code deletion? There was a Jira and pull request discussion about the submodules that have been removed so far. https://issues.apache.org/jira/browse/SPARK-13843 There's another ongoing one about Kafka specifically https://issues.apache.org/jira/browse/SPARK-13877 On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan wrote: > > I was not aware of a discussion in Dev list about this - agree with most of > the observations. > In addition, I did not see PMC signoff on moving (sub-)modules out. > > Regards > Mridul > > > > On Thursday, March 17, 2016, Marcelo Vanzin wrote: >> >> Hello all, >> >> Recently a lot of the streaming backends were moved to a separate >> project on github and removed from the main Spark repo. >> >> While I think the idea is great, I'm a little worried about the >> execution. Some concerns were already raised on the bug mentioned >> above, but I'd like to have a more explicit discussion about this so >> things don't fall through the cracks. >> >> Mainly I have three concerns. >> >> i. Ownership >> >> That code used to be run by the ASF, but now it's hosted in a github >> repo owned not by the ASF. That sounds a little sub-optimal, if not >> problematic. >> >> ii. Governance >> >> Similar to the above; who has commit access to the above repos? Will >> all the Spark committers, present and future, have commit access to >> all of those repos? Are they still going to be considered part of >> Spark and have release management done through the Spark community? >> >> >> For both of the questions above, why are they not turned into >> sub-projects of Spark and hosted on the ASF repos? I believe there is >> a mechanism to do that, without the need to keep the code in the main >> Spark repo, right? >> >> iii. Usability >> >> This is another thing I don't see discussed. 
For Scala-based code >> things don't change much, I guess, if the artifact names don't change >> (another reason to keep things in the ASF?), but what about python? >> How are pyspark users expected to get that code going forward, since >> it's not in Spark's pyspark.zip anymore? >> >> >> Is there an easy way of keeping these things within the ASF Spark >> project? I think that would be better for everybody. >> >> -- >> Marcelo >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> >
Re: SPARK-13843 and future of streaming backends
> Or, as Cody Koeninger suggests, having a spark-extras project in the ASF with > a focus on extras with their own support channel. To be clear, I didn't suggest that and don't think that's the best solution. I said, to the people who want things done that way: which committer is going to step up and do that organizational work? I think there are advantages to moving everything currently in extras/ and external/ out of the spark project, but the current Kafka packaging issue can be solved straightforwardly by just adding another artifact and code tree under external/. On Fri, Mar 18, 2016 at 5:04 AM, Steve Loughran wrote: > > Spark has hit one of the eternal problems of OSS projects, one hit by: ant, > maven, hadoop, ... anything with a plugin model. > > Take in the plugin: you're in control, but also down for maintenance. > > Leave out the plugin: other people can maintain it, be more agile, etc. > > But you've lost control, and you can't even manage the links. Here I think > maven suffered the most by keeping stuff in codehaus; migrating off there is > still hard: not only did they lose the links, they lost the JIRA. > > Maven's relationship with codehaus was very tightly coupled, lots of > committers on both; I don't know how that relationship was handled at a > higher level. > > > On 17 Mar 2016, at 20:51, Hari Shreedharan > wrote: > > I have worked with various ASF projects for 4+ years now. Sure, ASF projects > can delete code as they see fit. But this is the first time I have really > seen code being "moved out" of a project without discussion. I am sure you > can do this without violating ASF policy, but the explanation for that would > be convoluted (someone decided to make a copy and then the ASF project > deleted it?). > > > +1 for discussion. Dev changes should go to the dev list; PMC for process in > general. Don't think the ASF will overlook stuff like that.
> > Might want to raise this issue on the next board report > > > FWIW, it may be better to just see if you can have committers to work on > these projects: recruit the people and say 'please, only work in this area, > for now'. That gets developers on your team, which is generally considered > a metric of health in a project. > > Or, as Cody Koeninger suggests, having a spark-extras project in the ASF with > a focus on extras with their own support channel. > > > Also, moving the code out would break compatibility. AFAIK, there is no way > to push org.apache.* artifacts directly to maven central. That happens via > mirroring from the ASF maven repos. Even if you could somehow directly > push the artifacts to mvn, you really can push to org.apache.* groups only > if you are part of the repo and acting as an agent of that project (which in > this case would be Apache Spark). Once you move the code out, even a > committer/PMC member would not be representing the ASF when pushing the > code. I am not sure if there is a way to fix this issue. > > > > > This topic has cropped up in the general context of third-party repos > publishing artifacts with org.apache names but vendor-specific suffixes (e.g. > org.apache.hadoop/hadoop-common.5.3-cdh.jar). > > Some people were pretty unhappy about this, but the conclusion reached was > "maven doesn't let you do anything else and still let downstream people use > it". Furthermore, as all ASF releases are nominally the source releases, *not > the binaries*, you can look at the POMs and say "we've released source code > designed to publish artifacts to repos; this is 'use as intended'". > > People are also free to cut their own full project distributions, etc, etc. > For example, I stick up the binaries of Windows builds independent of the > ASF releases; these were originally just those from HDP on Windows installs; > now I check out the commit of the specific ASF release on a Windows 2012 VM, > do the build, copy the binaries.
Free for all to use. But I do suspect that > the ASF legal protections get a bit blurred here. These aren't ASF binaries, > but binaries built directly from unmodified ASF releases. > > In contrast to sticking stuff into a github repo, the moved artifacts cannot > be published as org.apache artifacts on maven central. That's non-negotiable > as far as the ASF is concerned. The process for releasing ASF artifacts > there goes downstream of the ASF public release process: you stage the > artifacts, they are part of the vote process, everything with org.apache > goes through it. > > That said: there is nothing to stop a set of shell org.apache artifacts > being written which do nothing but contain transitive dependencies on > artifacts in different groups, such as org.spark-project. The shells would > be released by the ASF; they pull in the new stuff. And, therefore, it'd be > possible to build a spark-assembly with the files. (I'm ignoring a loop in > the build DAG here; playing with git submodules would let someone eliminate > this by adding the removed libraries under a modified proje
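The "shell artifact" idea Steve describes above can be sketched as a minimal Maven POM. This is only an illustration of the mechanism, not an existing artifact: the group IDs, artifact IDs, and versions below are all hypothetical assumptions.

```xml
<!-- Hypothetical "shell" POM that would be voted on and released by the ASF.
     It ships no code of its own (packaging is "pom"); it exists solely to
     pull in an externally hosted connector as a transitive dependency. -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.spark</groupId>
  <!-- illustrative name, not a real artifact -->
  <artifactId>spark-streaming-flume-shell_2.11</artifactId>
  <version>2.0.0</version>
  <packaging>pom</packaging>

  <dependencies>
    <!-- The actual code lives under a non-ASF group (illustrative coordinates) -->
    <dependency>
      <groupId>org.spark-project</groupId>
      <artifactId>spark-streaming-flume_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
  </dependencies>
</project>
```

A downstream build depending on the org.apache shell would transitively resolve the external artifact, which is what would make a spark-assembly build possible without the code living in an ASF repo.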
Re: SPARK-13843 and future of streaming backends
On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann wrote: > So, my comment here is that any code *cannot* be removed from an Apache > project if there is a VETO issued which so far I haven't seen, though maybe > Marcelo can clarify that. No, my intention was not to veto the change. I'm actually for the removal of components if the community thinks they don't add much to the project. (I'm also not sure I can even veto things, not being a PMC member.) I mainly wanted to know what was the path forward for those components because, with Cloudera's hat on, we care about one of them (streaming integration with flume), and we'd prefer if that code remained under the ASF umbrella in some way. -- Marcelo