Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Bobby Evans Wed, 26 Feb 2014 13:26:20 -0800

I totally agree and I am +1 on bringing these spout/trident pieces in, assuming 
there are committers to support them.


I am also curious about how people feel about pulling in other projects like 
storm-starter, storm-deploy, storm-mesos, and storm-yarn?

Storm-starter in my opinion seems more like documentation and it would be nice 
to pull in so that it stays up to date with storm itself, just like the 
documentation.

The others are more ways to run storm in different environments.  They seem 
like there could be a lot of coupling between them and storm as storm evolves, 
and they kind of fit with "integrate storm with *Technology X*" except X in 
this case is a compute environment instead of a data source or store. But then 
again we also just shot down a request to create juju charms for storm.

—Bobby

From: "P. Taylor Goetz" <[email protected]>
Reply-To: <[email protected]>
Date: Wednesday, February 26, 2014 at 1:21 PM
To: <[email protected]>
Cc: <[email protected]>
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Thanks for the feedback Bobby.

To clarify, I’m mainly talking about spout/bolt/trident state implementations 
that integrate storm with *Technology X*, where *Technology X* is not a 
fundamental part of storm.

Examples would be technologies that are part of or related to the Hadoop/Big 
Data ecosystem and enable the Lambda Architecture, e.g.: Kafka, HDFS, HBase, 
Cassandra, etc.

The idea behind having one or more Storm committers act as a “sponsor” is to 
make sure new additions are done carefully and with good reason. To add a new 
module, it would require committer/PPMC consensus, and assignment of one or 
more sponsors. Part of a sponsor’s job would be to ensure that a module is 
maintained, which would require enough familiarity with the code to support it 
long term. If a new module was proposed, but no committers were willing to act 
as a sponsor, it would not be added.

It would be the Committers’/PPMC’s responsibility to make sure things didn’t get 
out of hand, and to do something about it if they did.

Here’s an old Hadoop JIRA thread [1] discussing the addition of Hive as a 
contrib module, similar to what happened with HBase as Bobby pointed out. Some 
interesting points are brought up. The difference here is that both HBase and 
Hive were pretty big codebases relative to Hadoop. With spout/bolt/state 
implementations I doubt we’d see anything along that scale.

- Taylor

[1] https://issues.apache.org/jira/browse/HADOOP-3601


On Feb 26, 2014, at 12:35 PM, Bobby Evans <[email protected]> wrote:

I can see a lot of value in having a distribution of storm that comes with 
batteries included, everything is tested together and you know it works.  But I 
don’t see much long term developer benefit in building them all together.  If 
there is strong coupling between storm and these external projects so that they 
break when storm changes then we need to understand the coupling and decide if 
we want to reduce that coupling by stabilizing APIs, improving version 
numbering and release process, etc.; or if the functionality is something that 
should be offered as a base service in storm.

I can see politically the value of giving these other projects a home in 
Apache, and making them sub-projects is the simplest route to that.  I’d love 
to have storm on yarn inside Apache.  I just don’t want to go overboard with 
it.  There was a time when HBase was a “contrib” module under Hadoop along with 
a lot of other things, and the Apache board came and told Hadoop to break it up.

Bringing storm-kafka into storm does not sound like it will solve much from a 
developer’s perspective, because there is at least as much coupling with kafka 
as there is with storm.  I can see how it is a huge amount of overhead and pain 
to set up a new project just for a few hundred lines of code; as such, I am in 
favor of pulling in closely related projects, especially those that are spouts 
and state implementations. I just want to be sure that we do it carefully, with 
a good reason, and with enough people who are familiar with the code to support 
it long term.

If it starts to look like we are pulling in too many projects perhaps we should 
look at something more like the bigtop project  https://bigtop.apache.org/ 
which produces a tested distribution of Hadoop with many different sub-projects 
included in it.

I am also a bit concerned about these sub-projects becoming second class 
citizens, where we break something, but because the build is off by default we 
don’t know it.  I would prefer that they are built and tested by default.  If 
the build and test time starts to take too long, to me that means we need to 
start wondering if we have too many contrib modules.

—Bobby

From: Brian Enochson <[email protected]>
Reply-To: <[email protected]>
Date: Tuesday, February 25, 2014 at 9:50 PM
To: <[email protected]>
Cc: <[email protected]>
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache

hi,
  I am in agreement with Taylor and believe I understand his intent. An 
incredible tool/framework/application like Storm is only enhanced and gains 
value from the number of well maintained and vetted modules that can be used 
for integration and adding further functionality.
 I am relatively new to the Storm community but have spent quite some time 
reviewing contributing modules out there, reviewing various duplicates and 
running into some version incompatibilities. I understand the need to keep 
Storm itself pure, but do think there needs to be some structure and governance 
added to the contributing modules. Look at the benefit a tool like npm brings 
to the node community.
 I like the idea of sponsorship, vetting and a community vote.  I, as I am sure 
many would be, am willing to offer support and time to working through how to 
set this up and helping with the implementation if it is decided to pursue some 
solution.
 I hope these views are taken in the spirit they are made, to make this 
incredible system even better along with the surrounding eco-system.

Thanks,
Brian


On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <[email protected]> wrote:
Just to be clear (and play a little Devil’s advocate :) ), I’m not suggesting 
that whatever a “contrib” project/module/subproject might  become, be a 
clearinghouse for anything Storm-related.

I see it as something that is well-vetted by the Storm community, subject to 
PPMC review, vote, etc. Entry would require community review, PPMC review, and 
in some cases ASF IP clearance/legal review. Anything added would require some 
level of commitment from the PPMC/committers to provide some level of support.

In other words, nothing “willy-nilly”.

One option could be that any module added require (X > 0)  number of committers 
to volunteer as “sponsor”s for the module, and commit to maintaining it.

That being said, I don’t see storm-kafka being any different from anything else 
that provides integration points for Storm.

-Taylor


On Feb 25, 2014, at 7:53 PM, Nathan Marz <[email protected]> wrote:

I'm only +1 for pulling in storm-kafka and updating it. Other projects put 
these contrib modules in a "contrib" folder and keep them managed as completely 
separate codebases. As it's not actually a "module" necessary for Storm, 
there's an argument there for doing it that way rather than via the 
multi-module route.


On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <[email protected]> wrote:
Hi Taylor,

I'm +1 for pulling these external libraries into the Apache codebase. This
will certainly benefit the Storm community. I would also like to contribute to
this process.

Thanks
Milinda

On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <[email protected]> wrote:
A while back I opened STORM-206 [1] to capture ideas for pulling in
"contrib" modules to the Apache codebase.

In the past, we had the storm-contrib github project [2] which subsequently
got broken up into individual projects hosted on the stormprocessor github
group [3] and elsewhere.

The problem with this approach is that in certain cases it led to code rot
(modules not being updated in step with Storm's API), fragmentation
(multiple similar modules with the same name), and confusion.

A good example of this is the storm-kafka module [4], since it is a widely
used component. Because storm-contrib wasn't being tagged in github, a lot
of users had trouble reconciling with which versions of storm it was
compatible. Some users built off specific commit hashes, some forked, and a
few even pushed custom builds to repositories such as clojars. With kafka
0.8 now available, there are two main storm-kafka projects, the original
(compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
0.8).

My intention is not to find fault in any way, but rather to point out the
resulting pain, and work toward a better solution.

I think it would be beneficial to the Storm user community to have certain
commonly used modules like storm-kafka brought into the Apache Storm
project. Another benefit worth considering is the licensing/legal oversight
that the ASF provides, which is important to many users.

If this is something we want to do, then the big question becomes what sort
of governance process needs to be established to ensure that such things are
properly maintained.

Some random thoughts, questions, etc. that jump to mind include:

What to call these things: "contrib modules", "connectors", "integration
modules", etc.?
Build integration: I imagine they would be a multi-module submodule of the
main maven build. Probably turned off by default and enabled by a maven
profile.
Governance: Have one or more committer volunteers responsible for
maintenance, merging patches, etc.? Proposal process for pulling new
modules?


I look forward to hearing others' opinions.

- Taylor


[1] https://issues.apache.org/jira/browse/STORM-206
[2] https://github.com/nathanmarz/storm-contrib
[3] https://github.com/stormprocessor
[4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
[5] https://github.com/wurstmeister/storm-kafka-0.8-plus



--
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org/



--
Twitter: @nathanmarz
http://nathanmarz.com/
