Re: Flume Graduation (was Re: June reports in two weeks)
On May 23, 2012, at 10:48 PM, Patrick Hunt wrote: On Wed, May 23, 2012 at 10:36 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:15 PM, Benson Margulies wrote: On Wed, May 23, 2012 at 10:09 PM, Ralph Goers ralph.go...@dslextreme.com wrote: Right after I read Jukka's email that started this thread, I posted my reply and discovered to my shock that they had started a graduation vote. I am shocked because I have pointed out repeatedly the project's complete lack of diversity. Virtually all the active PMC members and committers work for the same employer. I have told them several times that I would actually like to participate in the project, but the way the project works is very different than every other project I am involved with at the ASF, and the barriers to figuring out what is actually going on are very high. Almost nothing is discussed directly on the dev list - it is all done through Jira issues or the Review tool. While all the Jira issue updates and reviews are sent to the dev list, most of that is just noise. Feel free to review the dev list archives to see what I am talking about. I don't follow flume, but I'd propose to soften your objection only slightly. I've met other groups of people who like a JIRA centric view of the world. I suspect that if they did a bunch of other good things called out below, you or others would find the JIRA business digestible. Also, on the other hand, I fear that the co-employed contributors are collaborating in the hallway, and the lack of the context in JIRA or on the list is contributing to the problem. I have reason to doubt the collaboration in the hallway aspect and I certainly do not doubt everyone's good intent. I'm not objecting to the collaboration style as an issue preventing graduation. I'm just saying I find it difficult to participate with that style and that simply makes me wonder if that is making it harder to attract new committers.
I fully realize that that issue might just be with me, but the fact remains that there is practically no diversity in the project and I cannot in good conscience recommend graduation for a project in that situation. Hi Ralph, Benson, et al., some background: Flume is similar to Hadoop and other related projects in that it is very jira heavy for development activity. No slouch in terms of mailing list traffic either though (1200 last month): http://flume.markmail.org/ Also note the extensive new developer type detail that's available on the web/wiki: https://cwiki.apache.org/confluence/display/FLUME/Index The team list can provide insight into the diversity issue http://incubator.apache.org/flume/team-list.html My understanding is that there are at least 4 separate organizations represented by active committers. The team list is incorrect and is somewhat misleading. To my knowledge at least two separate organizations represented in that list are now employed by Cloudera. Others signed on when the project entered the incubator but have never participated. This all became clear to me during the last release vote when, as I recall, I cast the only binding vote that didn't come from a Cloudera employee. Ralph Regards, Patrick Needless to say, when the graduation proposal reaches this list, and I'm sure it will, I will strongly urge the IPMC to reject the proposal. FWIW, I found the post below to be 100% on target. Ralph On May 23, 2012, at 7:31 PM, Marvin Humphrey wrote: On Wed, May 23, 2012 at 5:36 PM, Patrick Hunt ph...@apache.org wrote: Perhaps someone will have some insight on how to gather new contributors that hasn't been tried yet? Jukka's written on this subject multiple times in the past.
Here are two gems, one from a while back, the other recent: http://markmail.org/message/o3gbgam4ny2upqte Most of the cases I've been involved in so far of podlings in the "hoping some more people come along" state have had symptoms of the project team not paying enough attention to making it easy for new contributors to show up and stick around. Things like complex and undocumented build steps, missing "Getting started" or "Getting involved" guides, lack of quick and positive feedback to newcomers, etc., are all too common. Fixing even just some of such things will dramatically increase the odds of new people showing up. Those are things that are very easy to overlook when you're working on your first open source projects (it took me years to learn those lessons), but we here have a massive amount of collective experience on such things. That's what we could and IMHO should be sharing with the podlings. That's what mentoring to me is about and that's where our most precious added value is. Otherwise incubation just boils down to an indoctrination period on how to apply and conform to the various Apache
Re: Flume Graduation (was Re: June reports in two weeks)
On May 23, 2012, at 10:48 PM, Patrick Hunt wrote: On Wed, May 23, 2012 at 10:36 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:15 PM, Benson Margulies wrote: On Wed, May 23, 2012 at 10:09 PM, Ralph Goers ralph.go...@dslextreme.com wrote: Right after I read Jukka's email that started this thread, I posted my reply and discovered to my shock that they had started a graduation vote. I am shocked because I have pointed out repeatedly the project's complete lack of diversity. Virtually all the active PMC members and committers work for the same employer. I have told them several times that I would actually like to participate in the project, but the way the project works is very different than every other project I am involved with at the ASF, and the barriers to figuring out what is actually going on are very high. Almost nothing is discussed directly on the dev list - it is all done through Jira issues or the Review tool. While all the Jira issue updates and reviews are sent to the dev list, most of that is just noise. Feel free to review the dev list archives to see what I am talking about. I don't follow flume, but I'd propose to soften your objection only slightly. I've met other groups of people who like a JIRA centric view of the world. I suspect that if they did a bunch of other good things called out below, you or others would find the JIRA business digestible. Also, on the other hand, I fear that the co-employed contributors are collaborating in the hallway, and the lack of the context in JIRA or on the list is contributing to the problem. I have reason to doubt the collaboration in the hallway aspect and I certainly do not doubt everyone's good intent. I'm not objecting to the collaboration style as an issue preventing graduation. I'm just saying I find it difficult to participate with that style and that simply makes me wonder if that is making it harder to attract new committers.
I fully realize that that issue might just be with me, but the fact remains that there is practically no diversity in the project and I cannot in good conscience recommend graduation for a project in that situation. Hi Ralph, Benson, et al., some background: Flume is similar to Hadoop and other related projects in that it is very jira heavy for development activity. No slouch in terms of mailing list traffic either though (1200 last month): http://flume.markmail.org/ Sorry I didn't include this in my prior post, but here you are making my point exactly. I participate in several other Apache projects. Wading through 1200+ emails per month that are largely Jira/Review noise makes it very difficult for me to find posts that have any value. As a consequence I am largely forced to simply delete everything generated by the Review tool and Jira. And I'm a mentor. I just don't see how newcomers are going to find this style welcoming. Ralph - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: Flume Graduation (was Re: June reports in two weeks)
On Wed, May 23, 2012 at 11:18 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:48 PM, Patrick Hunt wrote: On Wed, May 23, 2012 at 10:36 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:15 PM, Benson Margulies wrote: On Wed, May 23, 2012 at 10:09 PM, Ralph Goers ralph.go...@dslextreme.com wrote: Right after I read Jukka's email that started this thread, I posted my reply and discovered to my shock that they had started a graduation vote. I am shocked because I have pointed out repeatedly the project's complete lack of diversity. Virtually all the active PMC members and committers work for the same employer. I have told them several times that I would actually like to participate in the project, but the way the project works is very different than every other project I am involved with at the ASF, and the barriers to figuring out what is actually going on are very high. Almost nothing is discussed directly on the dev list - it is all done through Jira issues or the Review tool. While all the Jira issue updates and reviews are sent to the dev list, most of that is just noise. Feel free to review the dev list archives to see what I am talking about. I don't follow flume, but I'd propose to soften your objection only slightly. I've met other groups of people who like a JIRA centric view of the world. I suspect that if they did a bunch of other good things called out below, you or others would find the JIRA business digestible. Also, on the other hand, I fear that the co-employed contributors are collaborating in the hallway, and the lack of the context in JIRA or on the list is contributing to the problem. I have reason to doubt the collaboration in the hallway aspect and I certainly do not doubt everyone's good intent. I'm not objecting to the collaboration style as an issue preventing graduation.
I'm just saying I find it difficult to participate with that style and that simply makes me wonder if that is making it harder to attract new committers. I fully realize that that issue might just be with me, but the fact remains that there is practically no diversity in the project and I cannot in good conscience recommend graduation for a project in that situation. Hi Ralph, Benson, et al., some background: Flume is similar to Hadoop and other related projects in that it is very jira heavy for development activity. No slouch in terms of mailing list traffic either though (1200 last month): http://flume.markmail.org/ Sorry I didn't include this in my prior post, but here you are making my point exactly. I participate in several other Apache projects. Wading through 1200+ emails per month that are largely Jira/Review noise makes it very difficult for me to find posts that have any value. As a consequence I am largely forced to simply delete everything generated by the Review tool and Jira. And I'm a mentor. I just don't see how newcomers are going to find this style welcoming. There are separate lists; it's just that markmail clubs them all together. It's also pretty easy to filter... Patrick
Re: Flume Graduation (was Re: June reports in two weeks)
On Wed, May 23, 2012 at 10:36 PM, Ralph Goers ralph.go...@dslextreme.com wrote: I'm just saying I find it difficult to participate with that style and that simply makes me wonder if that is making it harder to attract new committers. I suspect it attracts some and drives away others. I frikkin' hate JIRA notifications. The emails suck, so newcomers are forced to learn JIRA's interface before they can participate fully in the dev conversation. To me that seems like it raises a barrier to entry -- but then, there are numerous projects around the ASF who are not hurting for contributors and who use JIRA for *everything* -- starting with Hadoop and Lucene. If you don't want JIRA-centric development, it can be curtailed by sending notifications to a dedicated issues list instead of the dev list. However, I would not necessarily recommend that to a new podling, as I can't tell where my own biases end and I don't want to start a phpBB^H^H^H^H^HJIRA vs email flame war. -- Marvin Humphrey, who in moments of weakness fantasizes about the day when Infra can no longer keep the massive ASF JIRA instance from toppling over and all the Java projects whose participation in Infra is limited to complaining when stuff goes offline come crying.
Re: Flume Graduation (was Re: June reports in two weeks)
I appreciate your position Ralph and I don't want anyone to feel like they can't contribute. As we've talked about before, we've been quick to nurture new contributors to committer status successfully in a few cases. It's true that some of the more active committers are from Cloudera, but it's not to the exclusion of anyone. Others aren't from Cloudera. Those of us that work together are also very strict about abiding by the "if it's not on the mailing list, it didn't happen" rule (where mailing list can mean JIRA or other ASF infrastructure as well). I'm happy to take your guidance as a mentor, but you also need to understand that some of the ways the Flume project has elected to operate are just a matter of taste. They were proposed, discussed, voted on (and not as a bloc by Cloudera employees, IIRC - pretty sure I was -0), and put in place and do not violate the Apache Way (like RTC vs. CTR). They aren't unheard of and they do not work to the exclusion of contributors (RTC, for instance, only impacts committers). I think the vote that was started was only to gauge community opinion as a first step (although I'm not completely well versed in the graduation process, to be honest). If there are concrete things we can do to improve diversity, in your opinion, I am extremely open to hearing them. We already do many of the (excellent) things listed earlier in the thread. JIRA noise notwithstanding (again, it's a matter of taste - I use the email frequently as I find trolling through JIRA slow) I'm definitely open to ideas. Of course, if Flume simply needs to remain in the incubator until we develop greater diversity, that's fine too. If we're not ready, we're just not ready.
On Wed, May 23, 2012 at 11:18 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:48 PM, Patrick Hunt wrote: On Wed, May 23, 2012 at 10:36 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:15 PM, Benson Margulies wrote: On Wed, May 23, 2012 at 10:09 PM, Ralph Goers ralph.go...@dslextreme.com wrote: Right after I read Jukka's email that started this thread, I posted my reply and discovered to my shock that they had started a graduation vote. I am shocked because I have pointed out repeatedly the project's complete lack of diversity. Virtually all the active PMC members and committers work for the same employer. I have told them several times that I would actually like to participate in the project, but the way the project works is very different than every other project I am involved with at the ASF, and the barriers to figuring out what is actually going on are very high. Almost nothing is discussed directly on the dev list - it is all done through Jira issues or the Review tool. While all the Jira issue updates and reviews are sent to the dev list, most of that is just noise. Feel free to review the dev list archives to see what I am talking about. I don't follow flume, but I'd propose to soften your objection only slightly. I've met other groups of people who like a JIRA centric view of the world. I suspect that if they did a bunch of other good things called out below, you or others would find the JIRA business digestible. Also, on the other hand, I fear that the co-employed contributors are collaborating in the hallway, and the lack of the context in JIRA or on the list is contributing to the problem. I have reason to doubt the collaboration in the hallway aspect and I certainly do not doubt everyone's good intent. I'm not objecting to the collaboration style as an issue preventing graduation.
I'm just saying I find it difficult to participate with that style and that simply makes me wonder if that is making it harder to attract new committers. I fully realize that that issue might just be with me, but the fact remains that there is practically no diversity in the project and I cannot in good conscience recommend graduation for a project in that situation. Hi Ralph, Benson, et al., some background: Flume is similar to Hadoop and other related projects in that it is very jira heavy for development activity. No slouch in terms of mailing list traffic either though (1200 last month): http://flume.markmail.org/ Sorry I didn't include this in my prior post, but here you are making my point exactly. I participate in several other Apache projects. Wading through 1200+ emails per month that are largely Jira/Review noise makes it very difficult for me to find posts that have any value. As a consequence I am largely forced to simply delete everything generated by the Review tool and Jira. And I'm a mentor. I just don't see how newcomers are going to find this style welcoming. Ralph -- Eric Sammer twitter:
Re: Flume Graduation (was Re: June reports in two weeks)
The ONLY issue I see for Flume to graduate is diversity. No one will convince me that the current makeup constitutes diversity of any kind. Perhaps I shouldn't have brought up the mailing list issues as that was only meant in the spirit of trying to offer some advice on how more diversity could be achieved. Flume is really the only community I participate in that contains Cloudera employees so I do find myself wondering if the way the project is run is because that is the way all projects with a large number of Cloudera employees are run. That might make all of those participants comfortable but might create a barrier to others. In any case - I'm not insisting that the way the project is run needs to change. I'm simply saying I cannot support graduation with the current makeup of the committers and PMC. I don't have a hard and fast ratio - gaining 10 new unaffiliated committers who don't do much isn't nearly as good as 2 or 3 who are very active. Ultimately the project needs to figure out how to solve this. Ralph On May 23, 2012, at 11:48 PM, Eric Sammer wrote: I appreciate your position Ralph and I don't want anyone to feel like they can't contribute. As we've talked about before, we've been quick to nurture new contributors to committer status successfully in a few cases. It's true that some of the more active committers are from Cloudera, but it's not to the exclusion of anyone. Others aren't from Cloudera. Those of us that work together are also very strict about abiding by the "if it's not on the mailing list, it didn't happen" rule (where mailing list can mean JIRA or other ASF infrastructure as well). I'm happy to take your guidance as a mentor, but you also need to understand that some of the ways the Flume project has elected to operate are just a matter of taste. They were proposed, discussed, voted on (and not as a bloc by Cloudera employees, IIRC - pretty sure I was -0), and put in place and do not violate the Apache Way (like RTC vs. CTR).
They aren't unheard of and they do not work to the exclusion of contributors (RTC, for instance, only impacts committers). I think the vote that was started was only to gauge community opinion as a first step (although I'm not completely well versed in the graduation process, to be honest). If there are concrete things we can do to improve diversity, in your opinion, I am extremely open to hearing them. We already do many of the (excellent) things listed earlier in the thread. JIRA noise notwithstanding (again, it's a matter of taste - I use the email frequently as I find trolling through JIRA slow) I'm definitely open to ideas. Of course, if Flume simply needs to remain in the incubator until we develop greater diversity, that's fine too. If we're not ready, we're just not ready. On Wed, May 23, 2012 at 11:18 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:48 PM, Patrick Hunt wrote: On Wed, May 23, 2012 at 10:36 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:15 PM, Benson Margulies wrote: On Wed, May 23, 2012 at 10:09 PM, Ralph Goers ralph.go...@dslextreme.com wrote: Right after I read Jukka's email that started this thread, I posted my reply and discovered to my shock that they had started a graduation vote. I am shocked because I have pointed out repeatedly the project's complete lack of diversity. Virtually all the active PMC members and committers work for the same employer. I have told them several times that I would actually like to participate in the project, but the way the project works is very different than every other project I am involved with at the ASF, and the barriers to figuring out what is actually going on are very high. Almost nothing is discussed directly on the dev list - it is all done through Jira issues or the Review tool. While all the Jira issue updates and reviews are sent to the dev list, most of that is just noise. Feel free to review the dev list archives to see what I am talking about.
I don't follow flume, but I'd propose to soften your objection only slightly. I've met other groups of people who like a JIRA centric view of the world. I suspect that if they did a bunch of other good things called out below, you or others would find the JIRA business digestible. Also, on the other hand, I fear that the co-employed contributors are collaborating in the hallway, and the lack of the context in JIRA or on the list is contributing to the problem. I have reason to doubt the collaboration in the hallway aspect and I certainly do not doubt everyone's good intent. I'm not objecting to the collaboration style as an issue preventing graduation. I'm just saying I find it difficult to participate with that style and that simply makes me wonder if that is making it harder to attract new committers. I fully realize that that issue might just be with me, but the fact remains that there is
Re: [VOTE] Accept Crunch into the Apache Incubator
[X ] +1, bring Crunch into Incubator -Bertrand
Re: [VOTE] Accept Crunch into the Apache Incubator
+1 Tommaso 2012/5/23 Josh Wills jwi...@cloudera.com I would like to call a vote for accepting Apache Crunch for incubation in the Apache Incubator. The full proposal is available below. We ask the Incubator PMC to sponsor it, with phunt as Champion, and phunt, tomwhite, and acmurthy volunteering to be Mentors. Please cast your vote: [ ] +1, bring Crunch into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Crunch into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. http://wiki.apache.org/incubator/CrunchProposal Proposal text from the wiki: -- = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala = == Abstract == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. == Proposal == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a high-level API for writing and testing complex !MapReduce jobs that require multiple processing stages. It has a simple, flexible, and extensible data model that makes it ideal for processing data that does not naturally fit into a relational structure, such as time series and serialized object formats like JSON and Avro. It supports running pipelines either as a series of !MapReduce jobs on an Apache Hadoop cluster or in memory on a single machine for fast testing and debugging. == Background == Crunch was initially developed by Cloudera to simplify the process of creating sequences of dependent !MapReduce jobs, especially jobs that processed non-relational data like time series. Its design was based on a paper Google published about a Java library they developed called !FlumeJava that was created in order to solve a similar class of problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache 2.0 licensed project in October 2011. 
During this time Crunch has been formally released twice, as versions 0.1.0 (October 2010) and 0.2.0 (February 2012), with an incremental update to version 0.2.1 (March 2012) . These releases are also distributed by Cloudera as source and binaries from Cloudera's Maven repository. == Rationale == Most of the interesting analytical and data processing tasks that are run on an Apache Hadoop cluster require a series of !MapReduce jobs to be executed in sequence. Developers who are creating these pipelines today need to manually assign the sequence of tasks to perform in a dependent chain of !MapReduce jobs, even though there are a number of well-known patterns for fusing dependent computations together into a single !MapReduce stage and for performing common types of joins and aggregations. This results in !MapReduce pipelines that are more difficult to test, maintain, and extend to support new functionality. Furthermore, the type of data that is being stored and processed using Apache Hadoop is evolving. Although Hadoop was originally used for storing large volumes of structured text in the form of webpages and log files, it is now common for Hadoop to store complex, structured data formats such as JSON, Apache Avro, and Apache Thrift. These formats allow developers to work with serialized objects in programming languages like Java, C++, and Python, and allow for new types of analysis to be performed on complex data types. Hadoop has also been adopted by the scientific research community, who are using Hadoop to process time series data, structured binary files in the HDF5 format, and large medical and satellite images. Crunch addresses these challenges by providing a lightweight and extensible Java API for defining the stages of a data processing pipeline, which can then be run on an Apache Hadoop cluster as a sequence of dependent !MapReduce jobs, or in-memory on a single machine to facilitate fast testing and debugging. 
Crunch relies on a small set of primitive abstractions that represent immutable, distributed collections of objects. Developers define functions that are applied to those objects in order to generate new immutable, distributed collections of objects. Crunch also provides a library of common !MapReduce patterns for performing efficient joins and aggregation operations over these distributed collections that developers may integrate into their own pipelines. Crunch also provides native support for processing structured binary data formats like JSON, Apache Avro, and Apache Thrift, and is designed to be extensible to support working with any kind of data format that Java supports in its native form. == Initial Goals == Crunch is currently in its first major release with a considerable number of enhancement requests, tasks, and issues recorded towards its future development. The initial goal
[RESULT] [VOTE] Release Apache Wookie 0.10.0-incubating (General Incubation List)
The 72 hour voting period has passed and the vote is now closed. Thanks to everyone who took time to review the release.

With the three IPMC member votes (3 of them mentors) and 3 PPMC votes, the vote succeeds.

IPMC Member voting record:
* Ate Douma: +1
* Ross Gardler: +1
* Matt Franklin: +1

* Denotes an IPMC member vote cast on the wookie-dev list.

Thanks,
Scott.

On 21 May 2012, at 16:13, Scott Wilson wrote:

This is the third incubator release for Apache Wookie, with the artifacts being versioned as 0.10.0-incubating. We are requesting a lazy consensus vote, as we have already received 3 binding IPMC +1 votes during the release voting on wookie-dev -

Vote thread: http://markmail.org/message/2p4veen6n22w7hnb
Result: http://markmail.org/message/d2jzbrdgic3od5uj
Svn source tag: https://svn.apache.org/repos/asf/incubator/wookie/tags/0.10.0-incubating/
Release notes: https://svn.apache.org/repos/asf/incubator/wookie/tags/0.10.0-incubating/RELEASE_NOTES
Release artifacts: http://people.apache.org/builds/incubator/wookie/0.10.0-incubating/
Maven artifacts: https://repository.apache.org/content/repositories/orgapachewookie-094/
PGP release keys: https://svn.apache.org/repos/asf/incubator/wookie/KEYS

Lazy consensus, vote open for 72 hours.

[ ] +1 approve
[ ] +0 no opinion
[ ] -1 disapprove (and reason why)
RE: [VOTE] Accept Crunch into the Apache Incubator
+1 (binding) -Original Message- From: Josh Wills [mailto:jwi...@cloudera.com] Sent: Wednesday, May 23, 2012 2:46 PM To: general@incubator.apache.org Subject: [VOTE] Accept Crunch into the Apache Incubator I would like to call a vote for accepting Apache Crunch for incubation in the Apache Incubator. The full proposal is available below. We ask the Incubator PMC to sponsor it, with phunt as Champion, and phunt, tomwhite, and acmurthy volunteering to be Mentors. Please cast your vote: [ ] +1, bring Crunch into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Crunch into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. http://wiki.apache.org/incubator/CrunchProposal Proposal text from the wiki: --- --- = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala = == Abstract == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. == Proposal == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a high-level API for writing and testing complex !MapReduce jobs that require multiple processing stages. It has a simple, flexible, and extensible data model that makes it ideal for processing data that does not naturally fit into a relational structure, such as time series and serialized object formats like JSON and Avro. It supports running pipelines either as a series of !MapReduce jobs on an Apache Hadoop cluster or in memory on a single machine for fast testing and debugging. == Background == Crunch was initially developed by Cloudera to simplify the process of creating sequences of dependent !MapReduce jobs, especially jobs that processed non-relational data like time series. Its design was based on a paper Google published about a Java library they developed called !FlumeJava that was created in order to solve a similar class of problems. 
Crunch was open-sourced by Cloudera on !GitHub as an Apache 2.0 licensed project in October 2011. During this time Crunch has been formally released twice, as versions 0.1.0 (October 2010) and 0.2.0 (February 2012), with an incremental update to version 0.2.1 (March 2012) . These releases are also distributed by Cloudera as source and binaries from Cloudera's Maven repository. == Rationale == Most of the interesting analytical and data processing tasks that are run on an Apache Hadoop cluster require a series of !MapReduce jobs to be executed in sequence. Developers who are creating these pipelines today need to manually assign the sequence of tasks to perform in a dependent chain of !MapReduce jobs, even though there are a number of well-known patterns for fusing dependent computations together into a single !MapReduce stage and for performing common types of joins and aggregations. This results in !MapReduce pipelines that are more difficult to test, maintain, and extend to support new functionality. Furthermore, the type of data that is being stored and processed using Apache Hadoop is evolving. Although Hadoop was originally used for storing large volumes of structured text in the form of webpages and log files, it is now common for Hadoop to store complex, structured data formats such as JSON, Apache Avro, and Apache Thrift. These formats allow developers to work with serialized objects in programming languages like Java, C++, and Python, and allow for new types of analysis to be performed on complex data types. Hadoop has also been adopted by the scientific research community, who are using Hadoop to process time series data, structured binary files in the HDF5 format, and large medical and satellite images. 
Crunch addresses these challenges by providing a lightweight and extensible Java API for defining the stages of a data processing pipeline, which can then be run on an Apache Hadoop cluster as a sequence of dependent !MapReduce jobs, or in-memory on a single machine to facilitate fast testing and debugging. Crunch relies on a small set of primitive abstractions that represent immutable, distributed collections of objects. Developers define functions that are applied to those objects in order to generate new immutable, distributed collections of objects. Crunch also provides a library of common !MapReduce patterns for performing efficient joins and aggregation operations over these distributed collections that developers may integrate into their own pipelines. Crunch also provides native support for processing structured binary data formats like JSON, Apache Avro, and Apache Thrift, and is designed to be extensible to support working with any kind of data format that Java supports in its native form. == Initial Goals == Crunch is currently in its first major release with a considerable number of enhancement
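As a rough illustration of the collection model the proposal describes (immutable collections, with developer-defined functions producing new collections), here is a minimal in-memory sketch. It is an assumption-laden toy and not Crunch's API: the names ToyCollection, parallel_do, count and materialize are hypothetical, and nothing here runs on Hadoop.

```python
from collections import defaultdict
from typing import Callable, Generic, Iterable, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class ToyCollection(Generic[T]):
    """Hypothetical in-memory stand-in for an immutable collection of objects."""

    def __init__(self, items: Iterable[T]) -> None:
        # Store a tuple so the contents cannot be changed after construction.
        self._items: Tuple[T, ...] = tuple(items)

    def parallel_do(self, fn: Callable[[T], Iterable[U]]) -> "ToyCollection[U]":
        # Apply fn to every element; fn may emit zero or more outputs.
        # The result is a new collection; this one is left untouched.
        return ToyCollection(out for item in self._items for out in fn(item))

    def count(self) -> "ToyCollection[Tuple[T, int]]":
        # A simple aggregation: count occurrences of each distinct element.
        totals = defaultdict(int)
        for item in self._items:
            totals[item] += 1
        return ToyCollection(sorted(totals.items()))

    def materialize(self) -> Tuple[T, ...]:
        # Return the elements for inspection (assumed helper for this sketch).
        return self._items


# A two-stage "pipeline": split lines into words, then count the words.
lines = ToyCollection(["to be or", "not to be"])
words = lines.parallel_do(lambda line: line.split())
counts = words.count()
```

Each stage returns a new collection rather than modifying its input, which is what lets the same pipeline definition be tested in memory and, in a real system, planned as a sequence of dependent jobs.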
Re: [VOTE] Accept Crunch into the Apache Incubator
+1 (binding) ... And a friendly reminder to the PPMC via their mentors -- in response to that email about limiting the initial committers to a tight group. They will soon be learning that the big challenge of incubation is not writing a lot of code, it's recruiting new faces. They'll want to switch from 'just us chickens' to putting out the welcome mat as soon as possible. On Thu, May 24, 2012 at 2:29 AM, Bertrand Delacretaz bdelacre...@apache.org wrote: [X ] +1, bring Crunch into Incubator -Bertrand - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
[ANNOUNCE] Apache Wink 1.2.0-incubating release
The Apache Wink team is pleased to announce the release of Apache Wink 1.2.0-incubating. Apache Wink is a simple yet solid framework for building RESTful Web services. It is comprised of a Server module and a Client module for developing and consuming RESTful Web services. The Wink Server module is a complete implementation of the JAX-RS v1.1 specification. On top of this implementation, the Wink Server module provides a set of additional features that were designed to facilitate the development of RESTful Web services. The Wink Client module is a Java based framework that provides functionality for communicating with RESTful Web services. The framework is built on top of the JDK HttpURLConnection and adds essential features that facilitate the development of such client applications. For full details about the release and to download the distributions please go to: http://incubator.apache.org/wink/downloads.html Apache Wink welcomes your help. Any contribution, including code, testing, contributions to the documentation, or bug reporting is always appreciated. For more information on how to get involved in Apache Wink visit the website at: http://incubator.apache.org/wink/ Thank you for your interest in Apache Wink! The Apache Wink Team.
Re: [VOTE] Accept Crunch into the Apache Incubator
[x] +1, bring Crunch into Incubator (non-binding) Regards, Mike On Wednesday, May 23, 2012 at 11:45 AM, Josh Wills wrote: I would like to call a vote for accepting Apache Crunch for incubation in the Apache Incubator. The full proposal is available below. We ask the Incubator PMC to sponsor it, with phunt as Champion, and phunt, tomwhite, and acmurthy volunteering to be Mentors. Please cast your vote: [ ] +1, bring Crunch into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Crunch into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. http://wiki.apache.org/incubator/CrunchProposal Proposal text from the wiki: -- = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala = == Abstract == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. == Proposal == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a high-level API for writing and testing complex !MapReduce jobs that require multiple processing stages. It has a simple, flexible, and extensible data model that makes it ideal for processing data that does not naturally fit into a relational structure, such as time series and serialized object formats like JSON and Avro. It supports running pipelines either as a series of !MapReduce jobs on an Apache Hadoop cluster or in memory on a single machine for fast testing and debugging. == Background == Crunch was initially developed by Cloudera to simplify the process of creating sequences of dependent !MapReduce jobs, especially jobs that processed non-relational data like time series. Its design was based on a paper Google published about a Java library they developed called !FlumeJava that was created in order to solve a similar class of problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache 2.0 licensed project in October 2011. 
During this time Crunch has been formally released twice, as versions 0.1.0 (October 2010) and 0.2.0 (February 2012), with an incremental update to version 0.2.1 (March 2012) . These releases are also distributed by Cloudera as source and binaries from Cloudera's Maven repository. == Rationale == Most of the interesting analytical and data processing tasks that are run on an Apache Hadoop cluster require a series of !MapReduce jobs to be executed in sequence. Developers who are creating these pipelines today need to manually assign the sequence of tasks to perform in a dependent chain of !MapReduce jobs, even though there are a number of well-known patterns for fusing dependent computations together into a single !MapReduce stage and for performing common types of joins and aggregations. This results in !MapReduce pipelines that are more difficult to test, maintain, and extend to support new functionality. Furthermore, the type of data that is being stored and processed using Apache Hadoop is evolving. Although Hadoop was originally used for storing large volumes of structured text in the form of webpages and log files, it is now common for Hadoop to store complex, structured data formats such as JSON, Apache Avro, and Apache Thrift. These formats allow developers to work with serialized objects in programming languages like Java, C++, and Python, and allow for new types of analysis to be performed on complex data types. Hadoop has also been adopted by the scientific research community, who are using Hadoop to process time series data, structured binary files in the HDF5 format, and large medical and satellite images. Crunch addresses these challenges by providing a lightweight and extensible Java API for defining the stages of a data processing pipeline, which can then be run on an Apache Hadoop cluster as a sequence of dependent !MapReduce jobs, or in-memory on a single machine to facilitate fast testing and debugging. 
Crunch relies on a small set of primitive abstractions that represent immutable, distributed collections of objects. Developers define functions that are applied to those objects in order to generate new immutable, distributed collections of objects. Crunch also provides a library of common !MapReduce patterns for performing efficient joins and aggregation operations over these distributed collections that developers may integrate into their own pipelines. Crunch also provides native support for processing structured binary data formats like JSON, Apache Avro, and Apache Thrift, and is designed to be extensible to support working with any kind of data format that Java supports in its native form. == Initial Goals == Crunch is currently in its first major release with a considerable number of enhancement requests, tasks,
Re: [VOTE] Accept Crunch into the Apache Incubator
+1 (binding) On May 23, 2012, at 11:45 AM, Josh Wills wrote: I would like to call a vote for accepting Apache Crunch for incubation in the Apache Incubator. The full proposal is available below. We ask the Incubator PMC to sponsor it, with phunt as Champion, and phunt, tomwhite, and acmurthy volunteering to be Mentors. Please cast your vote: [ ] +1, bring Crunch into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Crunch into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. http://wiki.apache.org/incubator/CrunchProposal Proposal text from the wiki: -- = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala = == Abstract == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. == Proposal == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a high-level API for writing and testing complex !MapReduce jobs that require multiple processing stages. It has a simple, flexible, and extensible data model that makes it ideal for processing data that does not naturally fit into a relational structure, such as time series and serialized object formats like JSON and Avro. It supports running pipelines either as a series of !MapReduce jobs on an Apache Hadoop cluster or in memory on a single machine for fast testing and debugging. == Background == Crunch was initially developed by Cloudera to simplify the process of creating sequences of dependent !MapReduce jobs, especially jobs that processed non-relational data like time series. Its design was based on a paper Google published about a Java library they developed called !FlumeJava that was created in order to solve a similar class of problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache 2.0 licensed project in October 2011. 
During this time Crunch has been formally released twice, as versions 0.1.0 (October 2010) and 0.2.0 (February 2012), with an incremental update to version 0.2.1 (March 2012) . These releases are also distributed by Cloudera as source and binaries from Cloudera's Maven repository. == Rationale == Most of the interesting analytical and data processing tasks that are run on an Apache Hadoop cluster require a series of !MapReduce jobs to be executed in sequence. Developers who are creating these pipelines today need to manually assign the sequence of tasks to perform in a dependent chain of !MapReduce jobs, even though there are a number of well-known patterns for fusing dependent computations together into a single !MapReduce stage and for performing common types of joins and aggregations. This results in !MapReduce pipelines that are more difficult to test, maintain, and extend to support new functionality. Furthermore, the type of data that is being stored and processed using Apache Hadoop is evolving. Although Hadoop was originally used for storing large volumes of structured text in the form of webpages and log files, it is now common for Hadoop to store complex, structured data formats such as JSON, Apache Avro, and Apache Thrift. These formats allow developers to work with serialized objects in programming languages like Java, C++, and Python, and allow for new types of analysis to be performed on complex data types. Hadoop has also been adopted by the scientific research community, who are using Hadoop to process time series data, structured binary files in the HDF5 format, and large medical and satellite images. Crunch addresses these challenges by providing a lightweight and extensible Java API for defining the stages of a data processing pipeline, which can then be run on an Apache Hadoop cluster as a sequence of dependent !MapReduce jobs, or in-memory on a single machine to facilitate fast testing and debugging. 
Crunch relies on a small set of primitive abstractions that represent immutable, distributed collections of objects. Developers define functions that are applied to those objects in order to generate new immutable, distributed collections of objects. Crunch also provides a library of common !MapReduce patterns for performing efficient joins and aggregation operations over these distributed collections that developers may integrate into their own pipelines. Crunch also provides native support for processing structured binary data formats like JSON, Apache Avro, and Apache Thrift, and is designed to be extensible to support working with any kind of data format that Java supports in its native form. == Initial Goals == Crunch is currently in its first major release with a considerable number of enhancement requests, tasks, and issues recorded towards its future
Re: Flume Graduation (was Re: June reports in two weeks)
On May 24, 2012, at 12:20 AM, Ralph Goers ralph.go...@dslextreme.com wrote: The ONLY issue I see for Flume to graduate is diversity. No one will convince me that the current makeup constitutes diversity of any kind. Perhaps I shouldn't have brought up the mailing list issues as that was only meant in the spirit of trying to offer some advice on how more diversity could be achieved. Flume is really the only community I participate in that contains Cloudera employees so I do find myself wondering if the way the project is run is because that is the way all projects with a large number of Cloudera employees are run. That might make all of those participants comfortable but might create a barrier to others. There are others where this is the case that are easily referenceable. There's an obvious (to me) implication that this is the cause of the problem and that's simply not true. If there are concrete recommendations of things you feel we can do better I know the flume community is open to those sightings. There's no practice in place within flume that isn't in place in some other ASF TLP to my knowledge. In any case - I'm not insisting that the way the project is run needs to change. I'm simply saying I cannot support graduation with the current makeup of the committers and PMC. I don't have a hard and fast ratio - gaining 10 new unaffiliated committers who don't do much isn't nearly as good as 2 or 3 who are very active. Ultimately the project needs to figure out how to solve this. That's fine. So let's have a discussion about actionable tasks. I've mentioned my thoughts on growing diversity in the past, although admittedly it was within a response to a similar thread on our private list. I'll start a thread on our dev list with the same thoughts for the larger community to comment on. I welcome your contribution to such a discussion! Thanks. 
Ralph On May 23, 2012, at 11:48 PM, Eric Sammer wrote: I appreciate your position Ralph and I don't want anyone to feel like they can't contribute. As we've talked about before, we've been quick to nurture new contributors to committer status successfully in a few cases. It's true that some of the more active committers are from Cloudera, but it's not to the exclusion of anyone. Others aren't from Cloudera. Those of us that work together are also very strict about abiding by the "if it's not on the mailing list, it didn't happen" rule (where mailing list can mean JIRA or other ASF infrastructure as well). I'm happy to take your guidance as a mentor, but you also need to understand that some of the ways the Flume project has elected to operate are just a matter of taste. They were proposed, discussed, voted on (and not as a bloc by Cloudera employees, IIRC - pretty sure I was -0), and put in place and do not violate the Apache Way (like RTC vs. CTR). They aren't unheard of and they do not work to the exclusion of contributors (RTC, for instance, only impacts committers). I think the vote that was started was only to gauge community opinion as a first step (although I'm not completely well versed in the graduation process, to be honest). If there are concrete things we can do to improve diversity, in your opinion, I am extremely open to hearing them. We already do many of the (excellent) things listed earlier in the thread. JIRA noise notwithstanding (again, it's a matter of taste - I use the email frequently as I find trolling through JIRA slow) I'm definitely open to ideas. Of course, if Flume simply needs to remain in the incubator until we develop greater diversity, that's fine too. If we're not ready, we're just not ready. 
On Wed, May 23, 2012 at 11:18 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:48 PM, Patrick Hunt wrote: On Wed, May 23, 2012 at 10:36 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:15 PM, Benson Margulies wrote: On Wed, May 23, 2012 at 10:09 PM, Ralph Goers ralph.go...@dslextreme.com wrote: Right after I read Jukka's email that started this thread and I posted my reply and discovered to my shock that they had started a graduation vote. I am shocked because I have pointed out repeatedly the project's complete lack of diversity. Virtually all the active PMC members and committers work for the same employer. I have told them several times that I would actually like to participate in the project but the way the project works is very different than every other project I am involved with at the ASF and the barriers to figure out what is actually going on are very high. Almost nothing is discussed directly on the dev list - it is all done through Jira issues or the Review tool. While all the Jira issue updates and reviews are sent to the dev list most of that is just noise. Feel free to review the dev list archives to see what I am talking about. I don't follow flume, but I'd propose to soften your
Re: Flume Graduation (was Re: June reports in two weeks)
Hi, On Thu, May 24, 2012 at 12:19 AM, Ralph Goers ralph.go...@dslextreme.com wrote: The ONLY issue I see for Flume to graduate is diversity. No one will convince me that the current makeup constitutes diversity of any kind. Perhaps I shouldn't have brought up the mailing list issues as that was only meant in the spirit of trying to offer some advice on how more diversity could be achieved. Flume is really the only community I participate in that contains Cloudera employees so I do find myself wondering if the way the project is run is because that is the way all projects with a large number of Cloudera employees are run. That might make all of those participants comfortable but might create a barrier to others. Here are the committers who have been active in the past three months: * Brock Noland (Cloudera) * Hari Shreedharan (Cloudera) * Jarek Jarcec Cecho (AVG Technologies) * Juhani Connolly (CyberAgent) * Mike Percy (Cloudera) * Mingjie Lai (Trend Micro) * Prasad Mujumdar (Cloudera) * Will McQueen (Cloudera) * Arvind Prabhakar (Cloudera) There are four companies represented in this list: AVG Technologies, Cloudera, CyberAgent and Trend Micro. Compared to other projects that have successfully graduated from Incubator in the past, this meets the diversity requirements very well. In any case - I'm not insisting that the way the project is run needs to change. I'm simply saying I cannot support graduation with the current makeup of the committers and PMC. I don't have a hard and fast ratio - gaining 10 new unaffiliated committers who don't do much isn't nearly as good as 2 or 3 who are very active. Ultimately the project needs to figure out how to solve this. Stating that some committers who don't do much isn't nearly as good as 2 or 3 who are very active is an unfair characterization. 
This is more unfair for those who are part of the project but have not been active lately due to whatever reasons, but have played a foundational role in getting the project to a point where it is today. I think they are as important as any other committer who may be very active at the moment. Merit once earned, never expires [1]. [1] http://www.apache.org/dev/committers.html#committer-set-term Arvind Ralph On May 23, 2012, at 11:48 PM, Eric Sammer wrote: I appreciate your position Ralph and I don't want anyone to feel like they can't contribute. As we've talked about before, we've been quick to nurture new contributors to committer status successfully in a few cases. It's true that some of the more active committers are from Cloudera, but it's not to the exclusion of anyone. Others aren't from Cloudera. Those of us that work together are also very strict about abiding by the "if it's not on the mailing list, it didn't happen" rule (where mailing list can mean JIRA or other ASF infrastructure as well). I'm happy to take your guidance as a mentor, but you also need to understand that some of the ways the Flume project has elected to operate are just a matter of taste. They were proposed, discussed, voted on (and not as a bloc by Cloudera employees, IIRC - pretty sure I was -0), and put in place and do not violate the Apache Way (like RTC vs. CTR). They aren't unheard of and they do not work to the exclusion of contributors (RTC, for instance, only impacts committers). I think the vote that was started was only to gauge community opinion as a first step (although I'm not completely well versed in the graduation process, to be honest). If there are concrete things we can do to improve diversity, in your opinion, I am extremely open to hearing them. We already do many of the (excellent) things listed earlier in the thread. 
JIRA noise notwithstanding (again, it's a matter of taste - I use the email frequently as I find trolling through JIRA slow) I'm definitely open to ideas. Of course, if Flume simply needs to remain in the incubator until we develop greater diversity, that's fine too. If we're not ready, we're just not ready. On Wed, May 23, 2012 at 11:18 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:48 PM, Patrick Hunt wrote: On Wed, May 23, 2012 at 10:36 PM, Ralph Goers ralph.go...@dslextreme.com wrote: On May 23, 2012, at 10:15 PM, Benson Margulies wrote: On Wed, May 23, 2012 at 10:09 PM, Ralph Goers ralph.go...@dslextreme.com wrote: Right after I read Jukka's email that started this thread and I posted my reply and discovered to my shock that they had started a graduation vote. I am shocked because I have pointed out repeatedly the project's complete lack of diversity. Virtually all the active PMC members and committers work for the same employer. I have told them several times that I would actually like to participate in the project but the way the project works is very different than every other project I am involved with at the ASF and the barriers
Re: Flume Graduation (was Re: June reports in two weeks)
On May 24, 2012, at 10:40 AM, Arvind Prabhakar wrote: Hi, On Thu, May 24, 2012 at 12:19 AM, Ralph Goers ralph.go...@dslextreme.com wrote: The ONLY issue I see for Flume to graduate is diversity. No one will convince me that the current makeup constitutes diversity of any kind. Perhaps I shouldn't have brought up the mailing list issues as that was only meant in the spirit of trying to offer some advice on how more diversity could be achieved. Flume is really the only community I participate in that contains Cloudera employees so I do find myself wondering if the way the project is run is because that is the way all projects with a large number of Cloudera employees are run. That might make all of those participants comfortable but might create a barrier to others. Here are the committers who have been active in the past three months: * Brock Noland (Cloudera) * Hari Shreedharan (Cloudera) * Jarek Jarcec Cecho (AVG Technologies) * Juhani Connolly (CyberAgent) * Mike Percy (Cloudera) * Mingjie Lai (Trend Micro) * Prasad Mujumdar (Cloudera) * Will McQueen (Cloudera) * Arvind Prabhakar (Cloudera) There are four companies represented in this list: AVG Technologies, Cloudera, CyberAgent and Trend Micro. Compared to other projects that have successfully graduated from Incubator in the past, this meets the diversity requirements very well. I was mistaken and the list above is indeed correct. For some reason I thought a couple of them had become Cloudera employees. However, none of those three are currently on the PPMC. When you look at the PPMC list you should also include a few more Cloudera people who do participate in release votes and PPMC issues. Most, if not all, of the non-Cloudera PMC members don't. In any case - I'm not insisting that the way the project is run needs to change. I'm simply saying I cannot support graduation with the current makeup of the committers and PMC. 
I don't have a hard and fast ratio - gaining 10 new unaffiliated committers who don't do much isn't nearly as good as 2 or 3 who are very active. Ultimately the project needs to figure out how to solve this. Stating that some committers who don't do much isn't nearly as good as 2 or 3 who are very active is an unfair characterization. This is more unfair for those who are part of the project but have not been active lately due to whatever reasons, but have played a foundational role in getting the project to a point where it is today. I think they are as important as any other committer who may be very active at the moment. Merit once earned, never expires [1]. [1] http://www.apache.org/dev/committers.html#committer-set-term I think you misunderstood my point or I didn't state it very well. Diversity isn't achieved simply by having bodies. IOW I am not suggesting offering commit rights to people who haven't earned it just to meet some ratio. However, I am not suggesting the project has ever even considered doing that. Ralph
Re: [VOTE] Accept Crunch into the Apache Incubator
+1 Doug On 05/23/2012 11:45 AM, Josh Wills wrote: I would like to call a vote for accepting Apache Crunch for incubation in the Apache Incubator. The full proposal is available below. We ask the Incubator PMC to sponsor it, with phunt as Champion, and phunt, tomwhite, and acmurthy volunteering to be Mentors. Please cast your vote: [ ] +1, bring Crunch into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Crunch into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. http://wiki.apache.org/incubator/CrunchProposal Proposal text from the wiki: -- = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala = == Abstract == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. == Proposal == Crunch is a Java library for writing, testing, and running pipelines of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a high-level API for writing and testing complex !MapReduce jobs that require multiple processing stages. It has a simple, flexible, and extensible data model that makes it ideal for processing data that does not naturally fit into a relational structure, such as time series and serialized object formats like JSON and Avro. It supports running pipelines either as a series of !MapReduce jobs on an Apache Hadoop cluster or in memory on a single machine for fast testing and debugging. == Background == Crunch was initially developed by Cloudera to simplify the process of creating sequences of dependent !MapReduce jobs, especially jobs that processed non-relational data like time series. Its design was based on a paper Google published about a Java library they developed called !FlumeJava that was created in order to solve a similar class of problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache 2.0 licensed project in October 2011. 
During this time Crunch has been formally released twice, as versions 0.1.0 (October 2010) and 0.2.0 (February 2012), with an incremental update to version 0.2.1 (March 2012) . These releases are also distributed by Cloudera as source and binaries from Cloudera's Maven repository. == Rationale == Most of the interesting analytical and data processing tasks that are run on an Apache Hadoop cluster require a series of !MapReduce jobs to be executed in sequence. Developers who are creating these pipelines today need to manually assign the sequence of tasks to perform in a dependent chain of !MapReduce jobs, even though there are a number of well-known patterns for fusing dependent computations together into a single !MapReduce stage and for performing common types of joins and aggregations. This results in !MapReduce pipelines that are more difficult to test, maintain, and extend to support new functionality. Furthermore, the type of data that is being stored and processed using Apache Hadoop is evolving. Although Hadoop was originally used for storing large volumes of structured text in the form of webpages and log files, it is now common for Hadoop to store complex, structured data formats such as JSON, Apache Avro, and Apache Thrift. These formats allow developers to work with serialized objects in programming languages like Java, C++, and Python, and allow for new types of analysis to be performed on complex data types. Hadoop has also been adopted by the scientific research community, who are using Hadoop to process time series data, structured binary files in the HDF5 format, and large medical and satellite images. Crunch addresses these challenges by providing a lightweight and extensible Java API for defining the stages of a data processing pipeline, which can then be run on an Apache Hadoop cluster as a sequence of dependent !MapReduce jobs, or in-memory on a single machine to facilitate fast testing and debugging. 
Crunch relies on a small set of primitive abstractions that represent immutable, distributed collections of objects. Developers define functions that are applied to those objects in order to generate new immutable, distributed collections of objects. Crunch also provides a library of common !MapReduce patterns for performing efficient joins and aggregation operations over these distributed collections that developers may integrate into their own pipelines. Crunch also provides native support for processing structured binary data formats like JSON, Apache Avro, and Apache Thrift, and is designed to be extensible to support working with any kind of data format that Java supports in its native form. == Initial Goals == Crunch is currently in its first major release with a considerable number of enhancement requests, tasks, and issues recorded towards its future development. The initial goal of this project will be to continue to build community in the spirit of the Apache
Re: Flume Graduation (was Re: June reports in two weeks)
On May 24, 2012, at 11:49 AM, Ralph Goers wrote: On May 24, 2012, at 10:40 AM, Arvind Prabhakar wrote: Hi, On Thu, May 24, 2012 at 12:19 AM, Ralph Goers ralph.go...@dslextreme.com wrote: The ONLY issue I see for Flume to graduate is diversity. No one will convince me that the current makeup constitutes diversity of any kind. Perhaps I shouldn't have brought up the mailing list issues as that was only meant in the spirit of trying to offer some advice on how more diversity could be achieved. Flume is really the only community I participate in that contains Cloudera employees so I do find myself wondering if the way the project is run is because that is the way all projects with a large number of Cloudera employees are run. That might make all of those participants comfortable but might create a barrier to others. Here are the committers who have been active in the past three months: * Brock Noland (Cloudera) * Hari Shreedharan (Cloudera) * Jarek Jarcec Cecho (AVG Technologies) * Juhani Connolly (CyberAgent) * Mike Percy (Cloudera) * Mingjie Lai (Trend Micro) * Prasad Mujumdar (Cloudera) * Will McQueen (Cloudera) * Arvind Prabhakar (Cloudera) There are four companies represented in this list: AVG Technologies, Cloudera, CyberAgent and Trend Micro. Compared to other projects that have successfully graduated from Incubator in the past, this meets the diversity requirements very well. I was mistaken and the list above is indeed correct. For some reason I thought a couple of them had become Cloudera employees. However, none of those three are currently on the PPMC. When you look at the PPMC list you should also include a few more Cloudera people who do participate in release votes and PPMC issues. Most, if not all, of the non-Cloudera PMC members don't. I started reading some of the Flume website. From the main Wiki page: https://cwiki.apache.org/confluence/display/FLUME/Index when you click on the Flume Cookbook, the resource is at cloudera.org. 
http://archive.cloudera.com/cdh/3/flume/Cookbook/ This page lists flume-...@cloudera.org and is a file with a revision dated May 7, 2012. You can draw your own conclusions, but it looks like podling resources need to be migrated to the ASF. Regards, Dave In any case - I'm not insisting that the way the project is run needs to change. I'm simply saying I cannot support graduation with the current makeup of the committers and PMC. I don't have a hard and fast ratio - gaining 10 new unaffiliated committers who don't do much isn't nearly as good as 2 or 3 who are very active. Ultimately the project needs to figure out how to solve this. Stating that some committers who don't do much isn't nearly as good as 2 or 3 who are very active is an unfair characterization. This is more unfair for those who are part of the project but have not been active lately due to whatever reasons, but have played a foundational role in getting the project to a point where it is today. I think they are as important as any other committer who may be very active at the moment. Merit once earned, never expires [1]. [1] http://www.apache.org/dev/committers.html#committer-set-term I think you misunderstood my point or I didn't state it very well. Diversity isn't achieved simply by having bodies. IOW I am not suggesting offering commit rights to people who haven't earned it just to meet some ratio. However, I am not suggesting the project has ever even considered doing that. Ralph
Re: Invitation to join Apache Kafka as a committer
Works now.

Thanks,
Joel

On Wed, May 23, 2012 at 8:08 PM, Kevan Miller kevan.mil...@gmail.com wrote:

On May 23, 2012, at 10:11 PM, Alan D. Cabrera wrote:

-kafka-private +kafka-dev +general

Ahh, account was only created. According to root:

Only PMC chairs can grant karma. If needed, please post to the general@ / dev@ / private@ list of your project asking for someone with sufficient karma to grant access to 'jjkoshy'.

Sorry about this confusion. I don't have the necessary karma and am so used to other mentors having it that I forgot that the above step needed to be done. Can someone in the IPMC perform the needful? Thanks!

cc: incubator general

Done.

--kevan
Re: Flume Graduation (was Re: June reports in two weeks)
According to Clutch [1] the project has added 8 committers since it entered incubation. Regarding diversity, committers from over four organizations are actively involved in Flume development, which is pretty healthy. There does seem to be a need to have more diversity at the PPMC level, however, so that's something that could be worked on.

Tom

[1] http://incubator.apache.org/clutch.html

On Thu, May 24, 2012 at 2:06 PM, Dave Fisher dave2w...@comcast.net wrote: