Re: [PROPOSAL] Flume for the Apache Incubator

Upayavira Tue, 31 May 2011 12:29:26 -0700


On Tue, 31 May 2011 11:53 -0700, "Jonathan Hsieh" <j...@cloudera.com>
wrote:
> Hi,
> 
> I have a few questions.
> 
> My understanding is that a podling requires 3 +1s for progress (releases,
> new committers).  Does this mean we need at least 3 mentors?  Would it be
> helpful to have "extra"?


Strictly, you do not need three mentors, but having three means you (in
theory) have three people with binding votes watching your progress,
which makes the necessary votes much easier to get, so having 'extra' can
help. Too many mentors, though, can lead each of them to assume that
someone else is doing the work. Three does seem to be an optimal number.

> Is having the Champion being a Mentor ok?

Yes, it is fine.
 
> Are there any concerns/discussion with the proposal?  (or are +1's
> basically saying lgtm.)

I think the +1s are saying that (although I haven't read the proposal).

Upayavira


> On Tue, May 31, 2011 at 4:13 AM, Mohammad Nour El-Din <
> nour.moham...@gmail.com> wrote:
> 
> > +1 (binding)
> >
> > On Tue, May 31, 2011 at 11:59 AM, Mark Struberg <strub...@yahoo.de> wrote:
> > > +1
> > >
> > > LieGrue,
> > > strub
> > >
> > > --- On Mon, 5/30/11, Yoav Shapira <yo...@apache.org> wrote:
> > >
> > >> From: Yoav Shapira <yo...@apache.org>
> > >> Subject: Re: [PROPOSAL] Flume for the Apache Incubator
> > >> To: general@incubator.apache.org
> > >> Date: Monday, May 30, 2011, 11:18 PM
> > >> On Fri, May 27, 2011 at 10:18 AM,
> > >> Jonathan Hsieh <j...@cloudera.com>
> > >> wrote:
> > >> > I would like to propose Flume to be an Apache
> > >> Incubator project.  Flume is a
> > >> > distributed, reliable, and available system for
> > >> efficiently collecting,
> > >> > aggregating, and moving large amounts of log data to
> > >> scalable data storage
> > >> > systems such as Apache Hadoop's HDFS.
> > >> >
> > >> > Here's a link to the proposal in the Incubator wiki
> > >> > http://wiki.apache.org/incubator/FlumeProposal
> > >>
> > >> +1, cool stuff.
> > >>
> > >> Yoav
> > >>
> > >> >
> > >> > I've also pasted the initial contents below.
> > >> >
> > >> > Thanks!
> > >> > Jon.
> > >> >
> > >> > = Flume - A Distributed Log Collection System =
> > >> >
> > >> > == Abstract ==
> > >> >
> > >> > Flume is a distributed, reliable, and available system
> > >> for efficiently
> > >> > collecting, aggregating, and moving large amounts of
> > >> log data to scalable
> > >> > data storage systems such as Apache Hadoop's HDFS.
> > >> >
> > >> > == Proposal ==
> > >> >
> > >> > Flume is a distributed, reliable, and available system
> > >> for efficiently
> > >> > collecting, aggregating, and moving large amounts of
> > >> log data from many
> > >> > different sources to a centralized data store. Its
> > >> main goal is to deliver
> > >> > data from applications to Hadoop’s HDFS.  It has a
> > >> simple and flexible
> > >> > architecture for transporting streaming event data via
> > >> flume nodes to the
> > >> > data store.  It is robust and fault-tolerant with
> > >> tunable reliability
> > >> > mechanisms that rely upon many failover and recovery
> > >> mechanisms. The system
> > >> > is centrally configured and allows for intelligent
> > >> dynamic management. It
> > >> > uses a simple extensible data model that allows for
> > >> lightweight online
> > >> > analytic applications.  It provides a pluggable
> > >> mechanism by which new
> > >> > sources, destinations, and analytic functions
> > >> can be integrated within
> > >> > a Flume pipeline.
> > >> >
> > >> > == Background ==
> > >> >
> > >> > Flume was initially developed by Cloudera to enable
> > >> reliable and simplified
> > >> > collection of log information from many distributed
> > >> sources. It was later
> > >> > open-sourced by Cloudera on GitHub as an Apache 2.0
> > >> licensed project in June
> > >> > 2010. During this time Flume has been formally
> > >> released five times as
> > >> > versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1
> > >> (Oct 2010), 0.9.2 (Nov
> > >> > 2010), and 0.9.3 (Feb 2011).  These releases are also
> > >> distributed by
> > >> > Cloudera as source and binaries along with
> > >> enhancements as part of Cloudera
> > >> > Distribution including Apache Hadoop (CDH).
> > >> >
> > >> > == Rationale ==
> > >> >
> > >> > Collecting log information in a data center in a
> > >> timely, reliable, and
> > >> > efficient manner is a difficult challenge but
> > >> important because when
> > >> > aggregated and analyzed, log information can yield
> > >> valuable business
> > >> > insights.   We believe that users and operators need
> > >> a manageable systematic
> > >> > approach for log collection that simplifies the
> > >> creation, the monitoring,
> > >> > and the administration of reliable log data pipelines.
> > >>  Oftentimes today,
> > >> > this collection is attempted by periodically shipping
> > >> data in batches and by
> > >> > using potentially unreliable and inefficient ad-hoc
> > >> methods.
> > >> >
> > >> > Log data is typically generated in various systems
> > >> running within a data
> > >> > center that can range from a few machines to hundreds
> > >> of machines.  In
> > >> > aggregate, the data acts like a large-volume
> > >> continuous stream with contents
> > >> > that can have highly-varied format and highly-varied
> > >> content.  The volume
> > >> > and variety of raw log data makes Apache Hadoop's HDFS
> > >> file system an ideal
> > >> > storage location before the eventual analysis.
> > >>  Unfortunately, HDFS has
> > >> > limitations with regards to durability as well as
> > >> scaling limitations when
> > >> > handling a large number of low-bandwidth connections
> > >> or small files.
> > >> >  Similar technical challenges are also suffered when
> > >> attempting to write
> > >> > data to other data storage services.
> > >> >
> > >> > Flume addresses these challenges by providing a
> > >> reliable, scalable,
> > >> > manageable, and extensible solution.  It uses a
> > >> streaming design for
> > >> > capturing and aggregating log information from varied
> > >> sources in a
> > >> > distributed environment and has centralized management
> > >> features for minimal
> > >> > configuration and management overhead.
> > >> >
> > >> > == Initial Goals ==
> > >> >
> > >> > Flume is currently in its first major release with a
> > >> considerable number of
> > >> > enhancement requests, tasks, and issues recorded
> > >> towards its future
> > >> > development. The initial goal of this project will be
> > >> to continue to build
> > >> > community in the spirit of the "Apache Way", and to
> > >> address the highly
> > >> > requested features and bug-fixes towards the next dot
> > >> release.
> > >> >
> > >> > Some goals include:
> > >> > * To stand up a sustaining Apache-based community
> > >> around the Flume codebase.
> > >> > * Implementing core functionality of a usable
> > >> highly-available Flume master.
> > >> > * Performance, usability, and robustness
> > >> improvements.
> > >> > * Improving the ability to monitor and diagnose
> > >> problems as data is
> > >> > transported.
> > >> > * Providing a centralized place for contributed
> > >> connectors and related
> > >> > projects.
> > >> >
> > >> > = Current Status =
> > >> >
> > >> > == Meritocracy ==
> > >> >
> > >> > Flume was initially developed by Jonathan Hsieh in
> > >> July 2009 along with
> > >> > the development team at Cloudera. Developers external to
> > >> Cloudera provided
> > >> > feedback, suggested features and fixes and implemented
> > >> extensions of Flume.
> > >> > The Cloudera engineering team has since maintained the
> > >> project with Jonathan
> > >> > Hsieh, Henry Robinson, and Patrick Hunt dedicated
> > >> towards its improvement.
> > >> > Contributors to Flume and its connectors include
> > >> developers from different
> > >> > companies and different parts of the world.
> > >> >
> > >> > == Community ==
> > >> >
> > >> > Flume is currently used by a number of organizations
> > >> all over the world.
> > >> > Flume has an active and growing user and developer
> > >> community with active
> > >> > participation in [user|
> > >> > https://groups.google.com/a/cloudera.org/group/flume-user/topics]
> > >> and
> > >> > [developer|
> > >> > https://groups.google.com/a/cloudera.org/group/flume-dev/topics]
> > >> > mailing lists.  The users and developers also
> > >> communicate via IRC on #flume
> > >> > at irc.freenode.net.
> > >> >
> > >> > Since open sourcing the project, there have been over
> > >> 15 different people
> > >> > from diverse organizations who have contributed code.
> > >> During this period,
> > >> > the project team has hosted open, in-person, quarterly
> > >> meetups to discuss
> > >> > new features, new designs, and new use-case stories.
> > >> >
> > >> > == Core Developers ==
> > >> >
> > >> > The core developers for the Flume project are:
> > >> >  * Andrew Bayer: Andrew has a lot of expertise with
> > >> build tools,
> > >> > specifically Jenkins continuous integration and
> > >> Maven.
> > >> >  * Jonathan Hsieh: Jonathan designed and implemented
> > >> much of the original
> > >> > code.
> > >> >  * Patrick Hunt: Patrick has improved the web
> > >> interfaces of Flume components
> > >> > and contributed several build quality improvements.
> > >> >  * Bruce Mitchener: Bruce has improved the internal
> > >> logging infrastructure
> > >> > as well as edited significant portions of the Flume
> > >> manual.
> > >> >  * Henry Robinson: Henry has implemented much of the
> > >> ZooKeeper integration,
> > >> > plugin mechanisms, as well as several Flume features
> > >> and bug fixes.
> > >> >  * Eric Sammer: Eric has implemented the Maven build,
> > >> as well as several
> > >> > Flume features and bug fixes.
> > >> >
> > >> > All core developers of the Flume project have
> > >> contributed towards Hadoop or
> > >> > related Apache projects and are very familiar with
> > >> Apache principles and
> > >> > philosophy for community-driven software development.
> > >> >
> > >> > == Alignment ==
> > >> >
> > >> > Flume complements Hadoop MapReduce, Pig, Hive, and HBase
> > >> by providing a robust
> > >> > mechanism to allow log data integration from external
> > >> systems for effective
> > >> > analysis.  Its design enables efficient integration of
> > >> newly ingested data to
> > >> > Hive's data warehouse.
> > >> >
> > >> > Flume's architecture is open and easily extensible.
> > >>  This has encouraged
> > >> > many users to contribute plugins that integrate with other
> > >> projects.  For example,
> > >> > several users have contributed connectors to message
> > >> queuing and bus
> > >> > services, to several open source data stores, to
> > >> incremental search indexes,
> > >> > and to stream analysis engines.
> > >> >
> > >> > = Known Risks =
> > >> >
> > >> > == Orphaned Products ==
> > >> >
> > >> > Flume is already deployed in production at multiple
> > >> companies and they are
> > >> > actively participating in feature requests and user
> > >> led discussions. Flume
> > >> > is getting traction with developers and thus the risks
> > >> of it being orphaned
> > >> > are minimal.
> > >> >
> > >> > == Inexperience with Open Source ==
> > >> >
> > >> > All code developed for Flume has been open sourced by
> > >> Cloudera under the Apache
> > >> > 2.0 license.  All committers of the Flume project are
> > >> intimately familiar with
> > >> > the Apache model for open-source development and are
> > >> experienced with
> > >> > working with new contributors.
> > >> >
> > >> > == Homogeneous Developers ==
> > >> >
> > >> > The initial set of committers is from a reduced set of
> > >> organizations.
> > >> > However, we expect that once approved for incubation,
> > >> the project will
> > >> > attract new contributors from diverse organizations
> > >> and will thus grow
> > >> > organically. The participation of developers from
> > >> several different
> > >> > organizations in the mailing list is a strong
> > >> indication for this assertion.
> > >> >
> > >> > == Reliance on Salaried Developers ==
> > >> >
> > >> > It is expected that Flume will be developed on
> > >> salaried and volunteer time,
> > >> > although all of the initial developers will work on it
> > >> mainly on salaried
> > >> > time.
> > >> >
> > >> > == Relationships with Other Apache Products ==
> > >> >
> > >> > Flume depends upon other Apache Projects: Apache
> > >> Hadoop, Apache Log4J,
> > >> > Apache ZooKeeper, Apache Thrift, Apache Avro, and multiple
> > >> Apache Commons
> > >> > components. Its build depends upon Apache Ant and
> > >> Apache Maven.
> > >> >
> > >> > Flume users have created connectors that interact with
> > >> several other Apache
> > >> > projects including Apache HBase and Apache Cassandra.
> > >> >
> > >> > Flume's functionality has some indirect or direct
> > >> overlap with the
> > >> > functionality of Apache Chukwa but has several
> > >> significant architectural
> > >> > differences.  Both systems can be used to collect
> > >> log data to write to
> > >> > HDFS.  However, Chukwa's primary goals are the
> > >> analytic and monitoring
> > >> > aspects of a Hadoop cluster.  Instead of focusing on
> > >> analytics, Flume
> > >> > focuses primarily upon data transport and integration
> > >> with a wide set of
> > >> > data sources and data destinations.
> > >> Architecturally, Chukwa components are
> > >> > individually and statically configured.  It also
> > >> depends upon Hadoop
> > >> > MapReduce for its core functionality.  In contrast,
> > >> Flume's components are
> > >> > dynamically and centrally configured and do not
> > >> depend directly upon
> > >> > Hadoop MapReduce.  Furthermore, Flume provides a more
> > >> general model for
> > >> > handling data and enables integration with projects
> > >> such as Apache Hive,
> > >> > data stores such as Apache HBase, Apache Cassandra and
> > >> Voldemort, and
> > >> > several Apache Lucene-related projects.
> > >> >
> > >> > == An Excessive Fascination with the Apache Brand ==
> > >> >
> > >> > We would like Flume to become an Apache project to
> > >> further foster a healthy
> > >> > community of contributors and consumers around the
> > >> project.  Since Flume
> > >> > directly interacts with many Apache Hadoop-related
> > >> projects and solves an
> > >> > important problem of many Hadoop users, residing in
> > >> the Apache Software
> > >> > Foundation will increase interaction with the larger
> > >> community.
> > >> >
> > >> > = Documentation =
> > >> >
> > >> >  * All Flume documentation (User Guide, Developer
> > >> Guide, Cookbook, and
> > >> > Windows Guide) is maintained within Flume sources and
> > >> can be built directly.
> > >> >  * Cloudera provides documentation specific to its
> > >> distribution of Flume at:
> > >> > http://archive.cloudera.com/cdh/3/flume/
> > >> >  * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
> > >> >  * Flume jira at Cloudera: https://issues.cloudera.org/browse/flume
> > >> >
> > >> > = Initial Source =
> > >> >
> > >> >  * https://github.com/cloudera/flume/tree/
> > >> >
> > >> > == Source and Intellectual Property Submission Plan
> > >> ==
> > >> >
> > >> >  * The initial source is already licensed under the
> > >> Apache License, Version
> > >> > 2.0. https://github.com/cloudera/flume/blob/master/LICENSE
> > >> >
> > >> > == External Dependencies ==
> > >> >
> > >> > The required external dependencies are all Apache
> > >> License or compatible
> > >> > licenses. The following components with non-Apache
> > >> licenses are enumerated:
> > >> >
> > >> >  * org.arabidopsis.ahocorasick : BSD-style
> > >> >
> > >> > Non-Apache build tools that are used by Flume are as
> > >> follows:
> > >> >
> > >> >  * AsciiDoc: GNU GPLv2
> > >> >  * FindBugs: GNU LGPL
> > >> >  * Cobertura: GNU GPLv2
> > >> >  * PMD : BSD-style
> > >> >
> > >> > == Cryptography ==
> > >> >
> > >> > Flume uses standard APIs and tools for SSH and SSL
> > >> communication where
> > >> > necessary.
> > >> >
> > >> > = Required Resources =
> > >> >
> > >> > == Mailing lists ==
> > >> >
> > >> >  * flume-private (with moderated subscriptions)
> > >> >  * flume-dev
> > >> >  * flume-commits
> > >> >  * flume-user
> > >> >
> > >> > == Subversion Directory ==
> > >> >
> > >> > https://svn.apache.org/repos/asf/incubator/flume
> > >> >
> > >> > == Issue Tracking ==
> > >> >
> > >> > JIRA Flume (FLUME)
> > >> >
> > >> > == Other Resources ==
> > >> >
> > >> > The existing code already has unit and integration
> > >> tests so we would like a
> > >> > Hudson instance to run them whenever a new patch is
> > >> submitted. This can be
> > >> > added after project creation.
> > >> >
> > >> > = Initial Committers =
> > >> >
> > >> >  * Andrew Bayer (abayer at cloudera dot com)
> > >> >  * Jonathan Hsieh (jon at cloudera dot com)
> > >> >  * Aaron Kimball (akimball83 at gmail dot com)
> > >> >  * Bruce Mitchener (bruce.mitchener at gmail dot
> > >> com)
> > >> >  * Arvind Prabhakar (arvind at cloudera dot com)
> > >> >  * Ahmed Radwan (ahmed at cloudera dot com)
> > >> >  * Henry Robinson (henry at cloudera dot com)
> > >> >  * Eric Sammer (esammer at cloudera dot com)
> > >> >
> > >> > = Affiliations =
> > >> >
> > >> >  * Andrew Bayer, Cloudera
> > >> >  * Jonathan Hsieh, Cloudera
> > >> >  * Aaron Kimball, Odiago
> > >> >  * Bruce Mitchener, Independent
> > >> >  * Arvind Prabhakar, Cloudera
> > >> >  * Ahmed Radwan, Cloudera
> > >> >  * Henry Robinson, Cloudera
> > >> >  * Eric Sammer, Cloudera
> > >> >
> > >> >
> > >> > = Sponsors =
> > >> >
> > >> > == Champion ==
> > >> >
> > >> >  * Nigel Daley
> > >> >
> > >> > == Nominated Mentors ==
> > >> >
> > >> >  * Tom White
> > >> >  * Nigel Daley
> > >> >
> > >> > == Sponsoring Entity ==
> > >> >
> > >> >  * Apache Incubator PMC
> > >> >
> > >> >
> > >> > --
> > >> > // Jonathan Hsieh (shay)
> > >> > // Software Engineer, Cloudera
> > >> > // j...@cloudera.com
> > >> >
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > >> For additional commands, e-mail: general-h...@incubator.apache.org
> > >>
> > >>
> > >
> > >
> > >
> >
> >
> >
> > --
> > Thanks
> > - Mohammad Nour
> >   Author of (WebSphere Application Server Community Edition 2.0 User Guide)
> >   http://www.redbooks.ibm.com/abstracts/sg247585.html
> > - LinkedIn: http://www.linkedin.com/in/mnour
> > - Blog: http://tadabborat.blogspot.com
> > ----
> > "Life is like riding a bicycle. To keep your balance you must keep moving"
> > - Albert Einstein
> >
> > "Writing clean code is what you must do in order to call yourself a
> > professional. There is no reasonable excuse for doing anything less
> > than your best."
> > - Clean Code: A Handbook of Agile Software Craftsmanship
> >
> > "Stay hungry, stay foolish."
> > - Steve Jobs
> >
> >
> >
> 
> 
> -- 
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // j...@cloudera.com
> 

