Re: Begin a discussion about Pig as a top level project

Daniel Dai Mon, 05 Apr 2010 15:03:18 -0700

I agree with the stance that we remain in Hadoop until we see more
compelling reasons, such as Pig going beyond Hadoop. Currently I cannot
fully weigh the advantages and disadvantages of becoming a TLP. But
provided this is a point of no return, I don't want to move unless we
have a strong motivation. We can always choose to become a TLP later,
when we feel more convinced of that.


Daniel


--------------------------------------------------
From: "Santhosh Srinivasan" <s...@yahoo-inc.com>
Sent: Monday, April 05, 2010 12:22 PM
To: <pig-dev@hadoop.apache.org>
Subject: RE: Begin a discussion about Pig as a top level project

"Given that, do you think it makes
sense to say that Pig stays a subproject for now, but if it someday
grows beyond Hadoop only it becomes a TLP?  I could agree to that
stance."

Bingo!

Santhosh

-----Original Message-----
From: Alan Gates [mailto:ga...@yahoo-inc.com]
Sent: Monday, April 05, 2010 11:37 AM
To: pig-dev@hadoop.apache.org
Subject: Re: Begin a discussion about Pig as a top level project

Prognostication is a difficult business.  Of course I'd love it if
someday there is an ISO Pig Latin committee (with meetings in cool
exotic places) deciding the official standard for Pig Latin.  But that
seems like saying in your start up's business plan, "When we reach
Google's size, then we'll do x".  If there ever is an ISO Pig Latin
standard it will be years off.

As others have noted, staying tight to Hadoop now has many advantages,
both in technical and adoption terms.  Hence my advocacy of keeping
Pig Latin Hadoop agnostic while tightly integrating the backend.
Which is to say that in my view, Pig is Hadoop specific now, but there
may come a day when that is no longer true.   Whether Pig will ever
move past just running on Hadoop to running in other parallel systems
won't be known for years to come.  Given that, do you think it makes
sense to say that Pig stays a subproject for now, but if it someday
grows beyond Hadoop only it becomes a TLP?  I could agree to that
stance.

Alan.

On Apr 3, 2010, at 12:43 PM, Santhosh Srinivasan wrote:

I see this as a multi-part question. Looking back at some of the
significant roadmap/existential questions asked in the last 12 months,
I see the following:

1. With the introduction of SQL, what is the philosophy of Pig (I sent
an email about this approximately 9 months ago)
2. What is the approach to support backward compatibility in Pig (Alan
had sent an email about this 3 months ago)
3. Should Pig be a TLP (the current email thread).

Here is my take on answering the aforementioned questions.

The initial philosophy of Pig was to be backend agnostic. It was
designed as a data flow language. Whenever a new language is designed,
the syntax and semantics of the language have to be laid out. The
syntax is usually captured in the form of a BNF grammar. The semantics
are defined by the language creators. Backward compatibility is then a
question of holding true to the syntax and semantics. With Pig, in
addition to the language, the Java APIs were exposed to customers to
implement UDFs (load/store/filter/grouping/row transformation etc),
provision looping since the language does not support looping
constructs and also support a programmatic mode of access. Backward
compatibility in this context is to support API versioning.

Do we still intend to position Pig as a data flow language that is
backend agnostic? If the answer is yes, then there is a strong case for
making Pig a TLP.

Are we influenced by Hadoop? A big YES! The reason Pig chose to become
a Hadoop sub-project was to ride the Hadoop popularity wave. As a
consequence, we chose to be heavily influenced by the Hadoop roadmap.

Like a good lawyer, I also have rebuttals to Alan's questions :)

1. Search engine popularity - We can discuss this with the Hadoop team
and still retain links to TLPs that are coupled (loosely or tightly).
2. Explicit connection to Hadoop - I see this as logical connection
vs. physical connection. Today, we are physically connected as a
sub-project. Becoming a TLP will not increase/decrease our influence
on the Hadoop community (think Logical, Physical and MR Layers :)
3. Philosophy - I have already talked about this. The tight coupling
is by choice. If Pig continues to be a data flow language with clear
syntax and semantics then someone can implement Pig on top of a
different backend. Do we intend to take this approach?

I just wanted to offer a different opinion to this thread. I strongly
believe that we should think about the original philosophy. Will we
have a Pig standards committee that will decide on the changes to the
language (think C/C++) if there are multiple backend implementations?

I will reserve my vote based on the outcome of the philosophy and
backward compatibility discussions. If we decide that Pig will be
treated and maintained like a true language with clear syntax and
semantics then we have a strong case to make it into a TLP. If not, we
should retain our existing ties to Hadoop and make Pig into a data
flow language for Hadoop.

Santhosh

-----Original Message-----
From: Thejas Nair [mailto:te...@yahoo-inc.com]
Sent: Friday, April 02, 2010 4:08 PM
To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy
Subject: Re: Begin a discussion about Pig as a top level project

I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop,
and heavily influenced by its roadmap. I think it makes sense to
continue as a sub-project of hadoop.

-Thejas



On 3/31/10 4:04 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:

Over time, Pig is increasing its coupling to Hadoop (for good
reasons), rather than decreasing it. If and when Pig becomes a viable
entity without hadoop around, it might make sense as a TLP. As is, I
think becoming a TLP will only introduce unnecessary administrative
and bureaucratic headaches.

So my vote is also -1.

-Dmitriy



On Wed, Mar 31, 2010 at 2:38 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

So far I haven't seen any feedback on this.  Apache has asked the
Hadoop PMC to submit input in April on whether some subprojects
should be promoted to TLPs.  We, the Pig community, need to give
feedback to the Hadoop PMC on how we feel about this.  Please make
your voice heard.


So now I'll heed my own call and give my thoughts on it.

The biggest advantage I see to being a TLP is a direct connection to
Apache.  Right now all of the Pig team's interaction with Apache is
through the Hadoop PMC.  Being directly connected to Apache would
benefit Pig team members who would have a better view into Apache.
It would also raise our profile in Apache and thus make other
projects more aware of us.


However, I am concerned about losing Pig's explicit connection to
Hadoop.  This concern has a couple of dimensions.  One, Hadoop and
MapReduce are the current flavor of the month in computing.  Given
that Pig shares a name with the common farm animal, it's hard to be
sure based on search statistics.  But Google Trends shows that
"hadoop" is searched on much more frequently than "hadoop pig" or
"apache pig" (see
http://www.google.com/trends?q=hadoop%2Chadoop+pig).  I am guessing
that most Pig users come from Hadoop users who discover Pig via
Hadoop's website.  Losing that subproject tab on Hadoop's front page
may radically lower the number of users coming to Pig to check out our
project.  I would argue that the connection benefits Hadoop as well,
since high level languages like Pig Latin have the potential to
greatly extend the user base and usability of Hadoop.


Two, being explicitly connected to Hadoop keeps our two communities
aware of each other's needs.  There are features proposed for MR that
would greatly help Pig.  By staying in the Hadoop community Pig is
better positioned to advocate for and help implement and test those
features.  The response to this will be that Pig developers can still
subscribe to Hadoop mailing lists, submit patches, etc.  That is,
they can still be part of the Hadoop community.  Which reinforces my
point that it makes more sense to leave Pig in the Hadoop community
since Pig developers will need to be part of that community anyway.

Finally, philosophically it makes sense to me that projects that are
tightly connected belong together.  It strikes me as strange to have
Pig as a TLP completely dependent on another TLP.  Hadoop was
originally a subproject of Lucene.  It moved out to be a TLP when it
became obvious that Hadoop had become independent of and useful apart
from Lucene.  Pig is not in that position relative to Hadoop.

So, I'm -1 on Pig moving out.  But this is a soft -1.  I'm open to
being persuaded that I'm wrong or my concerns can be addressed while
still having Pig as a TLP.

Alan.


On Mar 19, 2010, at 10:59 AM, Alan Gates wrote:

You have probably heard by now that there is a discussion going on in
the Hadoop PMC as to whether a number of the subprojects (Hbase, Avro,
Zookeeper, Hive, and Pig) should move out from under the Hadoop
umbrella and become top level Apache projects (TLP).  This
discussion has picked up recently since the Apache board has clearly
communicated to the Hadoop PMC that it is concerned that Hadoop is
acting as an umbrella project with many disjoint subprojects
underneath it.  They are concerned that this gives Apache little
insight into the health and happenings of the subproject communities
which in turn means Apache cannot properly mentor those communities.

The purpose of this email is to start a discussion within the Pig
community about this topic.  Let me cover first what becoming TLP
would mean for Pig, and then I'll go into what options I think we as
a community have.


Becoming a TLP would mean that Pig would itself have a PMC that
would report directly to the Apache board.  Who would be on the PMC
would be something we as a community would need to decide.  Common
options would be to say all active committers are on the PMC, or all
active committers who have been a committer for at least a year.  We
would also need to elect a chair of the PMC.  This lucky person
would have no additional power, but would have the additional
responsibility of writing quarterly reports on Pig's status for
Apache board meetings, as well as coordinating with Apache to get
accounts for new committers, etc.  For more information see
http://www.apache.org/foundation/how-it-works.html#roles

Becoming a TLP would not mean that we are ostracized from the Hadoop
community.  We would continue to be invited to Hadoop Summits, HUGs,
etc.  Since all Pig developers and users are by definition Hadoop
users, we would continue to be a strong presence in the Hadoop
community.

I see three ways that we as a community can respond to this:

1) Say yes, we want to be a TLP now.
2) Say yes, we want to be a TLP, but not yet.  We feel we need more
time to mature.  If we choose this option we need to be able to
clearly articulate how much time we need and what we hope to see
change in that time.
3) Say no, we feel the benefits for us staying with Hadoop outweigh
the drawbacks of being a disjoint subproject.  If we choose this, we
need to be able to say exactly what those benefits are and why we
feel they will be compromised by leaving the Hadoop project.

There may be other options that I haven't thought of.  Please feel
free to suggest any you think of.

Questions?  Thoughts?  Let the discussion begin.

Alan.
