Re: Begin a discussion about Pig as a top level project

Alan Gates Mon, 05 Apr 2010 11:24:39 -0700

I agree that Pig's code documentation is in sad shape. I think our user documentation for each release is good, if limited. I hope that our documents on the wiki (such as PigJournal) help people understand our roadmap. Please let us know if you disagree so we can find ways to improve it.

That said, it isn't clear to me how Pig being a TLP will solve that. The current committers or some subset thereof (see original message) would become the PMC. Other than having expanded powers to vote on releases and who becomes new committers, the role of these new PMC members would not change much. They won't have any more time to address documentation and communication issues. We need to find a way to address those no matter what governance framework or community Pig is in.


Alan.

On Apr 5, 2010, at 9:02 AM, hc busy wrote:

This is awesome!!! As much as I hate PJM's for wasting time at all the
places that I've worked at, I think formalizing the management group (PMC) to
openly and clearly determine feature roadmap and dev schedule is the best
thing pig can have.

I once commented to my co-worker (also heavy pig user) that pig's
organization (with all due respect to all you hardworking people) is like a
pigsty! documentations all over the place, javadocs from three versions ago,
much of the documentation doesn't match actual features... links to the
download page is broken.
If you look at cascading's website... it's so much cleaner. (Of course... we
still use pig because it works well)
I think as TLP, pig will receive better marketing and better support in a
way that will propel it both in popularity and in the amount of support it
receives.

As a user, that change will be good for me.


On Sun, Apr 4, 2010 at 11:10 PM, Ashutosh Chauhan <
ashutosh.chau...@gmail.com> wrote:
I concur with Santhosh here. I think main question we need to answer
here is how close our ties are with Hadoop currently and how it will
be in future? When Pig was originally designed the intent was to keep
it backend neutral, so much so that there was a reference backend
implementation (also known as local engine) which had nothing to do
with Hadoop. But things have changed since then. Hadoop's local mode
is adopted in favor of Pig's own local mode. We have moved from being
backend agnostic to hadoop favoring. And while this was happening, it
seems we tried to keep Pig Latin language independent of hadoop
backend while Pig runtime started to make use of hadoop concepts.

Apart from design decisions, this move also has a practical impact on
our codebase. Since we adopted Hadoop more closely, we got rid of an
extra layer of abstraction and instead started using similar
abstractions already existing in Hadoop. This has a positive impact
that it simplified the codebase and provides tighter integration with
Hadoop.
So, if we are continuing in a direction where Hadoop is our only
backend (or at least a favored one), close ties to Hadoop are useful
because of the reasons Alan and Dmitriy pointed out. If not, then I
think moving out to TLP makes sense. Since there are no efforts that
I am aware of to plug in a different backend for Pig, I
think maintaining close ties with Hadoop is useful for Pig. In future,
when a different distributed computing platform comes up
which we want to use as backend, we can revisit our decision. So, as
things stand today, I am -1 on moving out of Hadoop.

And I would also like to reiterate my point that though Pig runtime
may continue to get closer to Hadoop, we shall keep Pig Latin
completely backend agnostic.

Ashutosh

On Sat, Apr 3, 2010 at 12:43, Santhosh Srinivasan <s...@yahoo-inc.com>
wrote:
I see this as a multi-part question. Looking back at some of the
significant roadmap/existential questions asked in the last 12 months, I
see the following:
1. With the introduction of SQL, what is the philosophy of Pig (I sent
an email about this approximately 9 months ago)
2. What is the approach to support backward compatibility in Pig (Alan
had sent an email about this 3 months ago)
3. Should Pig be a TLP (the current email thread).

Here is my take on answering the aforementioned questions.

The initial philosophy of Pig was to be backend agnostic. It was
designed as a data flow language. Whenever a new language is designed,
the syntax and semantics of the language have to be laid out. The syntax
is usually captured in the form of a BNF grammar. The semantics are
defined by the language creators. Backward compatibility is then a
question of holding true to the syntax and semantics. With Pig, in
addition to the language, the Java APIs were exposed to customers to
implement UDFs (load/store/filter/grouping/row transformation etc),
provision looping since the language does not support looping constructs
and also support a programmatic mode of access. Backward compatibility
in this context is to support API versioning.

Do we still intend to position as a data flow language that is backend
agnostic? If the answer is yes, then there is a strong case for making
Pig a TLP.

Are we influenced by Hadoop? A big YES! The reason Pig chose to become a
Hadoop sub-project was to ride the Hadoop popularity wave. As a
consequence, we chose to be heavily influenced by the Hadoop roadmap.

Like a good lawyer, I also have rebuttals to Alan's questions :)
1. Search engine popularity - We can discuss this with the Hadoop team
and still retain links to TLP's that are coupled (loosely or tightly).
2. Explicit connection to Hadoop - I see this as logical connection v/s
physical connection. Today, we are physically connected as a
sub-project. Becoming a TLP, will not increase/decrease our influence on
the Hadoop community (think Logical, Physical and MR Layers :)
3. Philosophy - I have already talked about this. The tight coupling is
by choice. If Pig continues to be a data flow language with clear syntax
and semantics then someone can implement Pig on top of a different
backend. Do we intend to take this approach?
I just wanted to offer a different opinion to this thread. I strongly
believe that we should think about the original philosophy. Will we have
a Pig standards committee that will decide on the changes to the
language (think C/C++) if there are multiple backend implementations?

I will reserve my vote based on the outcome of the philosophy and
backward compatibility discussions. If we decide that Pig will be
treated and maintained like a true language with clear syntax and
semantics then we have a strong case to make it into a TLP. If not, we
should retain our existing ties to Hadoop and make Pig into a data flow
language for Hadoop.

Santhosh

-----Original Message-----
From: Thejas Nair [mailto:te...@yahoo-inc.com]
Sent: Friday, April 02, 2010 4:08 PM
To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy
Subject: Re: Begin a discussion about Pig as a top level project
I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and
heavily influenced by its roadmap. I think it makes sense to continue as
a sub-project of hadoop.

-Thejas



On 3/31/10 4:04 PM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
Over time, Pig is increasing its coupling to Hadoop (for good
reasons), rather than decreasing it. If and when Pig becomes a viable
entity without hadoop around, it might make sense as a TLP. As is, I
think becoming a TLP will only introduce unnecessary administrative
and bureaucratic headaches.
So my vote is also -1.

-Dmitriy



On Wed, Mar 31, 2010 at 2:38 PM, Alan Gates <ga...@yahoo-inc.com>
wrote:
So far I haven't seen any feedback on this.  Apache has asked the
Hadoop PMC to submit input in April on whether some subprojects
should be promoted to TLPs.  We, the Pig community, need to give
feedback to the Hadoop PMC on how we feel about this.  Please make
your voice heard.
So now I'll heed my own call and give my thoughts on it.

The biggest advantage I see to being a TLP is a direct connection to
Apache. Right now all of the Pig team's interaction with Apache is
through the Hadoop PMC.  Being directly connected to Apache would
benefit Pig team members who would have a better view into Apache.
It would also raise our profile in Apache and thus make other
projects more aware of us.
However, I am concerned about losing Pig's explicit connection to
Hadoop.
This concern has a couple of dimensions. One, Hadoop and MapReduce
are the current flavor of the month in computing.  Given that Pig
shares a name with the common farm animal, it's hard to be sure based
on search statistics.
But Google trends shows that "hadoop" is searched on much more
frequently than "hadoop pig" or "apache pig" (see
http://www.google.com/trends?q=hadoop%2Chadoop+pig). I am guessing
that most Pig users come from Hadoop users who discover Pig via
Hadoop's website.
Losing that subproject tab on Hadoop's front page may radically
lower the number of users coming to Pig to check out our project. I
would argue that this benefits Hadoop as well, since high level
languages like Pig Latin have the potential to greatly extend the
user base and usability of Hadoop.
Two, being explicitly connected to Hadoop keeps our two communities
aware of each other's needs. There are features proposed for MR that
would greatly help Pig.  By staying in the Hadoop community Pig is
better positioned to advocate for and help implement and test those
features. The response to this will be that Pig developers can still
subscribe to Hadoop mailing lists, submit patches, etc.  That is,
they can still be part of the Hadoop community. Which reinforces my
point that it makes more sense to leave Pig in the Hadoop community
since Pig developers will need to be part of that community anyway.

Finally, philosophically it makes sense to me that projects that are
tightly connected belong together. It strikes me as strange to have
Pig as a TLP completely dependent on another TLP.  Hadoop was
originally a subproject of Lucene. It moved out to be a TLP when it
became obvious that Hadoop had become independent of and useful apart
from Lucene.  Pig is not in that position relative to Hadoop.

So, I'm -1 on Pig moving out.  But this is a soft -1.  I'm open to
being persuaded that I'm wrong or my concerns can be addressed while
still having Pig as a TLP.

Alan.


On Mar 19, 2010, at 10:59 AM, Alan Gates wrote:

You have probably heard by now that there is a discussion going on
in the Hadoop PMC as to whether a number of the subprojects (Hbase, Avro,
Zookeeper, Hive, and Pig) should move out from under the Hadoop
umbrella and become top level Apache projects (TLP).  This
discussion has picked up recently since the Apache board has clearly
communicated to the Hadoop PMC that it is concerned that Hadoop is
acting as an umbrella project with many disjoint subprojects
underneath it.  They are concerned that this gives Apache little
insight into the health and happenings of the subproject communities
which in turn means Apache cannot properly mentor those communities.

The purpose of this email is to start a discussion within the Pig
community about this topic.  Let me cover first what becoming TLP
would mean for Pig, and then I'll go into what options I think we as
a community have.
Becoming a TLP would mean that Pig would itself have a PMC that
would report directly to the Apache board. Who would be on the PMC
would be something we as a community would need to decide. Common
options would be to say all active committers are on the PMC, or all
active committers who have been a committer for at least a year. We
would also need to elect a chair of the PMC.  This lucky person
would have no additional power, but would have the additional
responsibility of writing quarterly reports on Pig's status for
Apache board meetings, as well as coordinating with Apache to get
accounts for new committers, etc.  For more information see
http://www.apache.org/foundation/how-it-works.html#roles

Becoming a TLP would not mean that we are ostracized from the Hadoop
community. We would continue to be invited to Hadoop Summits, HUGs,
etc.
Since all Pig developers and users are by definition Hadoop users,
we would continue to be a strong presence in the Hadoop community.

I see three ways that we as a community can respond to this:

1) Say yes, we want to be a TLP now.
2) Say yes, we want to be a TLP, but not yet. We feel we need more
time to mature.  If we choose this option we need to be able to
clearly articulate how much time we need and what we hope to see
change in that time.
3) Say no, we feel the benefits for us staying with Hadoop outweigh
the drawbacks of being a disjoint subproject. If we choose this, we
need to be able to say exactly what those benefits are and why we
feel they will be compromised by leaving the Hadoop project.
There may be other options that I haven't thought of. Please feel free
to suggest any you think of.

Questions?  Thoughts?  Let the discussion begin.

Alan.
