Re: Revisit Pig Philosophy?

Amr Awadallah Mon, 21 Sep 2009 15:30:38 -0700

> Pig Latin is intended to be a language for parallel data processing. It is not tied to one particular parallel framework

+1


-- amr

Alan Gates wrote:

I agree with Milind that we should move to saying that Pig Latin is a data flow language independent of any particular platform, while the current implementation of Pig is tied to Hadoop. I'm not sure how thin that implementation will be, but I'm in favor of making it thin where possible (such as the recent proposal to shift LoadFunc to directly use InputFormat).
I also strongly agree that we need to be more precise in our terminology between Pig (the platform) and Pig Latin (the language), especially as we're working on making Pig bilingual (with the addition of SQL).
I am fine with saying that Pig SQL adheres as much as possible (given the underlying systems, etc.) to ANSI SQL semantics. And where there is shared functionality such as UDFs we again adhere to SQL semantics when it does not conflict with other Pig goals. So COUNT and SUM should handle nulls the way SQL does, for example. But we need to craft the statement carefully. To see why, consider Pig's data model. We would like our types to map nicely into SQL types, so that if Pig SQL users declare a column to be of type VARCHAR(32) or FLOAT(10) we can map those onto some Pig type. But we don't want to use SQL types directly inside Pig, as they aren't a good match for much of Pig processing. So any statement of using SQL semantics needs caveats.
I would also vote for modifying our Pigs Live Anywhere dictum to be:
Pig Latin is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. The initial implementation of Pig is on Hadoop and seeks to leverage the power of Hadoop wherever possible. However, nothing Hadoop specific should be exposed in Pig Latin.
We may also want to add a vocabulary section to the philosophy statement to clarify between Pig and Pig Latin.
Alan.


On Sep 18, 2009, at 8:01 PM, Milind A Bhandarkar wrote:
It's Friday evening, so I have some time to discuss philosophy ;-)

Before we discuss any question about revisiting pig philosophy, the
first question that needs to be answered is "what is pig?" (this
corresponds to the Hindu philosophy's basic argument, that any deep
personal philosophical investigation needs to start with the question
"koham?" (in Sanskrit, it means 'who am I?'))

So, coming back to approx 4000 years after the origin of that
philosophy, we need to ask "what is pig?" (incidentally, pig, or
varaaha in Sanskrit, was the second incarnation of lord Vishnu in
hindu scriptures, but that's not relevant here.)

What we need to decide is, is pig a dataflow language? I think
not. "Pig Latin" is the language. Pig is referred to in countless
slide decks (aka pig scriptures, btw I own 50% of these scriptures)
as a runtime system that interprets pig Latin, kind of like java and
jvm. (Duality of nature, called "dwaita" philosophy in Sanskrit, is
applicable here. But I won't go deeper than that.)

So, pig-Latin-the-language's stance could still be that it could be
implemented on any runtime. But pig the runtime's philosophy could be
that it is a thin layer on top of hadoop. And all the world could
breathe a sigh of relief. (mostly, by not having to answer these
philosophical questions.)

So, 'koham' is the 4000 year old question this project needs to
answer. That's all.

AUM...... (it's Friday.)

- (swami) Milind ;-)

On Sep 18, 2009, at 19:05, "Jeff Hammerbacher" <ham...@cloudera.com>
wrote:
Hey,
2. Local mode and other parallel frameworks

<snip>
Pigs Live Anywhere

Pig is intended to be a language for parallel data processing. It is not
tied to one particular parallel framework. It has been implemented first
on hadoop, but we do not intend that to be only on hadoop.
</snip>

Are we still holding onto this? What about local mode? Local mode is not
being treated on equal footing with that of Hadoop for practical
reasons. However, users expect things that work on local mode to work
without any hitches on Hadoop.

Are we still designing the system assuming that Pig will be stacked on
top of other parallel frameworks?
FWIW, I appreciate this philosophical stance from Pig. Allowing locally
tested scripts to be migrated to the cluster without breakage is a noble
goal, and keeping the option of (one day) developing an alternative
execution environment for Pig that runs over HDFS but uses a richer
physical set of operators than MapReduce would be great.

Of course, those of you who are running Pig in production will have a much
better sense of the feasibility, rather than desirability, of this
philosophical stance.

Later,
Jeff
