This is a good start at adding a cost-based optimizer to Pig. I have
a number of comments:
1) Your argument for putting it in the physical layer rather than the
logical is that the logical layer does not know physical statistics.
This need not be true. You suggest adding a getStatistics call to the
loader to give statistics. The logical layer can make this call and
make decisions based on the results without understanding the
underlying physical layer. It seems that the real reason you want to
put the optimizer in the physical layer is that, rather than relying on
predictive statistics (such as guessing that a join will result in a 2x
data explosion), you want to see the results of actual MR jobs and then
make decisions. This seems like a reasonable choice for a couple of
reasons: a) statistical guesses are hard to get right, and Pig has
limited statistics to begin with; b) since Pig Latin scripts can be
arbitrarily long, bad guesses at the beginning will have a worse
ripple effect than bad guesses in a SQL optimizer.
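To put a rough number on (b), here is a minimal sketch. It assumes each operator's size estimate can be off by the same multiplicative factor and that the errors compound along a linear pipeline. The function name and the factor of 2 are illustrative assumptions, not measurements from Pig.

```python
def worst_case_error(per_step_factor: float, num_steps: int) -> float:
    """Worst-case ratio between estimated and actual data size after
    num_steps operators, assuming every estimate is off by
    per_step_factor in the same direction (illustrative assumption)."""
    if per_step_factor < 1 or num_steps < 0:
        raise ValueError("factor must be >= 1 and steps must be >= 0")
    return per_step_factor ** num_steps


# Compare a short query with progressively longer scripts.
for steps in (3, 10, 20):
    print(steps, worst_case_error(2.0, steps))
```

Under these assumptions a three-operator query stays within 8x of reality, while a 20-step script could drift by a factor of about a million. That is the argument for observing actual job output rather than predicting it.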
2) The changes you propose in PigServer are quite complex. Would it
be possible instead to put the changes in MapReduceLauncher? It could
run the first MR job in a Pig Latin script, look at the results, then
rerun your CBO on the remaining physical plan, re-translate the result
into a new MR plan, and resubmit it. This would require annotations on
the MR plan to indicate where in the physical plan the MR boundaries
fall, so that the correct portions of the original physical plan could
be used for reoptimization and recompilation. But it would confine the
complexity of your changes to MapReduceLauncher instead of scattering
it through the entire system.
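The loop suggested in point 2 could look roughly like the sketch below. It is not Pig's MapReduceLauncher code. Every name in it (PhysicalOp, MRJob, ops_covered, launch, and the three callbacks) is hypothetical, and the physical plan is simplified to a linear list of operators. The ops_covered field stands in for the boundary annotation: it records how many physical operators the job consumes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PhysicalOp:
    name: str


@dataclass
class MRJob:
    name: str
    ops_covered: int  # boundary annotation: physical operators this job consumes


def launch(
    physical_plan: List[PhysicalOp],
    optimize: Callable[[List[PhysicalOp], Optional[dict]], List[PhysicalOp]],
    compile_to_mr: Callable[[List[PhysicalOp]], List[MRJob]],
    run_job: Callable[[MRJob], dict],
) -> List[str]:
    """Run one MR job at a time, re-optimizing and recompiling the
    remaining physical plan with the statistics of the job just run."""
    executed: List[str] = []
    remaining = list(physical_plan)
    stats: Optional[dict] = None  # nothing observed before the first job
    while remaining:
        remaining = optimize(remaining, stats)
        first = compile_to_mr(remaining)[0]
        if first.ops_covered < 1:
            raise ValueError("a job must cover at least one operator")
        stats = run_job(first)
        executed.append(first.name)
        # Use the boundary annotation to drop the operators already executed.
        remaining = remaining[first.ops_covered:]
    return executed


# Stub callbacks for demonstration only.
def identity_optimizer(ops, stats):
    return ops


def two_ops_per_job(ops):
    chunks = [ops[i:i + 2] for i in range(0, len(ops), 2)]
    return [MRJob("+".join(o.name for o in c), len(c)) for c in chunks]


def fake_run(job):
    return {"job": job.name, "output_rows": 100}


plan = [PhysicalOp(n) for n in ["load", "filter", "join", "group", "store"]]
print(launch(plan, identity_optimizer, two_ops_per_job, fake_run))
```

Only the first job of each freshly compiled MR plan is submitted. The rest of that plan is thrown away and rebuilt once real statistics are available, which keeps the feedback loop inside the launcher.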
3) On adding getStatistics, I am currently working on a proposal to
make a number of changes to the load interface, including
getStatistics. I hope to publish that proposal by next week.
Similarly, I am working on a proposal for how Pig will interact with
metadata systems (such as Owl), which I also hope to publish next
week. We will be actively working in these areas because we need them
for our SQL implementation. So, one, you'll get a lot of this for
free; two, we should stay connected on these things so that what we
implement works for what you need.
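Neither proposal is spelled out in this message, so the following is only an illustration of the kind of call discussed in point 1, not the forthcoming load interface. It is a minimal sketch in which a loader optionally reports statistics and logical-layer code uses them without knowing how they were produced. All names (Statistics, Loader, get_statistics, FixedStatsLoader, smaller_input) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Statistics:
    num_rows: Optional[int] = None
    size_bytes: Optional[int] = None


class Loader(Protocol):
    def get_statistics(self, location: str) -> Optional[Statistics]:
        """Return statistics for the input, or None if unknown."""
        ...


class FixedStatsLoader:
    """Toy loader that serves canned statistics from a dictionary."""

    def __init__(self, table: Dict[str, Statistics]) -> None:
        self._table = dict(table)

    def get_statistics(self, location: str) -> Optional[Statistics]:
        return self._table.get(location)


def smaller_input(loader: Loader, left: str, right: str) -> Optional[str]:
    """Logical-layer decision: name the smaller join input, or return
    None when either size is unknown so the caller keeps its default."""
    l_stats = loader.get_statistics(left)
    r_stats = loader.get_statistics(right)
    if l_stats is None or r_stats is None:
        return None
    if l_stats.size_bytes is None or r_stats.size_bytes is None:
        return None
    return left if l_stats.size_bytes <= r_stats.size_bytes else right


loader = FixedStatsLoader({
    "users": Statistics(num_rows=1_000, size_bytes=50_000),
    "clicks": Statistics(num_rows=10_000_000, size_bytes=2_000_000_000),
})
print(smaller_input(loader, "users", "clicks"))
```

Returning None for unknown statistics lets loaders that cannot supply them opt out, and the optimizer then falls back to its default plan.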
Alan.
On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote:
Whoops :-)
Here's the Google doc:
http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en
-Dmitriy
On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan <s...@yahoo-inc.com> wrote:
Dmitriy and Gang,
The mailing list does not allow attachments. Can you post it on a
website and just send the URL?
Thanks,
Santhosh
-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, September 01, 2009 9:48 AM
To: pig-dev@hadoop.apache.org
Subject: Request for feedback: cost-based optimizer
Hi everyone,
Attached is a (very) preliminary document outlining a rough design we
are proposing for a cost-based optimizer for Pig.
This is being done as a capstone project by three CMU Master's students
(myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
necessarily meant for immediate incorporation into the Pig codebase,
although it would be nice if it, or parts of it, are found to be useful
in the mainline.
We would love to get some feedback from the developer community
regarding the ideas expressed in the document, any concerns about the
design, suggestions for improvement, etc.
Thanks,
Dmitriy, Ashutosh, Tejal