Hi Alan,
Thanks for the detailed review.

After getting Daniel's feedback (and grokking the relationship between
Pig's logical and physical operators, which differs a little from what
is described in the literature), we agree that the proper place to
put the optimizer is at the logical layer, although we will need to
compile to the physical layer to get cost estimates (for example, the
number of generated MR jobs, which have associated
network/queueing/startup costs). In order to adaptively adjust
estimates, we will need to be able to trace back from an executed MR
job ("job set", really, as some operations like order and join may
require several jobs that are considered a single unit) to the logical
operators this job covered. Adding that ability will have the
additional benefit of enabling more helpful debugging output to end
users by associating a failed MR job with what it was supposed to be
doing.
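
To make the trace-back idea concrete, here is a rough sketch of the kind of
bookkeeping we have in mind. It is illustrative only: JobSet, Provenance, the
job ids, and the operator labels are all hypothetical names made up for this
example, not existing Pig classes, and Python is used purely for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JobSet:
    """One or more MR jobs treated as a single unit (e.g. an order or a join).

    Hypothetical type for illustration; not part of Pig.
    """
    name: str
    job_ids: tuple


@dataclass
class Provenance:
    """Hypothetical registry linking job sets to the logical operators they cover."""
    ops_by_job_set: dict = field(default_factory=dict)
    job_set_by_job: dict = field(default_factory=dict)

    def record(self, job_set, logical_ops):
        # Called at compile time, when the logical plan is turned into MR jobs.
        self.ops_by_job_set[job_set] = list(logical_ops)
        for job_id in job_set.job_ids:
            self.job_set_by_job[job_id] = job_set

    def logical_ops_for_job(self, job_id):
        # Trace back from a single executed MR job to its logical operators.
        return self.ops_by_job_set[self.job_set_by_job[job_id]]

    def describe_failure(self, job_id):
        # The debugging benefit: report what a failed job was supposed to do.
        ops = ", ".join(self.logical_ops_for_job(job_id))
        return f"MR job {job_id} failed while executing: {ops}"


provenance = Provenance()
# An order that compiles to two MR jobs, tracked as one job set.
provenance.record(JobSet("order", ("job_1", "job_2")), ["ORDER B BY x"])
# A filter and a group that fit into a single MR job.
provenance.record(JobSet("filter_group", ("job_3",)), ["FILTER A BY y > 0", "GROUP C BY z"])

print(provenance.describe_failure("job_2"))
```

The same lookup could be used after a job set finishes, to attach observed
costs to the logical operators it covered and adjust the estimates.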

Totally agree with respect to PigServer and MapReduceLauncher.  Making
PigServer an actual "server" would be good, but is somewhat orthogonal
to this work.

Great to know you are working on statistics; I am looking forward to
reading the proposal.  Are you working on just data stats, or also on
execution stats (time per operator per record, that sort of thing)?


On Fri, Sep 11, 2009 at 1:56 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
> This is a good start at adding a cost based optimizer to Pig.  I have a
> number of comments:
> 1) Your argument for putting it in the physical layer rather than the
> logical is that the logical layer does not know physical statistics.  This
> need not be true.  You suggest adding a getStatistics call to the loader to
> give statistics.  The logical layer can make this call and make decisions
> based on the results without understanding the underlying physical layer.
>  It seems that the real reason you want to put the optimizer in the physical
> layer is, rather than trying to do predictive statistics (such as we guess
> this join will result in a 2x data explosion) you want to see the results of
> actual MR jobs and then make decisions.  This seems like a reasonable choice
> for a couple of reasons:  a) statistical guesses are hard to get right, and
> Pig has limited statistics to begin with; b) since Pig Latin scripts can be
> arbitrarily long, bad guesses at the beginning will have a worse ripple
> effect than bad guesses in a SQL optimizer.
> 2) The changes you propose in Pig Server are quite complex.  Would it be
> possible instead to put the changes in MapReduceLauncher?  It could run the
> first MR job in a Pig Latin script, look at the results, and then rerun your
> CBO on the remaining physical plan and re-translate this to a new MR plan
> and resubmit.  This would require annotations to the MR plan to indicate
> where in a physical plan the MR boundaries fall, so that correct portions of
> the original physical plan could be used for reoptimization and
> recompilation.  But it would contain the complexity of your changes to
> MapReduceLauncher instead of scattering them through the entire system.
> 3) On adding getStatistics, I am currently working on a proposal to make a
> number of changes to the load interface, including getStatistics.  I hope to
> publish that proposal by next week.  Similarly I am working on a proposal of
> how Pig will interact with metadata systems (such as Owl) which I also hope
> to propose next week.  We will be actively working in these areas because we
> need them for our SQL implementation.  So, one, you'll get a lot of this for
> free; two, we should stay connected on these things so what we implement
> works for what you need.
> Alan.
> On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote:
>> Whoops :-)
>> Here's the Google doc:
>> http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en
>> -Dmitriy
>> On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan<s...@yahoo-inc.com>
>> wrote:
>>> Dmitriy and Gang,
>>> The mailing list does not allow attachments. Can you post it on a
>>> website and just send the URL ?
>>> Thanks,
>>> Santhosh
>>> -----Original Message-----
>>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>>> Sent: Tuesday, September 01, 2009 9:48 AM
>>> To: pig-dev@hadoop.apache.org
>>> Subject: Request for feedback: cost-based optimizer
>>> Hi everyone,
>>> Attached is a (very) preliminary document outlining a rough design we
>>> are proposing for a cost-based optimizer for Pig.
>>> This is being done as a capstone project by three CMU Master's students
>>> (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not
>>> necessarily meant for immediate incorporation into the Pig codebase,
>>> although it would be nice if it, or parts of it, are found to be useful
>>> in the mainline.
>>> We would love to get some feedback from the developer community
>>> regarding the ideas expressed in the document, any concerns about the
>>> design, suggestions for improvement, etc.
>>> Thanks,
>>> Dmitriy, Ashutosh, Tejal

