@Gianmarco
It would be, in a way. A cost-based optimizer would be awesome, but when
dealing with large amounts of data, the statistics needed to make accurate
estimates are not easy to collect or to maintain. As for just hacking into
the code, I guess it is my fault for not explaining myself well. The
approach would be to add a new rule to the logical optimizer framework that
Pig 0.8 uses.
I know it would be awesome to have a cost-based optimizer, but I guess we
will have to get there little by little (:

@Dmitriy
I am not trying to write the whole optimizer here; I would just like to add
a rule to the logical optimizer.
So I will fake the statistics to isolate the optimization problem. For
example, there is sort information in the ResourceSchema, and I would need
to check it in order to change the join operation. What I don't understand
is this: OK, I will fake this information for now, but how would it come
from outside in the real case? Would the user provide it through the
LoadMetadata interface?
And when you say to attach ResourceStatistics to the operator instances, do
you mean any operator that needs them at the logical level (e.g. LOJoin),
so that no changes would have to be made at the physical level?
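
To make the question concrete, here is a rough sketch of the kind of rule I
have in mind. It is only an illustration under my own assumptions: InputInfo,
JoinNode, JoinType and apply() are hypothetical stand-ins I made up, not Pig's
real classes or its optimizer API, and the sort information is faked by hand
instead of coming from a loader. It also assumes both inputs are sorted in the
same direction.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only: none of these types are Pig's real classes.
public class MergeJoinRuleSketch {

    enum JoinType { HASH, MERGE }

    // Faked per-input metadata: the columns the input is sorted on, in order.
    static final class InputInfo {
        final String alias;
        final List<String> sortedOn;

        InputInfo(String alias, List<String> sortedOn) {
            this.alias = alias;
            this.sortedOn = sortedOn;
        }
    }

    // A logical join over two inputs, starting out as the default join type.
    static final class JoinNode {
        final InputInfo left;
        final InputInfo right;
        final List<String> leftKeys;
        final List<String> rightKeys;
        JoinType type = JoinType.HASH;

        JoinNode(InputInfo left, InputInfo right,
                 List<String> leftKeys, List<String> rightKeys) {
            this.left = left;
            this.right = right;
            this.leftKeys = leftKeys;
            this.rightKeys = rightKeys;
        }
    }

    // True if the input's sort columns start with the join keys, in order.
    static boolean sortedOnKeys(InputInfo in, List<String> keys) {
        return !keys.isEmpty()
                && in.sortedOn.size() >= keys.size()
                && in.sortedOn.subList(0, keys.size()).equals(keys);
    }

    // The rule: switch to a merge join only when both inputs are sorted on
    // their join keys. Returns true if the plan node was changed.
    static boolean apply(JoinNode join) {
        if (join.type == JoinType.HASH
                && sortedOnKeys(join.left, join.leftKeys)
                && sortedOnKeys(join.right, join.rightKeys)) {
            join.type = JoinType.MERGE;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        InputInfo a = new InputInfo("a", Arrays.asList("id", "ts"));
        InputInfo b = new InputInfo("b", Arrays.asList("id"));
        JoinNode join = new JoinNode(a, b, Arrays.asList("id"), Arrays.asList("id"));
        boolean changed = apply(join);
        System.out.println("changed=" + changed + ", type=" + join.type);
    }
}
```

In this sketch the rule only touches the logical join node, which is why I am
asking whether the physical level can stay as it is.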

Please correct me if I am mistaken. Thanks in advance.


Renato M.





2010/11/4 Dmitriy Ryaboy <[email protected]>

> 1. Collection is kind of a separate problem. You can write an optimizer
> from the position of "if we have stats, we use them" and punt on this.
> Assume there is a something that provides the stats. Fake them while you
> are dealing with the optimization problem.
>
> 2. Attach ResourceStatistics to the different operator instances and mutate
> them as appropriate while walking down the operators.
>
> -D
>
> On Tue, Nov 2, 2010 at 10:09 PM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
> > A couple of weeks ago on a list discussion Alan suggested me an
> > interesting project which consists in the idea of switching join
> > operators based on some data properties e.g. at logical plan compiling
> > time, a specific join operator might be chosen, but maybe this operator
> > is probably not the most suitable for the data. For example, if both
> > data sources are ordered by its key, then a merge join would be the best
> > operator.
> > But at this point I dunno how I should proceed. I have some 'general
> > doubts' about the approach that should be taken.
> >
> > 1. Data statistics can be passed to the LoadFunc by using the
> > LoadMetadata interface right? But how should these statistics be
> > collected? should I modify the LOLoad class to use a different LoadFunc?
> > 2. And how would these statistics be passed to the optimizer to change
> > (if it were the case) the join operator?
> >
> > Please correct me if I am wrong (which I probably am), and any
> > suggestions or comments are highly appreciated.
> > Thanks in advance.
> >
> >
> > Renato M.
> >
>
