Yes on LoadMetadata being how the user provides that information, and yes on
attaching the statistics only to logical operators.
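
A minimal sketch of that idea, in case it helps. It uses hypothetical stand-in classes, not Pig's actual ones: Stats, LogicalOp, Load, Join, JoinType and chooseJoinType are all made up for illustration, and the statistics are faked by hand. Only the approach comes from this thread: carry the statistics on the logical operators, and switch to a merge join when both inputs are known to be sorted on the join key.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Pig classes.
final class JoinTypeSketch {

    enum JoinType { HASH, MERGE }

    // Faked statistics: the fields an input is sorted on, in order.
    static final class Stats {
        final List<String> sortKeys;

        Stats(String... sortKeys) {
            this.sortKeys = Arrays.asList(sortKeys);
        }

        // True if `key` is the leading sort field of the data.
        boolean sortedOn(String key) {
            return !sortKeys.isEmpty() && sortKeys.get(0).equals(key);
        }
    }

    // A logical operator carrying whatever statistics were attached to it.
    static class LogicalOp {
        Stats stats; // null means "no statistics available"
    }

    static final class Load extends LogicalOp {
        Load(Stats stats) {
            this.stats = stats;
        }
    }

    static final class Join extends LogicalOp {
        final LogicalOp left;
        final LogicalOp right;
        final String leftKey;
        final String rightKey;
        JoinType type = JoinType.HASH; // assumed default picked at compile time

        Join(LogicalOp left, String leftKey, LogicalOp right, String rightKey) {
            this.left = left;
            this.leftKey = leftKey;
            this.right = right;
            this.rightKey = rightKey;
        }
    }

    // The rule: "if we have stats, we use them"; otherwise leave the plan alone.
    static void chooseJoinType(Join join) {
        Stats l = join.left.stats;
        Stats r = join.right.stats;
        if (l != null && r != null
                && l.sortedOn(join.leftKey) && r.sortedOn(join.rightKey)) {
            join.type = JoinType.MERGE;
        }
    }

    public static void main(String[] args) {
        // Both inputs report being sorted on the join key.
        Join sorted = new Join(new Load(new Stats("id")), "id",
                               new Load(new Stats("id")), "id");
        chooseJoinType(sorted);
        System.out.println(sorted.type);

        // One input has no statistics, so the rule does nothing.
        Join unknown = new Join(new Load(null), "id",
                                new Load(new Stats("id")), "id");
        chooseJoinType(unknown);
        System.out.println(unknown.type);
    }
}
```

Nothing below the logical level is touched here: the rule only reads the attached statistics and changes the join type recorded on the logical join.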

D

On Fri, Nov 5, 2010 at 10:38 AM, Renato Marroquín Mogrovejo <
[email protected]> wrote:

> @Gianmarco
> It would be, in a way. A cost-based optimizer would be awesome, but when
> dealing with large amounts of data, the statistics needed to make accurate
> estimates are not easy to get or to maintain. As for just hacking into the
> code, I guess it is my fault for not explaining myself: the approach would
> be to add a new rule to the logical optimizer framework that Pig 0.8 uses.
> I know it would be awesome to have a cost-based optimizer, but I guess we
> will have to get there little by little (:
>
> @Dmitriy
> I am not trying to write the whole optimizer here; I would like to add a
> rule to the logical optimizer.
> So I will fake the statistics to isolate the optimization problem. For
> example, there is sort information in the ResourceSchema, and I would need
> to check that information to change the join operation. What I don't
> understand is this: OK, I will fake the information, but how would it come
> from outside? Would the user provide it through the LoadMetadata interface?
> And when you say to attach ResourceStatistics to the operator instances,
> you mean any operator that needs them at the logical level (e.g. LOJoin),
> so no changes would have to be made at the physical level, right?
>
> Please correct me if I am mistaken. Thanks in advance.
>
>
> Renato M.
>
>
>
>
>
> 2010/11/4 Dmitriy Ryaboy <[email protected]>
>
> > 1. Collection is kind of a separate problem. You can write an optimizer
> > from the position of "if we have stats, we use them" and punt on this.
> > Assume there is something that provides the stats. Fake them while you
> > are dealing with the optimization problem.
> >
> > 2. Attach ResourceStatistics to the different operator instances and
> > mutate them as appropriate while walking down the operators.
> >
> > -D
> >
> > On Tue, Nov 2, 2010 at 10:09 PM, Renato Marroquín Mogrovejo <
> > [email protected]> wrote:
> >
> > > A couple of weeks ago, in a list discussion, Alan suggested an
> > > interesting project to me: switching join operators based on properties
> > > of the data. At logical plan compile time a specific join operator might
> > > be chosen, but that operator may not be the most suitable one for the
> > > data. For example, if both data sources are ordered by their join key,
> > > then a merge join would be the best operator.
> > > But at this point I don't know how I should proceed. I have some
> > > 'general doubts' about the approach that should be taken.
> > >
> > > 1. Data statistics can be passed to the LoadFunc by using the
> > > LoadMetadata interface, right? But how should these statistics be
> > > collected? Should I modify the LOLoad class to use a different LoadFunc?
> > > 2. And how would these statistics be passed to the optimizer so that it
> > > can change the join operator, if that turns out to be appropriate?
> > >
> > > Please correct me if I am wrong (which I probably am); any suggestions
> > > or comments are highly appreciated.
> > > Thanks in advance.
> > >
> > >
> > > Renato M.
> > >
> >
>
