The Load/Store redesign proposal includes an interface that defines how
stats are represented. A loader that implements ResourceLoader will
pass statistics up into Pig, which then does whatever it needs to do
with them. How the stats get loaded is up to the loader
implementation: they can be read from a metadata service, sampled on
the fly, read from a metadata file, etc.
For simplicity, we are working with serialized JSON representations of
ResourceStatistics right now.
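To illustrate the idea of serialized JSON statistics, here is a minimal
sketch in Python. It is not Pig's actual ResourceStatistics schema or API:
the class names (RelationStats, FieldStats), the JSON keys (num_records,
size_bytes, fields, num_distinct, min_value, max_value) and the helper
functions are all hypothetical, chosen only to show a loader-side round
trip between a JSON document and an in-memory stats object.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class FieldStats:
    # Hypothetical per-field statistics; unknown values stay None.
    num_distinct: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None


@dataclass
class RelationStats:
    # Hypothetical relation-level statistics a loader could hand upward.
    num_records: Optional[int] = None
    size_bytes: Optional[int] = None
    fields: Dict[str, FieldStats] = field(default_factory=dict)


def stats_from_json(text: str) -> RelationStats:
    """Parse a JSON document into RelationStats, tolerating missing keys."""
    raw = json.loads(text)
    fields = {
        name: FieldStats(
            num_distinct=vals.get("num_distinct"),
            min_value=vals.get("min_value"),
            max_value=vals.get("max_value"),
        )
        for name, vals in raw.get("fields", {}).items()
    }
    return RelationStats(
        num_records=raw.get("num_records"),
        size_bytes=raw.get("size_bytes"),
        fields=fields,
    )


def stats_to_json(stats: RelationStats) -> str:
    """Serialize RelationStats back to JSON (asdict recurses into fields)."""
    return json.dumps(asdict(stats), sort_keys=True)
```

Every statistic is optional in this sketch, so a loader that only knows
the record count (or nothing at all) can still return a valid object, and
whatever consumes the stats has to cope with missing values.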
2009/11/6 RichardGUO Fei <gladiato...@hotmail.com>:
> Thanks for sharing. I look forward to seeing your work. I implemented a
> storage and want to connect Pig to my storage.
> In order to let the optimizer fully benefit from the histogram and the
> side-information of my storage, I am thinking of
> implementing a cost-based optimizer.
> How do you plan to pass in the statistics? Say your input file
> is a plain-text log file: do you require the users to
> compute the statistics themselves? Or do you plan to limit this to only certain
> types of storage?
>> Date: Thu, 5 Nov 2009 22:54:47 -0500
>> Subject: Re: How to clone a logical plan ?
>> From: dvrya...@gmail.com
>> To: email@example.com
>> At a high level, we are implementing the framework for propagating
>> statistics between Pig operators, and using said statistics to make
>> moderately intelligent decisions about Join types that should be used
>> (unless they are specified by the user). We do this in a fairly
>> brute-force manner, by generating all alternative plans (that part is
>> not working so hot right now, see subject) and costing them, choosing
>> the global minimum (there is some pruning happening, but not as much
>> as something like System R). As far as relation order inside a given
>> Join, we set that deterministically after choosing the join, as Pig
>> has specific preferences for where the largest relation should go for
>> a given join type. Once we have join type selection working, other
>> optimizations can be added -- the tricky part is making sure the
>> costing functions can't produce drastically wrong results.
>> All the work is happening at the logical layer, between the rule-based
>> optimizer and LogToPhysTranslator.
>> 2009/11/5 RichardGUO Fei <gladiato...@hotmail.com>:
>> > Hi,
>> > I am also doing a cost-based optimizer. So I am interested in knowing some
>> > of the specs that you are after.
>> > Thanks,
>> > Richard