> On Jun 28, 2017, at 10:58 PM, weijie tong <[email protected]> wrote: > > HI all: > anyone can explain the detail of the MonteCarlo algorithm to compute > the tiles of a Lattice? > It seems that MonteCarlo algorithm will simulate every possible query > of all kind of AggregateImpls ,and will choose the lowest cost's ( cost > model determined by the estimateCost() method of LatticeImpl ) > AggregateImpl to be the titles.
ExhaustiveLatticeAlgorithm will try every possible query (2^n if there are n attributes), whereas MonteCarloAlgorithm tries a set of random queries. Both algorithms are greedy algorithms. Each iteration, they assume that a set of aggregates have been chosen, and choose the best aggregate to add to it by calling getBenefit (which, despite its name, is a cost-benefit ratio). Repeat until there are enough aggregates. > I also find that the cost benefits of the choose AggregateImpls don't > play any role to the final output AggregateImpl. If you’re referring to the list of CostBenefit objects created at the end of the algorithm; yes, they are just info to put on the screen and prove that the algorithm has done a great job. But you’ll see that getBenefit is called in the inner loop. > please correct my opinion and show me the mathematical theory of the > MonteCarlo algorithm to choose the best aggregates . MonteCarloAlgorithm could be improved by taking into account historic queries, but I think it does a good job for the case where no previous queries are not known. The biggest problem with the algorithm is the amount of time spent gathering statistics. My work on data profiling [1] [2] will speed up getBenefit hugely because it will be able to answer aggregate.estimateRowCount() without executing a query. Julian [1] https://issues.apache.org/jira/browse/CALCITE-1616 <https://issues.apache.org/jira/browse/CALCITE-1616> [2] https://www.slideshare.net/julianhyde/data-profiling-with-apache-calcite <https://www.slideshare.net/julianhyde/data-profiling-with-apache-calcite>
