Hi,

I am in the process of writing a proposal for this GSoC summer project. I read 
the VLDB paper by Dr. Viktor Leis, which introduces a new benchmark suite for 
evaluating database query optimizers: 
http://www.vldb.org/pvldb/vol9/p204-leis.pdf. 

The paper presents an end-to-end study of the various components of a query 
optimizer and isolates the impact each component has on producing query plans, 
which ideally should be close to optimal. The scope of this project includes 
running the benchmark suite against Derby and developing a knowledge base for 
improving the Derby optimizer in the future. In the Derby context, I have read 
about how Derby produces cardinality estimates and updates statistics, about 
Derby's cost model, and about the enumeration space Derby uses.

In the paper, Dr. Viktor Leis shows that cardinality estimates matter more 
for producing good query plans than the cost model or the enumeration space. 
Even before isolating the impact of cardinalities on query plans (by injecting 
true cardinalities, which is itself part of this project), I suspect that 
cardinality estimation has a lot of scope for improvement in Derby.

I am proposing the introduction of optional table sampling to improve 
cardinality estimation. With table samples available, cardinality estimates 
can be obtained reliably even when filtering on a set of mutually correlated 
attributes, a case Derby currently gets wrong because it assumes uniformity 
and independence between attributes of the same table. I would specifically 
like to ask whether such optional sampling should be introduced in Derby, at 
the cost of giving up the simplicity and low overhead of the one-dimensional 
histograms the Derby optimizer currently uses. The scope of this project can 
then be adjusted accordingly.

Regards,
Harshvardhan Gupta
