Hi, I am in the process of writing a proposal for this GSoC project. I have read the VLDB paper by Dr. Viktor Leis, which introduces a new benchmark suite for evaluating database query optimizers: http://www.vldb.org/pvldb/vol9/p204-leis.pdf.
The paper presents an end-to-end study of the components of a query optimizer and isolates the impact each component has on producing query plans that should ideally be close to optimal. The scope of this project includes running the benchmark suite against Derby and developing a knowledge base for improving the Derby optimizer in the future.

In the Derby context, I have read about how Derby produces cardinality estimates and updates its statistics, about Derby's cost model, and about the plan enumeration space Derby explores. In the paper, Dr. Leis shows that cardinality estimates matter far more for producing good query plans than the cost model or the enumeration space do. Even before isolating the impact of cardinalities on plan quality (by injecting true cardinalities, which would be done as part of this project), I suspect that cardinality estimation has a lot of room for improvement in Derby.

I am therefore proposing the introduction of optional table sampling to improve cardinality estimation. With table samples available, estimates can be obtained reliably even when filtering on a set of mutually correlated attributes, a case Derby currently gets wrong because it assumes uniformity and independence between attributes of the same table.

I would specifically like to ask whether such optional sampling methods should be introduced in Derby at the cost of giving up the simplicity and low overhead of the one-dimensional histograms the Derby optimizer currently uses. The scope of the project can then be adjusted accordingly.

Regards,
Harshvardhan Gupta
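P.S. To make the correlation point concrete, here is a toy sketch (in Python, with hypothetical data; this is not Derby code) of how the independence assumption behind per-column statistics can badly underestimate the cardinality of a conjunctive predicate on correlated columns, while even a small random sample estimates it well:

```python
import random

random.seed(42)

# Toy table with two perfectly correlated attributes:
# every 'Honda' row has country 'Japan', etc. (hypothetical data)
makes = ["Honda", "Ford", "Fiat", "BMW"]
country_of = {"Honda": "Japan", "Ford": "USA", "Fiat": "Italy", "BMW": "Germany"}
table = [(m, country_of[m]) for m in (random.choice(makes) for _ in range(100_000))]

# Predicate: make = 'Honda' AND country = 'Japan'
def matches(row):
    return row[0] == "Honda" and row[1] == "Japan"

true_card = sum(1 for r in table if matches(r))

# 1) Independence assumption (roughly what one-dimensional histograms do):
#    multiply the per-column selectivities.
sel_make = sum(1 for r in table if r[0] == "Honda") / len(table)
sel_country = sum(1 for r in table if r[1] == "Japan") / len(table)
indep_est = sel_make * sel_country * len(table)  # ~4x too low here

# 2) Sampling: evaluate the whole conjunct on a small random sample
#    of the table and scale the hit rate back up.
sample = random.sample(table, 1000)
sample_est = sum(1 for r in sample if matches(r)) / len(sample) * len(table)

print(f"true cardinality:      {true_card}")
print(f"independence estimate: {indep_est:.0f}")
print(f"sampling estimate:     {sample_est:.0f}")
```

With four equally likely makes and perfect make-to-country correlation, the independence estimate comes out near true/4, whereas the sample-based estimate lands close to the true count. This is the effect I would expect optional table sampling to fix in Derby for correlated filter attributes.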