Note that Calcite has a “plus” module which is a place to add other data sets (e.g. TPC-H, TPC-DS) and tests and benchmarks based on them. There is also the “ubenchmark” module for micro-benchmarks. I don’t know whether the work you are planning would be a natural fit within these modules.
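For what it's worth, the ubenchmark module is JMH-based, and the measurement discipline JMH automates (warmup iterations so the JIT settles, a sink to defeat dead-code elimination, averaging over many invocations) can be sketched in plain Java. The workload below is a hypothetical stand-in; a real benchmark would call into Calcite's planner instead:

```java
/**
 * Minimal warmup-then-measure loop illustrating the pattern that JMH
 * automates in the ubenchmark module. The workload is a stand-in
 * (hypothetical); a real benchmark would invoke the planner here.
 */
public class MicroBench {
    // Volatile sink so the JIT cannot eliminate the workload as dead code
    // (JMH's Blackhole serves the same purpose).
    static volatile long sink;

    // Stand-in workload: sum of squares of 0..n-1.
    static long workload(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += (long) i * i;
        }
        return acc;
    }

    /**
     * Runs warmup iterations (results discarded) followed by measured
     * iterations; returns mean nanoseconds per invocation.
     */
    static double measure(int warmup, int iters, int n) {
        for (int i = 0; i < warmup; i++) {
            sink = workload(n);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            sink = workload(n);
        }
        return (System.nanoTime() - start) / (double) iters;
    }

    public static void main(String[] args) {
        double meanNs = measure(1_000, 10_000, 1_000);
        System.out.printf("mean %.1f ns/op%n", meanNs);
    }
}
```

A hand-rolled loop like this is fine for a first look, but JMH also handles forking, statistical reporting, and on-stack-replacement pitfalls, so the ubenchmark module is the better home for anything publishable.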
> On Feb 5, 2018, at 4:38 PM, Edmon Begoli <[email protected]> wrote:
>
> I am going to create two JIRA issues:
>
> 1. Development of the benchmark for Calcite.
>
> 2. An R&D effort focused on benchmarking, performance evaluation, and a
> study.
>
> Thank you,
> Edmon
>
> On Mon, Feb 5, 2018 at 9:26 AM, Michael Mior <[email protected]> wrote:
>
>> One interesting exercise would be to pick a popular benchmark (e.g.
>> TPC-H) and compare the plans produced by Calcite against those of
>> existing RDBMS optimizers (e.g. Postgres, MySQL). Along with a
>> performance analysis of the various options, it seems there's a paper
>> in there.
>>
>> --
>> Michael Mior
>> [email protected]
>>
>> 2018-02-03 23:21 GMT-05:00 Edmon Begoli <[email protected]>:
>>
>>> I am planning to open an issue and coordinate an initiative to
>>> develop a Calcite-focused benchmark.
>>>
>>> This would lead to the development of an executable, reportable
>>> benchmark, and of the next publication aimed at another significant
>>> computer science conference or journal.
>>>
>>> Before I submit a JIRA issue, I would like your feedback on what this
>>> benchmark should measure and how it should be implemented.
>>>
>>> A couple of preliminary thoughts that came out of the conversation
>>> with the co-authors of our SIGMOD paper:
>>>
>>> * Optimizer runtime for complex queries (we could also compare with
>>>   the runtime of executing the optimized query directly):
>>>   * Calcite-optimized query
>>>   * Unoptimized query with the backend's optimizer disabled
>>>   * Unoptimized query with the backend's optimizer enabled
>>> * Overhead of going through Calcite adapters vs. natively accessing
>>>   the target DB
>>> * Comparison with other federated query processing engines such as
>>>   Spark SQL and PrestoDB
>>>   * use TPC-H or TPC-DS for this purpose
>>>   * use the Star Schema Benchmark (SSB)
>>> * Planning and execution time for queries that span multiple systems
>>>   (e.g. Postgres and Cassandra, Postgres and Pig, Pig and Cassandra)
>>>
>>> Follow approaches similar to:
>>> * https://www.slideshare.net/julianhyde/w-435phyde-3
>>> * https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_hive-performance-tuning/content/ch_cost-based-optimizer.html
>>> * https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/
>>>   (How much of this is still relevant (Hive 0.14)? Can we reuse its
>>>   queries/benchmarks?)
>>>
>>> Please share your suggestions.
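On reporting the comparison matrix the quoted thread proposes (Calcite-optimized vs. unoptimized with the backend's optimizer on/off), one simple presentation is to normalize each configuration's runtime against the slowest baseline. A sketch, with all timings below being hypothetical placeholders rather than measurements:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of reporting benchmark configurations as speedups relative to
 * a chosen baseline. The runtime numbers used in main() are hypothetical
 * placeholders, not measurements of any system.
 */
public class SpeedupReport {
    /** Returns the speedup of each configuration vs. the named baseline. */
    static Map<String, Double> speedups(Map<String, Double> runtimeMs,
                                        String baseline) {
        double base = runtimeMs.get(baseline);
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : runtimeMs.entrySet()) {
            // speedup = baseline runtime / this configuration's runtime
            out.put(e.getKey(), base / e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> t = new LinkedHashMap<>();
        t.put("unoptimized, backend optimizer off", 400.0); // hypothetical
        t.put("unoptimized, backend optimizer on", 120.0);  // hypothetical
        t.put("Calcite-optimized", 100.0);                  // hypothetical
        speedups(t, "unoptimized, backend optimizer off")
            .forEach((k, v) -> System.out.printf("%-40s %.2fx%n", k, v));
    }
}
```

Reporting ratios rather than raw times makes runs on different hardware at least loosely comparable, though the raw numbers should still be published alongside them.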
