I am planning to open an issue and to coordinate an initiative to develop a Calcite-focused benchmark.
This would lead to the development of an executable, reportable benchmark, and of the next publication, aimed at another significant computer science conference or journal. Before I submit a JIRA issue, I would like to get your feedback on what this benchmark might be, both in terms of what it should benchmark and how it should be implemented. A couple of preliminary thoughts that came out of the conversation with the co-authors of our SIGMOD paper:

* Optimizer runtime for complex queries (we could also compare with the runtime of executing the optimized query directly); a JMH sketch of such a planning-time measurement is included at the end of this message. Execution-time baselines to compare:
  * the Calcite-optimized query
  * the unoptimized query with the backend's optimizer disabled
  * the unoptimized query with the backend's optimizer enabled
* Overhead of going through Calcite adapters vs. natively accessing the target DB (a rough timing sketch is also included below).
* Comparison with other federated query processing engines such as Spark SQL and PrestoDB:
  * use TPC-H or TPC-DS for this purpose
  * use the Star Schema Benchmark (SSB)
* Planning and execution time for queries that span multiple systems (e.g. Postgres and Cassandra, Postgres and Pig, Pig and Cassandra).

We could follow approaches similar to:

* https://www.slideshare.net/julianhyde/w-435phyde-3
* https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_hive-performance-tuning/content/ch_cost-based-optimizer.html
* https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/ (how much of this is still relevant, given it targets Hive 0.14? Can we reuse its queries/benchmarks?)

Please share your suggestions.
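As a starting point for the optimizer-runtime item, here is a minimal sketch of a planning-time benchmark using JMH (which, if I am not mistaken, Calcite already uses in its ubenchmark module). The class, the toy ReflectiveSchema, and the query are made up for illustration; a real benchmark would register TPC-H/TPC-DS tables instead. Note it measures only parsing, validation, and SQL-to-rel conversion; exercising the cost-based phase would additionally require configuring rule sets and calling planner.transform(...).

```java
import java.util.concurrent.TimeUnit;

import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class PlanningBenchmark {

  /** Toy in-memory schema; a real benchmark would register TPC-H tables. */
  public static class Hr {
    public final Emp[] emps = {new Emp(100, "Bill"), new Emp(200, "Eric")};
  }

  /** Row type of the "emps" table; ReflectiveSchema derives columns from fields. */
  public static class Emp {
    public final int empid;
    public final String name;
    public Emp(int empid, String name) {
      this.empid = empid;
      this.name = name;
    }
  }

  private FrameworkConfig config;

  @Setup
  public void setup() {
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    rootSchema.add("hr", new ReflectiveSchema(new Hr()));
    config = Frameworks.newConfigBuilder().defaultSchema(rootSchema).build();
  }

  /**
   * Measures parse + validate + SQL-to-rel conversion (planner construction
   * is included). Cost-based optimization would additionally need rule sets
   * in the config and a call to planner.transform(...).
   */
  @Benchmark
  public RelNode parseValidateConvert() throws Exception {
    Planner planner = Frameworks.getPlanner(config);
    SqlNode parsed = planner.parse(
        "select \"name\" from \"hr\".\"emps\" where \"empid\" > 150");
    SqlNode validated = planner.validate(parsed);
    return planner.rel(validated).project();
  }
}
```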

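And a deliberately naive sketch of the adapter-overhead comparison: the same query is timed once through Calcite's JDBC driver, with a model file that wraps a Postgres database behind a "jdbc" schema, and once directly against Postgres. The model path, database name, and table are assumptions, connection setup is included in the timing, and there is no warm-up, so for real numbers this would also be wrapped in JMH; it is only meant to show the shape of the measurement.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AdapterOverheadSketch {
  // Both URLs are assumptions: model.json would declare a "jdbc" schema
  // (as its default schema) pointing at the same Postgres database that
  // DIRECT connects to.
  private static final String VIA_CALCITE =
      "jdbc:calcite:model=src/test/resources/model.json";
  private static final String DIRECT = "jdbc:postgresql://localhost/tpch";

  /** Runs the query and drains the result set so both paths do equal work. */
  static long timeQueryNanos(String url, String sql) throws Exception {
    long start = System.nanoTime();
    try (Connection c = DriverManager.getConnection(url);
         Statement s = c.createStatement();
         ResultSet r = s.executeQuery(sql)) {
      while (r.next()) {
        // Drain rows; a real harness would also checksum them.
      }
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws Exception {
    // Quoted so the lower-case Postgres table name resolves identically
    // through Calcite's default lexical rules and through Postgres itself.
    String sql = "select count(*) from \"lineitem\"";
    System.out.printf("via Calcite adapter: %d ns%n", timeQueryNanos(VIA_CALCITE, sql));
    System.out.printf("direct Postgres:     %d ns%n", timeQueryNanos(DIRECT, sql));
  }
}
```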