Amogh and I (developers at Qubole) have been working on a project - Quark - https://github.com/qubole/quark/ - to provide a unified view of data spread across many databases. Two concrete examples:

1. Hot data is stored in a data warehouse (Redshift, Vertica, etc.) and cold data is stored in HDFS and accessed through Apache Hive.
2. Cubes are stored in Redshift and the base tables are stored in HDFS. Similarly, base tables may be stored in Redshift and the cubes in Postgres.
Data analysts mostly query the hot data or the cubes, but often have to cross over to the cold data or the base tables. At scale this setup gets complicated in multiple dimensions (no pun intended): analysts have to keep track of which dataset to use, and they have to be trained on different technologies and interfaces. So there is a requirement to provide a single interface - e.g. through Tableau or Apache Zeppelin - to the data spread across multiple data stores.

Quark is an optimizer based on Apache Calcite that models these relationships as materialized views or lattices and reroutes queries to the optimal dataset. Note that Quark is *not* a federation engine: it does not join data across databases. It can integrate with Presto or Hive for federation, but the preferred option is to run a query in a single database. Here is an example where materialized views and cubes are set up between Hive (on EMR) and Redshift: https://github.com/qubole/quark/blob/master/examples/EMR.md

Quark relies quite a bit on Apache Calcite for the heavy lifting. It uses the optimizer to determine the dataset and database to run queries on, and the execution engine to run the query. It is distributed as a JDBC jar and uses Avatica for the JDBC implementation. We are very grateful to the community and to Julian Hyde for all the help. We have made a few small contributions as part of building Quark. We are pushing Lattices and the Materialization Service to their limits and have made major changes. We will start a thread discussing the issues we faced and our designs to solve them. Hopefully we can contribute the work back to the project through the usual process.

The cube use case is very similar to Apache Kylin's. I looked at Kylin quite a bit initially and learned a lot from it. A couple of big differences are that Quark does not build or maintain cubes, and it does not insist on Hive, Hadoop and HBase. These requirements were driven by our users, who already have an ETL process to build cubes.
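To make the materialized-view idea concrete, here is a rough sketch in the style of a Calcite model file (schema and table names are hypothetical, and Quark's actual configuration format differs), declaring that a hot Redshift table materializes a filtered view over a cold Hive table so the planner can substitute it:

```json
{
  "version": "1.0",
  "defaultSchema": "HIVE",
  "schemas": [
    {
      "name": "HIVE",
      "materializations": [
        {
          "view": "SALES_RECENT",
          "table": "REDSHIFT_SALES_HOT",
          "sql": "SELECT * FROM SALES WHERE sale_date >= DATE '2015-01-01'"
        }
      ]
    }
  ]
}
```

With a declaration like this, a query over the cold SALES table whose predicates fall within the hot date range can be rewritten by the optimizer to run entirely against the hot copy.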
They have also decided on the technologies - Redshift, Postgres or Oracle, for example. It looks like there is a push to generalize Apache Kylin as well - https://issues.apache.org/jira/browse/KYLIN-1351. Finally, we have use cases outside of OLAP cubes, like supporting copies of data that are sorted differently (similar to Vertica's projections) or stored in a faster data warehouse.

To summarize, we have built another application on top of Apache Calcite that our users are very excited about. It's another strong vote for the quality and utility of Apache Calcite. We are very happy to be part of the community and hope to contribute back.
