Amogh and I (developers at Qubole) have been working on a project - Quark -
https://github.com/qubole/quark/ - to provide a unified view of data spread
across many databases. Two concrete examples:
1. Hot data is stored in a data warehouse (Redshift, Vertica, etc.) and cold
data is stored in HDFS and accessed through Apache Hive.
2. Cubes are stored in Redshift and the base tables are stored in HDFS.
Similarly, base tables may be stored in Redshift and cubes in Postgres.

Data analysts will query hot data or cubes but often have to cross over to
the cold data or base tables. At scale this setup gets complicated in
multiple dimensions (no pun intended): analysts have to keep track of which
dataset to use, and they have to be trained on different technologies and
interfaces.

So there is a requirement to provide a single interface to the data spread
across multiple data stores, e.g. through Tableau or Apache Zeppelin.
Quark is an optimizer based on Apache Calcite that models these
relationships as materialized views or lattices and reroutes queries to the
optimal dataset. Note that Quark is *not* a federation engine: it does not
join data across databases. It can integrate with Presto or Hive for
federation, but the preferred option is to run a query in a single database.
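As a rough illustration of the rerouting idea (not Quark's actual
implementation, which works on Calcite's relational algebra and cost-based
matching rather than SQL strings), a registry of materialized views can
redirect a query from a base table to the copy that answers it fastest.
All table names below are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: Quark's real rewriting is done by the Apache
// Calcite optimizer over relational algebra, not by string substitution.
public class ViewRouter {
    // Map from a base table to the materialized view (possibly in another
    // database) that holds the same, or pre-aggregated, data.
    private final Map<String, String> views = new HashMap<>();

    public void registerView(String baseTable, String materializedView) {
        views.put(baseTable, materializedView);
    }

    // Reroute a query to the optimal dataset when a view covers it.
    public String route(String sql) {
        for (Map.Entry<String, String> e : views.entrySet()) {
            if (sql.contains(e.getKey())) {
                return sql.replace(e.getKey(), e.getValue());
            }
        }
        return sql; // no view applies: run against the base table
    }

    public static void main(String[] args) {
        ViewRouter router = new ViewRouter();
        // Hypothetical names: hot copy in Redshift, cold data in Hive.
        router.registerView("hive.web_logs", "redshift.web_logs_hot");
        System.out.println(
            router.route("SELECT COUNT(*) FROM hive.web_logs"));
    }
}
```

In Quark itself, Calcite's materialized-view and lattice machinery decides
whether a view actually satisfies the query, so the rewrite is semantic
rather than textual as in this toy.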

This is an example where materialized views and cubes are set up between
Hive (on EMR) and Redshift:
https://github.com/qubole/quark/blob/master/examples/EMR.md

Quark relies quite a bit on Apache Calcite for the heavy lifting. It uses
the optimizer to determine the dataset and database to run queries on, and
the execution engine to run the query. It is distributed as a JDBC jar and
uses Avatica for the JDBC implementation. We are very grateful to the
community and to Julian Hyde for all the help. We have made a few small
contributions as part of building Quark. We are pushing Lattices and the
Materialization Service to their limits and have made major changes. We
will start a thread discussing the issues we faced and our designs to solve
them. Hopefully we can contribute these back to the project through the
usual process.
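Since Quark ships as a JDBC jar (with Avatica underneath), a client uses it
like any other JDBC driver. The sketch below shows the shape of that usage;
the URL scheme and the `model` property name are assumptions for
illustration, not Quark's documented interface (consult the project README
for the actual values):

```java
import java.util.Properties;

// Sketch of how an application would connect to Quark through its JDBC
// jar. The URL scheme and property names here are assumptions.
public class QuarkJdbcSketch {

    // Assumed URL scheme for the Quark driver.
    static String buildUrl() {
        return "jdbc:quark:";
    }

    static Properties buildProps() {
        Properties props = new Properties();
        // Hypothetical property pointing at a catalog of data sources
        // (Hive, Redshift, Postgres, ...) and their view relationships.
        props.setProperty("model", "/path/to/model.json");
        return props;
    }

    public static void main(String[] args) {
        // In a real deployment, with the Quark JDBC jar on the classpath:
        //   Connection conn =
        //       DriverManager.getConnection(buildUrl(), buildProps());
        //   ResultSet rs = conn.createStatement()
        //       .executeQuery("SELECT ...");
        // Shown here without connecting so the sketch runs standalone.
        System.out.println(buildUrl() + " " + buildProps());
    }
}
```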

The cube use case is very similar to Apache Kylin's. I looked at Kylin
quite a bit initially and learned a lot from it. A couple of big
differences are that Quark does not build or maintain cubes, and it does
not insist on Hive, Hadoop, and HBase. These requirements were driven by
our users, who already have an ETL process to build cubes. They have also
decided on the technologies, e.g. Redshift, Postgres, or Oracle. It looks
like there is a push to generalize Apache Kylin:
https://issues.apache.org/jira/browse/KYLIN-1351. Finally, we have use
cases outside of OLAP cubes, such as supporting copies of data that are
sorted differently, like Vertica's projections, or stored in a fast DWH.

To summarize, we have built another application on top of Apache Calcite
that our users are very excited about. It's another strong vote for the
quality and utility of Apache Calcite. We are very happy to be part of the
community and hope to contribute back.
