Hi Rajat, this is Hongbin Ma from Apache Kylin. I'm very interested in Quark, which in my opinion shares a lot in common with Kylin. Actually, I believe Kylin itself may benefit from Quark, too. Could you also please share your roadmap with the community? (People may be very interested in how sustainably your company can invest in Quark, etc.)
Do you have a dev mailing list now? I'd love to contribute to the project. A mailing list is what I personally prefer.

On Thu, Jan 21, 2016 at 2:39 PM, Rajat Venkatesh <[email protected]> wrote:
> Amogh and I (developers at Qubole) have been working on a project - Quark -
> https://github.com/qubole/quark/ - to provide a unified view of data spread
> across many databases. Two concrete examples:
> 1. Hot data is stored in a data warehouse (Redshift, Vertica, etc.) and cold
> data is stored in HDFS and accessed through Apache Hive.
> 2. Cubes are stored in Redshift and the base tables are stored in HDFS.
> Similarly, base tables are stored in Redshift and cubes are stored in Postgres.
>
> Data analysts will query hot data or cubes but often have to cross over to
> the cold data or base tables. At scale this setup gets complicated in
> multiple dimensions (no pun intended). Analysts have to keep track of which
> dataset to use, they have to be trained to use different technologies and
> interfaces, etc.
>
> So there is a requirement to provide a single interface to the data spread
> across multiple data stores, e.g. through Tableau or Apache Zeppelin.
> Quark is an optimizer based on Apache Calcite that models these
> relationships as materialized views or lattices and reroutes queries to the
> optimal dataset. Note that Quark is *not* a federation engine. It does not
> join data across databases. It can integrate with Presto or Hive for
> federation, but the preferred option is to run a query in a single database.
>
> This is an example where materialized views and cubes are set up between
> Hive (on EMR) and Redshift:
> https://github.com/qubole/quark/blob/master/examples/EMR.md
>
> Quark relies quite a bit on Apache Calcite for the heavy lifting. It uses
> the optimizer to determine the dataset and database to run the queries
> on. It uses the execution engine to run the query. It is distributed as a
> JDBC jar and uses Avatica for the JDBC implementation.
> We are very grateful to the community and Julian Hyde for all the help.
> We have made a few small contributions as part of building Quark. We are
> pushing Lattices and the Materialization Service to their limits and have
> made major changes. We will start a thread discussing the issues we faced
> and designs to solve them. Hopefully we can contribute them back to the
> project through the usual process.
>
> The cube use case is very similar to Apache Kylin. I looked at Kylin quite
> a bit initially and learned a lot from it. A couple of big differences are
> that Quark does not build or maintain cubes, and it does not insist on
> Hive, Hadoop, and HBase. These requirements were driven by our users, who
> already have an ETL process to build cubes. They've also decided on the
> technologies - Redshift, Postgres, or Oracle, for example. It looks like
> there is a push to generalize Apache Kylin:
> https://issues.apache.org/jira/browse/KYLIN-1351. Finally, we have use
> cases outside of OLAP cubes, like supporting copies of data that are
> sorted differently, such as Vertica's projections, or stored in a fast DWH.
>
> To summarize, we have built another application on top of Apache Calcite
> that our users are very excited about. It's another strong vote for the
> quality and utility of Apache Calcite. We are very happy to be part of the
> community and can hopefully contribute back.

-- 
Regards,
*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
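For readers following along, the rerouting idea in the quoted message - answer a query from the smallest dataset (e.g. a cube in Redshift) that covers its columns, falling back to the base table otherwise - can be sketched roughly as follows. This is a minimal illustration of the concept, not Quark's actual API; all class, table, and column names here are invented for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;

// Illustrative sketch (not Quark's real optimizer): route a query to the
// cheapest registered dataset whose columns can answer it.
public class CubeRouter {
    // A dataset is a named table plus the columns it can serve and its size.
    record Dataset(String name, Set<String> columns, long rowCount) {}

    // Pick the smallest dataset whose columns cover all of the query's columns.
    static String route(Set<String> queryColumns, List<Dataset> datasets) {
        return datasets.stream()
                .filter(d -> d.columns().containsAll(queryColumns))
                .min(Comparator.comparingLong(Dataset::rowCount))
                .map(Dataset::name)
                .orElseThrow(() -> new NoSuchElementException("no dataset covers query"));
    }

    public static void main(String[] args) {
        List<Dataset> datasets = List.of(
                new Dataset("redshift.sales_cube",
                        Set.of("region", "month", "revenue"), 10_000L),
                new Dataset("hive.sales_base",
                        Set.of("region", "month", "revenue", "order_id"), 1_000_000_000L));
        // A rollup query is covered by the cube, so it stays in Redshift:
        System.out.println(route(Set.of("region", "revenue"), datasets));   // redshift.sales_cube
        // A detail query needs order_id, so it falls back to Hive:
        System.out.println(route(Set.of("order_id", "revenue"), datasets)); // hive.sales_base
    }
}
```

The real system works on relational algebra via Calcite's materialized-view rewrite rather than simple column sets, but the routing decision it makes is of this shape: prefer the pre-aggregated copy whenever it can answer the query.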
