Hi Hongbin,

Thanks for the interest. We are still maturing as an open source project :) I am setting up a Google group and will send out the info once it's in place.

WRT the roadmap, the current focus is on maturity. Quark was an experimental project for the longest time, and now that we are starting to onboard users we are unearthing issues. For example, I mentioned the scalability of the materialization service in another thread. In the medium term, we are planning to handle incremental cubes better, to support federation by optimizing the query and then delegating execution to another engine like Presto, and to support more engines.

Qubole is definitely invested in the project for the foreseeable future. I'll set up the Google group, and I can answer any specific questions you have.
On Thu, Jan 21, 2016 at 12:51 PM hongbin ma <[email protected]> wrote:
> hi Rajat
>
> this is Hongbin Ma from Apache Kylin, I'm very interested in Quark, which
> in my opinion shares a lot in common with Quark. Actually I believe Kylin
> it self may benefit from Quark, too. Can you also please share your roadmap
> with the community? (People may be very interested in how sustainable your
> corp can invest on Quark, etc.)
>
> Do you have a dev mail list now? I'd love to contribute to the project.
> Things like mail list is what I personally preferred.
>
>
> On Thu, Jan 21, 2016 at 2:39 PM, Rajat Venkatesh <[email protected]>
> wrote:
>
> > Amogh and I (developers at Qubole) have been working on a project -
> > Quark - https://github.com/qubole/quark/ - to provide a unified view of
> > data spread across many databases. Two concrete examples:
> > 1. Hot data is stored in a data warehouse (Redshift, Vertica etc) and
> > cold data is stored in HDFS and accessed through Apache Hive.
> > 2. Cubes are stored in Redshift and the base tables are stored HDFS.
> > Similarly base tables are stored in Redshift and cubes are stored
> > Postgres.
> >
> > Data analysts will query hot data or cubes but have to often cross over
> > to the cold data or base tables. At scale this setup gets complicated in
> > multiple dimensions (no pun intended). Analysts have to keep track of
> > which dataset to use, they have to be trained to use different
> > technologies and interfaces etc.
> >
> > So there is a requirement to provide a single interface to the data
> > spread across multiple data stores for e.g. through Tableau or Apache
> > Zeppelin. Quark is an optimizer based on Apache Calcite that models these
> > relationships as materialized views or lattices and reroutes queries to
> > the optimal dataset. Note that Quark is *not* a federation engine. It
> > does not join data across databases. It can integrate with Presto or Hive
> > for federation but the preferred option is to run a query in a single
> > database.
> >
> > This is an example where materialized views and cubes are setup between
> > Hive (on EMR) and Redshift:
> > https://github.com/qubole/quark/blob/master/examples/EMR.md
> >
> > Quark relies quite a bit on Apache Calcite for the heavy lifting. It uses
> > the optimizer for determining the dataset and database to run the queries
> > on. It uses the execution engine to run the query. It is distributed as a
> > JDBC jar and it uses Avatica for the JDBC implementation. We are very
> > grateful to the community and Julian Hyde for all the help. We have made
> > a few small contributions as part of the building Quark. We are pushing
> > Lattices and Materialization Service to its limits and have made major
> > changes. We will start a thread discussing the issues we faced and
> > designs to solve it. Hopefully we can contribute it back to the project
> > through the usual process.
> >
> > The cube use case is very similar to Apache Kylin. I looked at Kylin
> > quite a bit initially and learned a lot from it. A couple of big
> > differences are that Quark does not build or maintain cubes and it does
> > not insist on Hive, Hadoop and HBase. These requirements were driven by
> > our users who already have an ETL process to build cubes. They've also
> > decided on the technologies - Redshift, Postgres or Oracle for e.g. Looks
> > like there is a push to generalize Apache Kylin -
> > https://issues.apache.org/jira/browse/KYLIN-1351. Finally we have use
> > cases outside of OLAP cubes like supporting copies of data which are
> > sorted differently like Vertica's projections or stored in a fast DWH.
> >
> > To summarize, we have built another application on top of Apache Calcite
> > that our users are very excited about. Its another strong vote for the
> > quality and utility of Apache Calcite. We are very happy to be part of
> > the community and can hopefully contribute back.
> >
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
