Exactly; thank you, Jiatao, for the comments!

On Thu, Oct 25, 2018 at 6:12 PM, JiaTao Tao <taojia...@gmail.com> wrote:

> As far as I'm concerned, using Parquet as Kylin's storage format is quite
> appropriate. From the perspective of Spark integration, Spark has made many
> optimizations for Parquet; for example, we can benefit from Spark's
> vectorized reading and lazy dictionary decoding.
>
> Here are my thoughts on integrating Spark with our query engine. As
> Shaofeng mentioned, a cuboid is a Parquet file. You can think of it as a
> small table that we read directly as a DataFrame and query with Spark, a
> bit like this:
>
> ss.read.parquet("path/to/CuboidFile").filter("xxx").agg("xxx").select("xxx")
>
> (We need to implement some of Kylin's advanced aggregations ourselves; for
> Kylin's basic aggregations such as sum/min/max, we can use Spark's
> directly.)
>
> *Compared to our old query engine, the advantages are as follows:*
>
> 1. It is distributed. Our old query engine pulls all the data into a single
> query node and computes there; that node is a single point of failure and
> often runs out of memory (OOM) on large data volumes.
>
> 2. It is simple and easy to debug (every step is clear and transparent).
> You can collect the data after every single phase (filter, aggregation,
> projection, etc.), so you can easily check which operation or phase went
> wrong. Our old query engine uses Calcite for post-calculation; pinpointing
> problems there is difficult, especially when code generation is involved,
> and you cannot insert your own logic during computation.
>
> 3. We benefit from all the effort Spark has put into performance
> optimization, e.g. Catalyst and Tungsten.
>
> 4. It is easy to unit test. You can test every step separately, which makes
> the tests for Kylin's query engine more fine-grained.
>
> 5. Thanks to Spark's DataSource API, we can easily switch from Parquet to
> other data formats.
>
> 6. Many upstream tools for Spark, such as machine learning tools, can be
> integrated with us directly.
>
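To make the filter/aggregate/select pipeline and advantages 1 and 2 above a bit more concrete, here is a minimal sketch in plain Python rather than Spark. It is only an illustration, not Kylin or Spark code: the cuboid rows, the columns (city, year, sales) and all function names are made up. Each partition is filtered and partially aggregated on its own, so only small partial results reach the query node, and the output of every phase can be inspected separately.

```python
from collections import defaultdict

# Hypothetical cuboid, split into two partitions of (city, year, sales) rows.
CUBOID_PARTITIONS = [
    [("Beijing", 2017, 10), ("Shanghai", 2018, 20)],
    [("Beijing", 2018, 5), ("Beijing", 2018, 7), ("Shanghai", 2017, 1)],
]


def filter_rows(rows, year):
    """Filter phase: keep only the rows for the requested year."""
    return [row for row in rows if row[1] == year]


def partial_sum(rows):
    """Aggregation phase, run once per partition: sum sales by city."""
    acc = defaultdict(int)
    for city, _year, sales in rows:
        acc[city] += sales
    return dict(acc)


def merge_sums(partials):
    """Merge the small per-partition results on the query node."""
    total = defaultdict(int)
    for part in partials:
        for city, sales in part.items():
            total[city] += sales
    return dict(total)


def query(partitions, year):
    """Run filter -> partial aggregation -> merge, one phase at a time."""
    filtered = [filter_rows(rows, year) for rows in partitions]
    partials = [partial_sum(rows) for rows in filtered]
    return sorted(merge_sums(partials).items())


print(query(CUBOID_PARTITIONS, 2018))
```

Because `filtered` and `partials` are ordinary intermediate values, each phase can be checked or unit tested by itself, which is the point of advantages 2 and 4.
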
> ==================
>
> > Hi Kylin developers,
> >
> > HBase has been Kylin's storage engine since day one. Kylin on HBase has
> > proven successful, supporting low-latency, high-concurrency queries at
> > very large data scale. Thanks to HBase, most Kylin users get an average
> > query response time of under one second.
> >
> > But we also see some limitations when putting cubes into HBase; I shared
> > some of them at HBaseCon Asia 2018 [1] this August. The typical
> > limitations include:
> >
> > - The row key is the primary index; there is no secondary index so far
> >
> > Filtering by a row key's prefix and filtering by its suffix can give very
> > different performance. The user therefore needs to design the row key
> > carefully; otherwise, queries will be slow. This is sometimes difficult
> > because the user may not be able to predict the filtering patterns at
> > cube design time.
> >
> > - HBase is a key-value store rather than a columnar store
> >
> > Kylin combines multiple measures (columns) into fewer column families to
> > reduce data size (the row key size is considerable). As a result, HBase
> > often needs to read more data than the query requested.
> >
> > - HBase cannot run on YARN
> >
> > This makes deployment and auto-scaling a little complicated, especially
> > in the cloud.
> >
> > In short, HBase is a complicated choice as Kylin's storage, and
> > maintaining and debugging it is hard for ordinary developers. We are now
> > planning to look for a simple, lightweight, read-only storage engine for
> > Kylin.
> > The new solution should have the following characteristics:
> >
> > - Columnar layout with compression for efficient I/O;
> > - An index on each column for quick filtering and seeking;
> > - MapReduce / Spark API for parallel processing;
> > - HDFS compliant for scalability and availability;
> > - Mature, stable and extensible.
> >
> > With the plugin architecture [2] introduced in Kylin 1.5, adding multiple
> > storages to Kylin is possible. Some companies, such as Kyligence Inc and
> > Meituan.com, have developed customized storage engines for Kylin in their
> > products or platforms. In their experience, columnar storage is a good
> > supplement to the HBase engine. Kaisen Kang from Meituan.com shared their
> > KOD (Kylin on Druid) solution [3] at this August's Kylin meetup in
> > Beijing.
> >
> > We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> > Parquet is a standard columnar file format and is widely supported by
> > many projects such as Hive, Impala, and Drill. Parquet is adding a
> > page-level column index to support fine-grained filtering. Apache Spark
> > can provide parallel computing over Parquet and can be deployed on YARN,
> > Mesos, and Kubernetes. With this combination, data persistence and
> > computation are separated, which makes scaling in and out much easier
> > than before. Benefiting from Spark's flexibility, we can push down more
> > computation from Kylin to the Hadoop cluster. Besides Parquet, Apache ORC
> > is also a candidate.
> >
> > I am raising this discussion to get your ideas about Kylin's
> > next-generation storage engine. If you have good ideas or any related
> > data, you are welcome to discuss them in the community.
> >
> > Thank you!
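As a rough illustration of why a page-level column index helps with fine-grained filtering, the sketch below keeps a (min, max) pair per page and skips every page whose range cannot contain the filter value. This is a simplified model in plain Python, not Parquet's actual file format or API; the page contents and function names are hypothetical.

```python
def build_page_index(pages):
    """Record (min, max) for every page of one column."""
    return [(min(page), max(page)) for page in pages]


def scan_equal(pages, index, target):
    """Return the matching values and how many pages were actually read."""
    hits = []
    pages_read = 0
    for page, (lo, hi) in zip(pages, index):
        if target < lo or target > hi:
            continue  # the index shows this page cannot match, so skip it
        pages_read += 1
        hits.extend(value for value in page if value == target)
    return hits, pages_read


# Hypothetical column values, split into three pages.
PAGES = [[1, 3, 5], [6, 8, 9], [10, 12, 15]]
INDEX = build_page_index(PAGES)

hits, pages_read = scan_equal(PAGES, INDEX, 8)
print(hits, pages_read)
```

The index cannot rule out every miss (a value that falls inside a page's range but is absent still costs one page read), but pages outside the range are never touched.
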
> >
> > [1] Apache Kylin on HBase
> > https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> > [2] Apache Kylin Plugin Architecture
> > https://kylin.apache.org/development/plugin_arch.html
> > [3] Kylin Storage Engine Practice Based on Druid (基于Druid的Kylin存储引擎实践)
> > https://blog.bcmeng.com/post/kylin-on-druid.html
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi 史少锋
>

--
Best regards,

Shaofeng Shi 史少锋