Hi Yanghong,

Thanks for your question. I don't think it is required that other engines
know how to read Kylin's storage, but it is nice to have if possible. We
can extend the file format if Parquet or ORC can't meet Kylin's
requirements; there is no need to re-invent a new format.
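
As a very rough sketch of how far plain Parquet can go before any extension
is needed (the schema, cuboid ids, and output path below are made up for
illustration): Spark can keep the cuboid identity in the directory layout,
so Kylin-specific info is preserved while any standard Parquet reader still
understands the files.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kylin-cuboid-layout-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy cuboid rows: (cuboid_id, dimension, measure) -- illustrative only.
    val cuboidDf = Seq(
      (255L, "2018-09-28", 100.0),
      (255L, "2018-09-29", 250.0),
      (127L, "2018-09-28", 100.0)
    ).toDF("cuboid_id", "dt", "gmv")

    // One Parquet directory per cuboid: the Kylin-specific cuboid id lives
    // in the layout rather than in a private file format.
    cuboidDf.write
      .partitionBy("cuboid_id")
      .parquet("hdfs:///tmp/kylin_cube_sample_parquet")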

Zhong, Yanghong <yangzh...@ebay.com.invalid> wrote on Sat, Sep 29, 2018, 10:59 AM:

> I have one question about the characteristics of Kylin's columnar storage
> files: should the format be a standard, common one? Since the data stored
> in the storage engine is Kylin-specific, is it necessary for other engines
> to know how to write data into and read data from the storage engine?
>
> In my opinion, it's not necessary, and the Kylin columnar storage files
> should be Kylin-specific. We can leverage the advantages of other columnar
> formats, like data-skipping indexes, bloom filters, and dictionaries, and
> then create a new file format that adds Kylin-specific requirements, like
> cuboid info.
>
> ------
> Best regards,
> Yanghong Zhong
>
>
> On 9/28/18, 2:15 PM, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
>
>     Hi Kylin developers.
>
>     HBase has been Kylin's storage engine since day one; Kylin on HBase
>     has proven to be a success, supporting low-latency, high-concurrency
>     queries at very large data scale. Thanks to HBase, most Kylin users
>     get sub-second query responses on average.
>
>     But we also see some limitations when putting Cubes into HBase; I
>     shared some of them at HBaseCon Asia 2018[1] this August. The typical
>     limitations include:
>        - The rowkey is the primary index; there is no secondary index so far
>
>     Filtering by the rowkey's prefix versus its suffix gives very
>     different performance, so the user needs to design the rowkey
>     carefully; otherwise, queries can be slow. This is sometimes difficult
>     because the user may not be able to predict the filtering patterns
>     ahead of cube design.
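>
>     A minimal sketch of the difference (HBase 2.x client API; the date
>     prefix and country value below are made-up examples, not Kylin's
>     actual rowkey layout):
>
>         import org.apache.hadoop.hbase.CompareOperator
>         import org.apache.hadoop.hbase.client.Scan
>         import org.apache.hadoop.hbase.filter.{RowFilter, SubstringComparator}
>         import org.apache.hadoop.hbase.util.Bytes
>
>         // Fast: the leading dimension is in the rowkey prefix, so HBase
>         // can seek directly to a narrow key range.
>         val byPrefix = new Scan()
>         byPrefix.setRowPrefixFilter(Bytes.toBytes("20180928"))
>
>         // Slow: filtering on a trailing dimension degenerates into a full
>         // table scan with a per-row filter.
>         val bySuffix = new Scan()
>         bySuffix.setFilter(new RowFilter(CompareOperator.EQUAL,
>           new SubstringComparator("US")))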
>
>        - HBase is a key-value store, not a columnar storage
>
>     Kylin combines multiple measures (columns) into fewer column families
>     to reduce data size (the rowkey overhead is significant). As a result,
>     HBase often has to read more data than a query actually requests.
>
>        - HBase can't run on YARN
>
>     This makes deployment and auto-scaling somewhat complicated,
>     especially in the cloud.
>
>     In short, HBase is a complicated storage for Kylin, and its
>     maintenance and debugging are hard for normal developers. Now we're
>     planning to seek a simple, lightweight, read-only storage engine for
>     Kylin. The new solution should have the following characteristics:
>
>        - Columnar layout with compression for efficient I/O;
>        - An index on each column for quick filtering and seeking;
>        - MapReduce / Spark API for parallel processing;
>        - HDFS-compliant for scalability and availability;
>        - Mature, stable, and extensible.
>
>     With the plugin architecture[2] introduced in Kylin 1.5, adding
>     multiple storage engines to Kylin is possible. Some companies, like
>     Kyligence Inc. and Meituan.com, have developed customized storage
>     engines for Kylin in their products or platforms. In their experience,
>     columnar storage is a good supplement to the HBase engine. Kaisen Kang
>     from Meituan.com shared their KOD (Kylin on Druid) solution[3] at this
>     August's Kylin meetup in Beijing.
>
>     We plan to do a PoC with Apache Parquet + Apache Spark in the next
>     phase. Parquet is a standard columnar file format and is widely
>     supported by many projects like Hive, Impala, Drill, etc. Parquet is
>     adding a page-level column index to support fine-grained filtering.
>     Apache Spark provides parallel computing over Parquet and can be
>     deployed on YARN/Mesos and Kubernetes. With this combination, data
>     persistence and computation are separated, which makes scaling in/out
>     much easier than before. Benefiting from Spark's flexibility, we can
>     also push down more computation from Kylin to the Hadoop cluster.
>     Besides Parquet, Apache ORC is also a candidate.
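>
>     A rough illustration of what the query side could look like with this
>     combination (the column names and cuboid path are hypothetical, only
>     to show columnar pruning and predicate push-down):
>
>         import org.apache.spark.sql.SparkSession
>         import org.apache.spark.sql.functions.sum
>
>         val spark = SparkSession.builder()
>           .appName("kylin-parquet-poc")
>           .getOrCreate()
>         import spark.implicits._
>
>         // Only dt, seller_id, and gmv are read (columnar pruning); the dt
>         // predicate is pushed down to Parquet row-group/page statistics;
>         // the aggregation runs in parallel on Spark executors, which can
>         // be scheduled on YARN, Mesos, or Kubernetes.
>         spark.read.parquet("hdfs:///kylin/cube_sample/cuboid_255")
>           .filter($"dt" === "2018-09-28")
>           .groupBy($"seller_id")
>           .agg(sum($"gmv"))
>           .show()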
>
>     Now I'm raising this discussion to get your ideas about Kylin's
>     next-generation storage engine. If you have good ideas or any related
>     data, you're welcome to discuss them in the community.
>
>     Thank you!
>
>     [1] Apache Kylin on HBase:
>     https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
>     [2] Apache Kylin Plugin Architecture:
>     https://kylin.apache.org/development/plugin_arch.html
>     [3] Practice of a Druid-based storage engine for Kylin (in Chinese):
>     https://blog.bcmeng.com/post/kylin-on-druid.html
>
>     Best regards,
>
>     Shaofeng Shi 史少锋
>
>
>

-- 
Best regards,

Shaofeng Shi 史少锋
