I have one question about the nature of Kylin's columnar storage files: 
should the format be a standard, commonly used one? Since the data stored in 
the storage engine is Kylin-specific, is it necessary for other engines to 
know how to build data into and read data from the storage engine? 

In my opinion, it's not necessary, and Kylin's columnar storage files should 
be Kylin-specific. We can borrow the strengths of other columnar file formats, 
such as data-skipping indexes, bloom filters, and dictionaries, and then create 
a new file format that covers Kylin-specific requirements, such as cuboid info.
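
To make the idea concrete, here is a minimal sketch of what one column chunk in such a file could hold: a cuboid id, a dictionary-encoded column, and a bloom filter for skipping. The names `BloomFilter`, `ColumnChunk`, `cuboid_id`, and `rows_equal_to` are hypothetical and only illustrate the idea; this is not an existing Kylin format.

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


@dataclass
class ColumnChunk:
    """One dictionary-encoded column of one cuboid (hypothetical layout)."""
    cuboid_id: int
    name: str
    dictionary: list = field(default_factory=list)  # sorted distinct values
    codes: list = field(default_factory=list)       # one dictionary index per row
    bloom: BloomFilter = field(default_factory=BloomFilter)

    @classmethod
    def build(cls, cuboid_id, name, values):
        dictionary = sorted(set(values))
        index = {v: i for i, v in enumerate(dictionary)}
        chunk = cls(cuboid_id, name, dictionary, [index[v] for v in values])
        for v in dictionary:
            chunk.bloom.add(v)
        return chunk

    def rows_equal_to(self, value):
        # Skip decoding entirely when the Bloom filter rules the value out.
        if not self.bloom.might_contain(value):
            return []
        if value not in self.dictionary:  # Bloom filter false positive
            return []
        code = self.dictionary.index(value)
        return [row for row, c in enumerate(self.codes) if c == code]


chunk = ColumnChunk.build(cuboid_id=3, name="city", values=["sh", "bj", "sh", "sz"])
print(chunk.dictionary, chunk.codes, chunk.rows_equal_to("sh"))
```

A reader that only understands this layout needs nothing beyond the chunk itself, which is the trade-off in question: the format is free to carry cuboid metadata, but other engines cannot read it without Kylin-specific code.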

------
Best regards,
Yanghong Zhong


On 9/28/18, 2:15 PM, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:

    Hi Kylin developers.
    
    HBase has been Kylin’s storage engine since the first day; Kylin on HBase
    has proven successful, supporting low-latency, high-concurrency queries at
    a very large data scale. Thanks to HBase, most Kylin users get an average
    query response time of less than 1 second.
    
    But we also see some limitations when putting Cubes into HBase; I shared
    some of them at HBaseConf Asia 2018[1] this August. The typical
    limitations include:
    
       - The row key is the primary index; there is no secondary index so far;
    
    Filtering by the row key’s prefix and by its suffix can give very different
    performance. So the user needs to design the row key carefully; otherwise,
    queries will be slow. This is sometimes difficult because the user may not
    be able to predict the filtering patterns before designing the cube.
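
The prefix/suffix difference can be illustrated with a minimal sketch over a sorted list of keys. The key layout `<country>|<device>` and the function names are assumptions made for illustration only; this is not HBase or Kylin code.

```python
from bisect import bisect_left

# Hypothetical row keys: two dimensions concatenated as "<country>|<device>",
# kept sorted the way a store ordered by row key would keep them.
ROW_KEYS = sorted(
    f"{country}|{device}"
    for country in ("CN", "DE", "US")
    for device in ("android", "ios", "web")
)


def scan_by_prefix(keys, prefix):
    """Seek to the first candidate, then read until the prefix stops matching."""
    start = bisect_left(keys, prefix)
    matches, touched = [], 0
    for key in keys[start:]:
        touched += 1
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches, touched


def scan_by_suffix(keys, suffix):
    """The sort order does not help, so every key has to be examined."""
    matches = [key for key in keys if key.endswith(suffix)]
    return matches, len(keys)


print(scan_by_prefix(ROW_KEYS, "DE|"))   # touches only the keys around the DE range
print(scan_by_suffix(ROW_KEYS, "|ios"))  # touches every key
```

Both filters return three keys here, but the prefix scan examines four keys while the suffix scan examines all nine; on a real cube the gap grows with the number of rows.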
    
       - HBase is a key-value store rather than columnar storage
    
    Kylin combines multiple measures (columns) into fewer column families to
    reduce data size (the row key size is significant). As a result, HBase
    often has to read more data than a query requests.
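
A rough sketch of this read amplification, assuming three hypothetical measures packed into one cell value per row versus one contiguous array per measure. The encoding and names are illustrative only and do not reflect Kylin's actual HBase layout.

```python
import struct

# Hypothetical cube rows: (row_key, sum_price, count, max_qty)
ROWS = [("k1", 10.0, 2, 5), ("k2", 7.5, 1, 3), ("k3", 20.0, 4, 9)]
MEASURES = ("sum_price", "count", "max_qty")
PACKED = struct.Struct("<dqq")  # all three measures stored in one cell value


def read_from_packed_family(measure):
    """Key-value style: each cell holds every measure, so all of them are read."""
    idx = MEASURES.index(measure)
    cells = [PACKED.pack(*row[1:]) for row in ROWS]
    bytes_read = sum(len(cell) for cell in cells)
    values = [PACKED.unpack(cell)[idx] for cell in cells]
    return values, bytes_read


def read_from_columnar(measure):
    """Columnar style: only the requested measure's array is read."""
    idx = MEASURES.index(measure)
    fmt = struct.Struct("<d" if measure == "sum_price" else "<q")
    column = b"".join(fmt.pack(row[idx + 1]) for row in ROWS)
    values = [v for (v,) in fmt.iter_unpack(column)]
    return values, len(column)


print(read_from_packed_family("count"))
print(read_from_columnar("count"))
```

Both calls return the same values for `count`, but the packed layout reads the bytes of all three measures while the columnar layout reads a third of that. Row key bytes are left out of the count, so the real overhead in a key-value store would be larger.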
    
       - HBase can't run on YARN
    
    This makes deployment and auto-scaling a little complicated, especially
    in the cloud.
    
    In short, HBase is a complicated choice for Kylin’s storage. Maintenance
    and debugging are also hard for ordinary developers. We are now looking
    for a simple, lightweight, read-only storage engine for Kylin. The new
    solution should have the following characteristics:
    
       - Columnar layout with compression for efficient I/O;
       - Index by each column for quick filtering and seeking;
       - MapReduce / Spark API for parallel processing;
       - HDFS compliant for scalability and availability;
       - Mature, stable and extensible;
    
    With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
    storages to Kylin is possible. Some companies, like Kyligence Inc and
    Meituan.com, have developed their own customized storage engines for Kylin
    in their products or platforms. In their experience, columnar storage is a
    good supplement to the HBase engine. Kaisen Kang from Meituan.com shared
    their KOD (Kylin on Druid) solution[3] at this August’s Kylin meetup in
    Beijing.
    
    We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
    Parquet is a standard columnar file format and is widely supported by
    many projects like Hive, Impala, Drill, etc. Parquet is adding a
    page-level column index to support fine-grained filtering. Apache Spark
    can provide parallel computing over Parquet and can be deployed on YARN,
    Mesos, and Kubernetes. With this combination, data persistence and
    computation are separated, which makes scaling in/out much easier than
    before. Thanks to Spark's flexibility, we can also push down more
    computation from Kylin to the Hadoop cluster. Besides Parquet, Apache
    ORC is also a candidate.
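
The idea behind a page-level column index can be sketched as follows. This is a simplified illustration of min/max page skipping, not the Parquet API or file layout; `Page`, `build_pages`, and `filter_range` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Page:
    values: list
    min_value: int
    max_value: int


def build_pages(values, page_size):
    """Split a column into pages and record min/max statistics per page."""
    pages = []
    for i in range(0, len(values), page_size):
        chunk = values[i:i + page_size]
        pages.append(Page(chunk, min(chunk), max(chunk)))
    return pages


def filter_range(pages, low, high):
    """Return values in [low, high] and the number of pages actually read."""
    hits, pages_read = [], 0
    for page in pages:
        if page.max_value < low or page.min_value > high:
            continue  # statistics prove the page has no match, so skip it
        pages_read += 1
        hits.extend(v for v in page.values if low <= v <= high)
    return hits, pages_read


pages = build_pages(list(range(100)), page_size=10)
print(filter_range(pages, 42, 47))
```

With 100 sorted values in pages of 10, the filter for 42 to 47 reads one page out of ten. The benefit depends on how well the column is clustered; on unsorted data most pages would overlap the range and little would be skipped.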
    
    I am raising this discussion to get your ideas about Kylin’s
    next-generation storage engine. If you have good ideas or any related
    data, you are welcome to discuss them in the community.
    
    Thank you!
    
    [1] Apache Kylin on HBase
    https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
    [2] Apache Kylin Plugin Architecture
    https://kylin.apache.org/development/plugin_arch.html
    [3] Practice of a Druid-based Kylin storage engine (基于Druid的Kylin存储引擎实践)
    https://blog.bcmeng.com/post/kylin-on-druid.html
    
    --
    Best regards,
    
    Shaofeng Shi 史少锋
    
