Re: Re: [DISCUSS] Columnar storage engine for Apache Kylin

ShaoFeng Shi Sat, 29 Sep 2018 00:10:11 -0700

Hi Gang, very good questions; this is exactly why we need to raise such a
discussion publicly. Please check my comments below, starting with
[shaofengshi]. Feel free to comment.

1. Is it possible to locate a cuboid quickly in a Parquet file? How do we
save cuboid metadata in Parquet's FileMetaData: just in the metadata's
key/value pairs?

[shaofengshi]: There are a couple of ways to achieve this.

A simple way is to organize different cuboids into different files. The
cuboid ID can be used as a subfolder name (if the number of cuboids is not
too large) or as the file prefix; this solution may produce many small
files when the cube is small. Another solution is to combine multiple
cuboids' data in one file (sharding). In this case, the cuboid ID can be a
column.
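
To make the two layouts concrete, here is a minimal sketch. The function
names, the path pattern and the modulo sharding rule are only illustrative
assumptions, not existing Kylin code:

```python
from collections import defaultdict


def cuboid_file_path(segment_dir: str, cuboid_id: int, part: int) -> str:
    """Layout A: one subfolder per cuboid, so locating a cuboid is a path lookup.

    The directory pattern is hypothetical; it only illustrates the idea.
    """
    return f"{segment_dir}/{cuboid_id}/part-{part:05d}.parquet"


def shard_rows(rows, num_shards: int):
    """Layout B: several cuboids share one file; the cuboid ID becomes a column.

    `rows` is an iterable of (cuboid_id, values_dict). The modulo rule is an
    assumed sharding function, chosen only to keep the example short.
    """
    shards = defaultdict(list)
    for cuboid_id, values in rows:
        shard = cuboid_id % num_shards
        shards[shard].append({"cuboid_id": cuboid_id, **values})
    return dict(shards)


if __name__ == "__main__":
    print(cuboid_file_path("/kylin/cube1/seg1", 255, 0))
    print(shard_rows([(4, {"m": 10}), (7, {"m": 20})], num_shards=2))
```

With layout B, a reader would filter on the cuboid_id column to get the
rows of one cuboid.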

2. I notice that there is a schema field in Parquet's FileMetaData, but in a
cube, different cuboids have different schemas, so do we just save the basic
cuboid schema in the schema field? Will this cause storage waste?

[shaofengshi]: To keep it simple, we can use the same schema for all
cuboids. There will be some metadata overhead because some columns are
empty, but it is minor compared with the data size. With different
schemas, the code would be a little more complicated. This needs some
verification, I think; it is not decided yet.
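
A small sketch of the "same schema for all cuboids" idea follows. The
dimension and measure names are made up, and the helper is hypothetical;
it only shows that dimensions absent from a cuboid become empty (None)
columns:

```python
# Hypothetical cube definition, for illustration only.
ALL_DIMENSIONS = ["country", "city", "category"]
MEASURES = ["sum_price"]


def to_unified_row(cuboid_dims, dim_values, measure_values):
    """Pad a cuboid row to the full cube schema.

    Dimensions that the cuboid does not contain are set to None, so every
    cuboid can be written with one shared schema.
    """
    present = dict(zip(cuboid_dims, dim_values))
    row = {dim: present.get(dim) for dim in ALL_DIMENSIONS}
    row.update(zip(MEASURES, measure_values))
    return row


if __name__ == "__main__":
    # A cuboid that only has (country, category): "city" stays empty.
    print(to_unified_row(["country", "category"], ["CN", "book"], [12.5]))
```

The all-null columns are the overhead mentioned above.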

3. Can Parquet be extended easily to add an index, like a bitmap index or a
B-tree index for each column?

[shaofengshi]: I think it is extensible. Parquet is adding the column page
index, which is a lightweight index. We could follow the same approach to
implement another type of index page, but that involves many changes, so we
should be very careful. Or, we can store the index separately in another
fast storage. Actually, I'm not sure whether a bitmap index is worth
building, because Kylin already builds many cuboids. For large cuboids, we
can sort the data by the high cardinality columns; in that case, the
file/rowgroup/page level min/max indices might be enough for filtering. For
small cuboids (no high cardinality columns), as the file is small,
Parquet's dictionary filtering should be good enough. I believe the cube
planner will play an important role in finding out which cuboids are worth
calculating. This is where Kylin differs from other engines: we can do more
things in the build phase.
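
To illustrate why sorting plus min/max statistics can be enough, here is a
toy model of row-group pruning. It does not use the Parquet library; the
RowGroupStats class and both functions are hypothetical and only mimic the
idea of skipping groups whose [min, max] range cannot contain the filter
value:

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    min_value: int
    max_value: int
    rows: list


def build_row_groups(values, group_size):
    """Sort the column first, then cut it into groups with min/max stats."""
    ordered = sorted(values)
    groups = []
    for start in range(0, len(ordered), group_size):
        chunk = ordered[start:start + group_size]
        groups.append(RowGroupStats(chunk[0], chunk[-1], chunk))
    return groups


def scan_equal(groups, target):
    """Return (matches, groups_read) for the filter `column == target`."""
    matches, groups_read = [], 0
    for group in groups:
        if target < group.min_value or target > group.max_value:
            continue  # pruned by the min/max statistics, never read
        groups_read += 1
        matches.extend(v for v in group.rows if v == target)
    return matches, groups_read


if __name__ == "__main__":
    groups = build_row_groups(range(100), group_size=10)
    print(scan_equal(groups, 42))
```

Because the data is sorted, the ranges of the groups do not overlap, so in
this toy example an equality filter reads one group out of ten. Without
sorting, most groups would have wide ranges and little could be skipped.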

4. Do we need to build an RPC server? If we just use YARN to schedule Spark
tasks to do queries, starting/stopping the JVM may take seconds, and then
most queries will be slower than with HBase. Of course, it is more
scalable, and some queries may be faster.

[shaofengshi]: We can have a long-running Spark application acting as the
query engine; when Kylin receives a query, it submits it to Spark without
the startup overhead. Kyligence has this in their product, so there is no
risk in this.
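
The long-running idea can be sketched without Spark at all: start the
worker once, then hand it one query after another. Everything below (the
class name, the `run_query` callable) is a hypothetical stand-in; in the
real design the worker would be a Spark application that stays alive
between queries:

```python
import queue
import threading


class LongRunningEngine:
    """Toy stand-in for a query engine that is started once and reused."""

    def __init__(self, run_query):
        self._run_query = run_query      # assumed callable: sql -> result
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()             # startup cost is paid only here

    def _loop(self):
        while True:
            item = self._requests.get()
            if item is None:             # stop signal
                break
            sql, reply = item
            try:
                reply.put((True, self._run_query(sql)))
            except Exception as exc:     # hand errors back to the caller
                reply.put((False, exc))

    def submit(self, sql):
        """Send a query to the already running worker and wait for the result."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((sql, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def stop(self):
        self._requests.put(None)
        self._worker.join()


if __name__ == "__main__":
    engine = LongRunningEngine(lambda sql: sql.upper())
    print(engine.submit("select 1"))
    engine.stop()
```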

1. Use a customized columnar format. It is more flexible: we can add
Kylin-specific concepts to the storage, like a cuboid, and it will be easy
to add different types of indexes as we need. The disadvantage is that it
needs more effort to define and develop the format (we cannot leverage an
existing library to read/write, and need to take care of compression); also,
the cube data files cannot be used by other projects (do we have this need?).

[shaofengshi]: I agree with you, this is a trade-off. If we want to
leverage other projects' achievements, we should collaborate with them
directly. When they grow, we grow. If we build our own wheel, we need to
face many challenges, and one day we may fall behind. The key success
factor of Kylin is, in my mind, that we play with a group of successful
projects: Hive, MapReduce, HBase, Calcite, Spark, etc. This lets us focus
more on our own value, and because of this, we are easily accepted by users.

2. Use local storage rather than HDFS, like Kudu/Druid/ClickHouse. The
advantage of this solution is that the query performance will be very good,
and everything can be controlled by Kylin. The disadvantage is that it needs
more development effort, especially for cluster management, failover, and
scalability.

[shaofengshi]: This is also an important consideration. Using HDFS instead
of local disk, we may lose data locality and some performance, but we gain
scalability and stability; sometimes this is more important than
performance. Regarding the data reading performance, there are some ways
to improve it on HDFS; for example, when using a relatively small,
dedicated HDFS/Spark cluster for Kylin's queries, Spark can allocate an
executor on the same machine as the data block to get data locality.
Besides, some cache technologies like Alluxio and Ignite can provide a
layered (memory -> ssd -> hdd), memory-speed LRU cache for HDFS files; they
are transparent to the application and make the architecture flexible.
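
For readers not familiar with LRU caching, the following is a minimal,
single-tier sketch of the eviction policy. It is not the Alluxio or Ignite
API; the class and the `load_block` callable (standing in for a slow HDFS
read) are assumptions made for illustration, and the memory -> ssd -> hdd
layering is left out:

```python
from collections import OrderedDict


class LruBlockCache:
    """Keep the most recently used file blocks in memory."""

    def __init__(self, capacity, load_block):
        self.capacity = capacity
        self.load_block = load_block   # assumed callable: block_id -> bytes/str
        self.blocks = OrderedDict()    # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = self.load_block(block_id)       # slow path, e.g. remote read
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used
        return data


if __name__ == "__main__":
    cache = LruBlockCache(2, lambda block_id: f"data-{block_id}")
    for block in ["a", "b", "a", "c", "b"]:
        cache.read(block)
    print(cache.hits, cache.misses, list(cache.blocks))
```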

Ma Gang <[email protected]> wrote on Sat, Sep 29, 2018 at 12:32 PM:

> I like parquet, it is very efficient format and supported by various
> projects, but there are some questions if we use parquet as the cube
> storage format:
>
>
> 1. Is it possible to locate a cuboid quickly in a parquet file? How to
> save cuboid metadata info in the parquet's FileMetaData, just in the
> metadata's key/value pair?
>
>
> 2. I notice that there is schema field in parquet's FileMetaData, but in a
> cube, different cuboids have different schemas, so we just save the basic
> cuboid schema in the schema field?  Will this cause storage waste?
>
>
> 3. Can parquet support extension to add index easily, like bitmap index or
> B tree index for each column?
>
>
> 4. Do we need to build rpc server? if just use yarn to schedule spark
> tasks to do query, start/stop jvm may take seconds, then most queries will
> be slower than using HBase. Of course, it is more scalable, and some
> queries maybe faster.
>
>
> Besides using parquet/orc, I think there are two other options:
>
>
> 1. Use customized columnar format, it is more flexible, we can add Kylin
> specific concepts in the storage, like cuboid, etc. also it will be easy to
> add different type index as we need. The disadvantage is need more effort
> to define the format and development(cannot leverage existing lib to
> read/write, and need to take care of compression), also cube data file
> cannot be used by other projects(Do we have this needs?).
>
>
> 2. Use local storage rather than HDFS, like Kudu/Druid/ClickHouse.
> Advantage of this solution is the query performance will be very good, and
> everything can be controlled by Kylin. Disadvantage is need more effort to do
> the development, especially for the cluster management, fail over,
> scalability.
>
> At 2018-09-29 10:53:35, "Zhong, Yanghong" <[email protected]>
> wrote:
> >I have one question about the characteristics of Kylin columnar storage
> >files. That is whether it should be a standard or common one. Since the
> >data stored in the storage engine is Kylin specified, is it necessary for
> >other engines to know how to build data into and how to read data from the
> >storage engine?
> >
> >In my opinion, it's not necessary. And Kylin columnar storage files
> >should be Kylin specified. We can leverage the advantages of other columnar
> >files, like data skip indexes, bloom filters, dictionaries. Then create a
> >new file format with Kylin specified requirements, like cuboid info.
> >
> >------
> >Best regards,
> >Yanghong Zhong
> >
> >
> >On 9/28/18, 2:15 PM, "ShaoFeng Shi" <[email protected]> wrote:
> >
> >    Hi Kylin developers.
> >
> >    HBase has been Kylin’s storage engine since the first day; Kylin on
> >    HBase has been verified as a success which can support low latency &
> >    high concurrency queries on a very large data scale. Thanks to HBase,
> >    most Kylin users can get on average less than 1-second query response.
> >
> >    But we also see some limitations when putting Cubes into HBase; I
> >    shared some of them in the HBaseConf Asia 2018[1] this August. The
> >    typical limitations include:
> >
> >       - Rowkey is the primary index, no secondary index so far;
> >
> >    Filtering by row key’s prefix and suffix can get very different
> >    performance result. So the user needs to do a good design about the
> >    row key; otherwise, the query would be slow. This is difficult
> >    sometimes because the user might not predict the filtering patterns
> >    ahead of cube design.
> >
> >       - HBase is a key-value instead of a columnar storage
> >
> >    Kylin combines multiple measures (columns) into fewer column families
> >    for smaller data size (row key size is remarkable). This causes HBase
> >    often needing to read more data than requested.
> >
> >       - HBase couldn't run on YARN
> >
> >    This makes the deployment and auto-scaling a little complicated,
> >    especially in the cloud.
> >
> >    In one word, HBase is complicated to be Kylin’s storage. The
> >    maintenance, debugging is also hard for normal developers. Now we’re
> >    planning to seek a simple, light-weighted, read-only storage engine
> >    for Kylin. The new solution should have the following characteristics:
> >
> >       - Columnar layout with compression for efficient I/O;
> >       - Index by each column for quick filtering and seeking;
> >       - MapReduce / Spark API for parallel processing;
> >       - HDFS compliant for scalability and availability;
> >       - Mature, stable and extensible;
> >
> >    With the plugin architecture[2] introduced in Kylin 1.5, adding
> >    multiple storages to Kylin is possible. Some companies like Kyligence
> >    Inc and Meituan.com, have developed their customized storage engine
> >    for Kylin in their product or platform. In their experience, columnar
> >    storage is a good supplement for the HBase engine. Kaisen Kang from
> >    Meituan.com has shared their KOD (Kylin on Druid) solution[3] in this
> >    August’s Kylin meetup in Beijing.
> >
> >    We plan to do a PoC with Apache Parquet + Apache Spark in the next
> >    phase. Parquet is a standard columnar file format and has been widely
> >    supported by many projects like Hive, Impala, Drill, etc. Parquet is
> >    adding the page level column index to support fine-grained filtering.
> >    Apache Spark can provide the parallel computing over Parquet and can
> >    be deployed on YARN/Mesos and Kubernetes. With this combination, the
> >    data persistence and computation are separated, which makes the
> >    scaling in/out much easier than before. Benefiting from Spark's
> >    flexibility, we can not only push down more computation from Kylin to
> >    the Hadoop cluster. Except for Parquet, Apache ORC is also a candidate.
> >
> >    Now I raise this discussion to get your ideas about Kylin’s
> >    next-generation storage engine. If you have good ideas or any related
> >    data, welcome discuss in the community.
> >
> >    Thank you!
> >
> >    [1] Apache Kylin on HBase
> >    https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> >    [2] Apache Kylin Plugin Architecture
> >    https://kylin.apache.org/development/plugin_arch.html
> >    [3] Kylin Storage Engine Practice Based on Druid (基于Druid的Kylin存储引擎实践)
> >    https://blog.bcmeng.com/post/kylin-on-druid.html
> >
> >    Best regards,
> >
> >    Shaofeng Shi 史少锋
> >
> >
>

-- 
Best regards,

Shaofeng Shi 史少锋
