Ma Gang created KYLIN-3221:
------------------------------
Summary: Some improvements for lookup table
Key: KYLIN-3221
URL: https://issues.apache.org/jira/browse/KYLIN-3221
Project: Kylin
Issue Type: Improvement
Components: General
Reporter: Ma Gang
Assignee: Ma Gang
The current lookup table design has two limitations:
# Lookup table size is limited: the table snapshot must be cached in the
Kylin server, so a snapshot that is too large can break the server.
# Lookup table snapshot references are stored in every segment of the cube,
so a global snapshot table cannot be supported. A global snapshot table is one
whose updates take effect for all segments.
To resolve these limitations, we have decided to improve the existing lookup
table design. Below is the initial design document; comments and suggestions
are welcome.
h2. Metadata
A new property will be added to CubeDesc to describe how lookup table
snapshots are taken; it can be defined during cube design:
|{{@JsonProperty("snapshot_table_desc_list")}}
{{private List<SnapshotTableDesc> snapshotTableDescList = Collections.emptyList();}}|
SnapshotTableDesc defines how a table snapshot is stored and whether it is
global. Two store types are currently supported:
# "metaStore": the table snapshot is stored in the metadata store, as in the
current design. This is the default option.
# "hbaseStore": the table snapshot is stored in an additional HBase table.
|{{@JsonProperty("table_name")}}
{{private String tableName;}}
{{@JsonProperty("store_type")}}
{{private String snapshotStorageType = "metaStore";}}
{{@JsonProperty("global")}}
{{private boolean global = false;}}|
Add a 'snapshots' property to CubeInstance to store the snapshot resource path
for each table whose snapshot is set to global in the cube design:
|{{@JsonProperty("snapshots")}}
{{private Map<String, String> snapshots; // tableName -> tableResourcePath mapping}}|
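The split between cube-level and segment-level snapshot references can be sketched as follows. This is only an illustration under stated assumptions: the class name, the method name, and the use of plain maps in place of CubeInstance and segment metadata are hypothetical, not part of the proposed Kylin code.

```java
import java.util.Map;

// Hypothetical sketch: picks the snapshot resource path for a lookup table.
final class SnapshotResolutionSketch {

    /**
     * @param tableName        lookup table name
     * @param global           value of the "global" flag from the table's SnapshotTableDesc
     * @param cubeSnapshots    stand-in for the cube-level 'snapshots' map (tableName -> path)
     * @param segmentSnapshots stand-in for the per-segment snapshot references (tableName -> path)
     */
    static String resolve(String tableName,
                          boolean global,
                          Map<String, String> cubeSnapshots,
                          Map<String, String> segmentSnapshots) {
        // A global snapshot is shared by all segments, so it is read from the cube level;
        // otherwise each segment keeps its own reference, as in the existing design.
        Map<String, String> source = global ? cubeSnapshots : segmentSnapshots;
        String path = source.get(tableName);
        if (path == null) {
            throw new IllegalStateException("No snapshot recorded for table " + tableName);
        }
        return path;
    }
}
```

With this shape, refreshing a global snapshot only means replacing one entry in the cube-level map, which is why the update is visible to every segment at once.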
Add a new metadata model, ExtTableSnapshot, to describe the extended table
snapshot. It is stored under a new metastore path,
/ext_table_snapshot/\{tableName}/\{uuid}.snapshot, and includes the
following fields:
|{{@JsonProperty("tableName")}}
{{private String tableName;}}
{{@JsonProperty("signature")}}
{{private TableSignature signature;}}
{{@JsonProperty("storage_location_identifier")}}
{{private String storageLocationIdentifier;}}
{{@JsonProperty("size")}}
{{private long size;}}
{{@JsonProperty("row_cnt")}}
{{private long rowCnt;}}|
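As a small illustration of the metastore path layout described above, the resource path could be assembled like this. The class and method names are hypothetical; only the /ext_table_snapshot/\{tableName}/\{uuid}.snapshot pattern comes from the proposal.

```java
// Hypothetical sketch: builds the metastore resource path of an ExtTableSnapshot.
final class ExtSnapshotPathSketch {

    // Root folder named in the proposal.
    static final String ROOT = "/ext_table_snapshot";

    // Returns /ext_table_snapshot/{tableName}/{uuid}.snapshot
    static String resourcePath(String tableName, String uuid) {
        return ROOT + "/" + tableName + "/" + uuid + ".snapshot";
    }
}
```

Grouping snapshots under the table name means all snapshots of one lookup table can be listed from a single folder.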
Add a new section to the 'Advanced Setting' tab of cube design, where the user
can set table snapshot properties for each table. By default, a snapshot is
segment-level and stored in the metadata store.
h2. Build
If the user specifies the 'hbaseStore' store type for any lookup table, a
MapReduce job converts the Hive source table to HFiles, which are then bulk
loaded into an HTable. This adds two job steps for lookup table
materialization.
h2. HBase Lookup Table Schema
All data is stored as raw values.
Suppose the lookup table has primary keys key1 and key2.
The rowkey will be:
||2 bytes||len1 bytes||2 bytes||len2 bytes||
|key1 value length(len1)|key1 value|key2 value length(len2)|key2 value|
There is one column family, c, with multiple columns; each column name is the
index of the column in the table definition:
|c|
|1|2|...|
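The rowkey layout above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the class name is hypothetical, and the big-endian byte order of the 2-byte length and the UTF-8 encoding of key values are assumptions that the proposal does not specify.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: encodes primary-key values as [2-byte length][raw bytes] pairs.
final class LookupRowKeySketch {

    static byte[] encode(String... keyValues) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String value : keyValues) {
            // Assumption: key values are serialized as UTF-8 bytes.
            byte[] raw = value.getBytes(StandardCharsets.UTF_8);
            if (raw.length > 0xFFFF) {
                throw new IllegalArgumentException("Key value does not fit a 2-byte length");
            }
            // Assumption: the 2-byte length is written high byte first (big-endian).
            out.write((raw.length >>> 8) & 0xFF);
            out.write(raw.length & 0xFF);
            out.write(raw, 0, raw.length);
        }
        return out.toByteArray();
    }
}
```

For example, `LookupRowKeySketch.encode("US", "2018")` yields 10 bytes: 2 + 2 for key1 and 2 + 4 for key2. The length prefixes keep composite keys unambiguous even when a value contains bytes that would otherwise look like a separator.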
h2. Query
For a key lookup query, call the HBase get API directly to fetch the entire
row by key.
For queries that need to fetch keys by derived columns, iterate over all rows
to find the related keys.
For queries that hit only the lookup table, iterate over all rows and let
Calcite do the aggregation and filtering.
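The difference between the first two access paths can be illustrated with an in-memory stand-in. This sketch does not use the HBase client API: the class, its methods, and the map-backed storage are hypothetical and only mirror the point-read versus full-iteration behaviour described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the HBase-backed lookup table.
final class LookupQuerySketch {

    // Primary key -> row values, indexed by column position in the table definition.
    private final Map<String, String[]> rows = new LinkedHashMap<>();

    void put(String key, String... columns) {
        rows.put(key, columns);
    }

    // Key lookup: a single point read, analogous to an HBase get by rowkey.
    String[] getRow(String key) {
        return rows.get(key);
    }

    // Derived-column filter: there is no index, so every row is visited,
    // analogous to iterating over the whole table.
    List<String> keysWhere(int columnIndex, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String[]> entry : rows.entrySet()) {
            if (value.equals(entry.getValue()[columnIndex])) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }
}
```

The cost of `keysWhere` grows with the table size, which is the motivation for the coprocessor and index ideas listed under Future.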
h2. Management
For each lookup table, an admin can view how many snapshots it has in Kylin,
the type and size of each snapshot, and which cubes/segments reference it.
Snapshot tables that have no references can be deleted.
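Finding the snapshots that are safe to delete amounts to a set difference between all stored snapshots and those still referenced. A minimal sketch, assuming snapshots are identified by their resource paths; the class and method names are hypothetical:

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: lists snapshot paths that no cube or segment references.
final class SnapshotCleanupSketch {

    static Set<String> unreferenced(Collection<String> allSnapshotPaths,
                                    Collection<String> referencedPaths) {
        // Copy first so the caller's collection is left untouched.
        Set<String> result = new TreeSet<>(allSnapshotPaths);
        result.removeAll(referencedPaths);
        return result;
    }
}
```

The referenced set would have to include both cube-level (global) and segment-level references, otherwise a snapshot still in use could be reported as deletable.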
h2. Cleanup
When cleaning up the metadata store, snapshots stored in HBase must also be
removed. The metadata store should be cleaned up periodically by a cron job.
h2. Future
# Add a coprocessor for the lookup table to improve the performance of lookup
table queries and of queries that filter by derived columns.
# Add secondary index support for external snapshot tables.