[
https://issues.apache.org/jira/browse/KYLIN-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shaofeng SHI updated KYLIN-3221:
--------------------------------
Summary: Allow externalizing lookup table snapshot (was: Some improvements
for lookup table )
> Allow externalizing lookup table snapshot
> -----------------------------------------
>
> Key: KYLIN-3221
> URL: https://issues.apache.org/jira/browse/KYLIN-3221
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine, Metadata, Query Engine
> Reporter: Ma Gang
> Assignee: Ma Gang
> Priority: Major
> Fix For: v2.4.0
>
> Attachments: KYLIN-3221-web-error.png
>
>
> There are two limitations in the current lookup table design:
> # Lookup table size is limited, because the table snapshot needs to be cached in
> the Kylin server; a snapshot table that is too large will break the server.
> # Lookup table snapshot references are stored in every segment of the cube, so
> a global snapshot table cannot be supported. A global snapshot table means that
> when the lookup table is updated, the update takes effect for all segments.
> To resolve these limitations, we have decided to make some improvements to the
> existing lookup table design. Below is the initial design document; comments and
> suggestions are welcome.
> h2. Metadata
> A new property will be added to CubeDesc to describe how lookup table
> snapshots are taken; it can be defined during cube design:
> |@JsonProperty("snapshot_table_desc_list")
> private List<SnapshotTableDesc> snapshotTableDescList = Collections.emptyList();|
> SnapshotTableDesc defines how the table is stored and whether it is global or
> not. Two store types are currently supported:
> # "metaStore": the table snapshot is stored in the metadata store, the
> same as in the current design; this is the default option.
> # "hbaseStore": the table snapshot is stored in an additional HBase table.
> |@JsonProperty("table_name")
> private String tableName;
>
> @JsonProperty("store_type")
> private String snapshotStorageType = "metaStore";
>
> @JsonProperty("local_cache_enable")
> private boolean enableLocalCache = true;
>
> @JsonProperty("global")
> private boolean global = false;|
>
> Add a 'snapshots' property to CubeInstance to store the snapshot resource path
> for each table whose snapshot is set to global in the cube design:
> |@JsonProperty("snapshots")
> private Map<String, String> snapshots; // tableName -> tableResourcePath mapping|
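As a rough sketch of how the two levels of snapshot reference could be resolved under this proposal: a table marked global would be looked up in the cube-level 'snapshots' map, and any other table in the segment's own references. The class name SnapshotResolver, the method resolve, and the shape of the per-segment map are assumptions made for illustration; they are not Kylin APIs.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Kylin code: choose the snapshot resource path for a lookup table.
class SnapshotResolver {
    // globalTables:     tables whose SnapshotTableDesc has global = true (assumed input)
    // cubeSnapshots:    the cube-level "snapshots" map (tableName -> resource path)
    // segmentSnapshots: the snapshot references held by one segment (assumed shape)
    static String resolve(String tableName,
                          Set<String> globalTables,
                          Map<String, String> cubeSnapshots,
                          Map<String, String> segmentSnapshots) {
        // Global tables share one snapshot per cube; others keep one per segment.
        Map<String, String> source =
                globalTables.contains(tableName) ? cubeSnapshots : segmentSnapshots;
        String path = source.get(tableName);
        if (path == null) {
            throw new IllegalStateException("No snapshot found for table " + tableName);
        }
        return path;
    }
}
```

With this shape, refreshing a global table only needs to replace one entry in the cube-level map, which is what makes the update visible to all segments at once.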
>
> Add a new meta model, ExtTableSnapshot, to describe the extended table snapshot
> information. It is stored under a new metastore path,
> /ext_table_snapshot/\{tableName}/\{uuid}.snapshot, and the metadata includes the
> following fields:
> |@JsonProperty("tableName")
> private String tableName;
>
> @JsonProperty("signature")
> private TableSignature signature;
>
> @JsonProperty("storage_location_identifier")
> private String storageLocationIdentifier;
>
> @JsonProperty("key_columns")
> private String[] keyColumns; // the key columns of the table
>
> @JsonProperty("storage_type")
> private String storageType;
>
> @JsonProperty("size")
> private long size;
>
> @JsonProperty("row_cnt")
> private long rowCnt;|
>
> Add a new section to the 'Advanced Setting' tab of cube design, where the user can set
> table snapshot properties for each table. By default, a snapshot is segment level
> and stored in the metadata store.
> h2. Build
> If the user specifies the 'hbaseStore' storage type for any lookup table, a
> MapReduce job converts the Hive source table to HFiles, which are then bulk loaded
> into an HTable. This adds two job steps for lookup table
> materialization.
> h2. HBase Lookup Table Schema
> All data is stored as raw values.
> Suppose the lookup table has the primary keys key1 and key2.
> The rowkey will be:
> ||2 bytes||2 bytes||len1 bytes||2 bytes||len2 bytes||
> |shard|key1 value length (len1)|key1 value|key2 value length (len2)|key2 value|
> The first 2 bytes are the shard number. The HBase table can be pre-split; the shard
> size is configurable through the Kylin property
> "kylin.snapshot.ext.shard-mb", with a default size of 500MB.
> There is one column family, c, with multiple columns; each column name is the index of the
> column in the table definition:
> |c|
> |1|2|...|
>
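The rowkey layout above can be sketched as follows. This is only an illustration under stated assumptions: the proposal does not say how the shard number is derived or how raw values are serialized, so the hash-based shardOf, the UTF-8 encoding of key values, and the ceiling-division shardCount are all hypothetical, as are the class and method names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of the layout: [shard(2)][len1(2)][key1][len2(2)][key2]...
class LookupRowKey {
    // Assumption: shard count = ceil(table size / shard size), at least 1.
    static int shardCount(long tableSizeMB, long shardMB) {
        return (int) Math.max(1, (tableSizeMB + shardMB - 1) / shardMB);
    }

    // Assumption: the shard is a hash of the key bytes modulo the shard count.
    static short shardOf(byte[][] keys, int shardCount) {
        int h = 1;
        for (byte[] k : keys) {
            h = 31 * h + Arrays.hashCode(k);
        }
        return (short) Math.floorMod(h, shardCount);
    }

    // Assumption: key values are serialized as UTF-8 bytes.
    static byte[] encode(int shardCount, String... keyValues) {
        byte[][] keys = new byte[keyValues.length][];
        int size = 2; // shard prefix
        for (int i = 0; i < keyValues.length; i++) {
            keys[i] = keyValues[i].getBytes(StandardCharsets.UTF_8);
            if (keys[i].length > Short.MAX_VALUE) {
                throw new IllegalArgumentException("Key value too long for a 2-byte length");
            }
            size += 2 + keys[i].length; // 2-byte length + value
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putShort(shardOf(keys, shardCount));
        for (byte[] k : keys) {
            buf.putShort((short) k.length);
            buf.put(k);
        }
        return buf.array();
    }
}
```

Because the shard is a fixed-width prefix, a point lookup can recompute it from the key values and issue a single get, while the pre-split regions spread rows across the table.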
> h2. Query
> For a key lookup query, directly call the HBase get API to fetch the entire row by
> key (using the local cache if it is enabled).
> For queries that need to fetch keys by derived columns, iterate over
> all rows to find the related keys (using the local cache if it is
> enabled).
> For queries that hit only the lookup table, iterate over all rows and let Calcite
> do the aggregation and filtering (using the local cache if it is
> enabled).
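The key lookup path with an optional local cache might look roughly like this. It is a minimal sketch, not the proposed implementation: the remote store is represented by a plain Function standing in for an HBase get, the cache is a simple access-ordered LinkedHashMap, and every name here (CachedLookup, getRow, remoteCalls) is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: consult a local LRU cache before falling back to the remote store.
class CachedLookup {
    private final boolean enableLocalCache;              // mirrors "local_cache_enable"
    private final Function<String, String[]> remoteGet;  // stands in for an HBase get
    private final Map<String, String[]> cache;
    int remoteCalls = 0;                                  // exposed only to show cache hits

    CachedLookup(boolean enableLocalCache, final int maxEntries,
                 Function<String, String[]> remoteGet) {
        this.enableLocalCache = enableLocalCache;
        this.remoteGet = remoteGet;
        // An access-ordered LinkedHashMap drops the least recently used entry when full.
        this.cache = new LinkedHashMap<String, String[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    String[] getRow(String key) {
        if (enableLocalCache) {
            String[] hit = cache.get(key);
            if (hit != null) {
                return hit;
            }
        }
        remoteCalls++;
        String[] row = remoteGet.apply(key);
        if (enableLocalCache && row != null) {
            cache.put(key, row);
        }
        return row;
    }
}
```

This sketch is not thread-safe; a query engine serving concurrent requests would need a synchronized or concurrent cache instead.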
> h2. Management
> For each lookup table, an admin can view how many snapshots it has in Kylin,
> view the type and size of each snapshot, and see which cubes/segments reference
> the snapshot. Snapshot tables that have no references can be
> deleted.
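Finding the snapshots that are safe to delete amounts to a set difference between all known snapshot paths and the paths still referenced by cubes or segments. The sketch below assumes both collections have already been gathered from metadata; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: list snapshot paths that no cube or segment references.
class SnapshotCleanup {
    static List<String> unreferenced(Collection<String> allSnapshotPaths,
                                     Collection<String> referencedPaths) {
        Set<String> referenced = new HashSet<>(referencedPaths);
        List<String> result = new ArrayList<>();
        for (String path : allSnapshotPaths) {
            if (!referenced.contains(path)) {
                result.add(path); // candidate for deletion
            }
        }
        return result;
    }
}
```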
> Add a new 'Lookup Refresh' action button for each cube. Clicking the
> button opens a dialog that lets the user choose which lookup tables to
> refresh; if a table is not set to global, the user can choose some or all of the
> segments whose snapshots need to be refreshed. The user can then click
> 'submit' to submit a new job that builds the table snapshot independently.
> h2. Cleanup
> When cleaning up the metadata store, snapshots stored in HBase need to be removed as well.
> The metadata store also needs to be cleaned up periodically by a cron job.
> h2. Future
> # Add a coprocessor for lookup tables to improve the performance of lookup
> table queries and of queries that filter by derived columns.
> # Add secondary index support for external snapshot tables.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)