In the cube definition, you defined "SITE_ID", "CHILD_ID" as "Mandatory" dimension, which means they will not be aggregated in cube build phase for all combinations.
So when you run a query like "SELECT SUM(clicks) FROM reporting GROUP BY search_type", Kylin will use the combination "SITE_ID" + "CHILD_ID" + "SEARCH_TYPE" to serve, there will be post-aggregation in runtime; The performance is much depent on the cardinality of "SITE_ID" and "CHILD_ID". 2016-08-02 23:08 GMT+08:00 Jason Hale <[email protected]>: > I've looked over the optimization options before, but did not notice the > rowkey ordering. I can try this and see if this helps me. This is the only > thing I see that I can attempt to optimize further in the design, but I'll > provide my cube design below. I only have one measure to keep it simple: > > { > "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030", > "version": "1.5.2", > "name": "hpa_reporting2_cube", > "description": "", > "dimensions": [ > { > "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE", > "table": "DEFAULT.HPA_REPORTING2", > "column": "REPORT_DATE", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE", > "table": "DEFAULT.HPA_REPORTING2", > "column": "SEARCH_TYPE", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID", > "table": "DEFAULT.HPA_REPORTING2", > "column": "HOTEL_ID", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.CHILD_ID", > "table": "DEFAULT.HPA_REPORTING2", > "column": "CHILD_ID", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.COUNTRY", > "table": "DEFAULT.HPA_REPORTING2", > "column": "COUNTRY", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE", > "table": "DEFAULT.HPA_REPORTING2", > "column": "DEVICE_TYPE", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH", > "table": "DEFAULT.HPA_REPORTING2", > "column": "STAY_LENGTH", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG", > "table": "DEFAULT.HPA_REPORTING2", > "column": "TRUE_RANK_AG", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE", > "table": "DEFAULT.HPA_REPORTING2", > "column": "ROOM_BUNDLE", > "derived": null > }, > { > "name": "DEFAULT.HPA_REPORTING2.SITE_ID", > "table": "DEFAULT.HPA_REPORTING2", > "column": "SITE_ID", > "derived": null > } > ], > "measures": [ > { > "name": "_COUNT_", > "function": { > "expression": "COUNT", > "parameter": { > "type": "constant", > "value": "1", > "next_parameter": null > }, > "returntype": "bigint" > }, > "dependent_measure_ref": null > }, > { > "name": "CLICKS", > "function": { > "expression": "SUM", > "parameter": { > "type": "column", > "value": "CLICKS", > "next_parameter": null > }, > "returntype": "decimal" > }, > "dependent_measure_ref": null > } > ], > "rowkey": { > "rowkey_columns": [ > { > "column": "REPORT_DATE", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "SEARCH_TYPE", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "HOTEL_ID", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "CHILD_ID", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "COUNTRY", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "DEVICE_TYPE", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "STAY_LENGTH", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "TRUE_RANK_AG", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "ROOM_BUNDLE", > "encoding": "dict", > "isShardBy": false > }, > { > "column": "SITE_ID", > "encoding": "dict", > "isShardBy": false > } > ] > }, > "signature": "KixlKWxevr6mO+UlSaR5ig==", > "last_modified": 1470070273935, > "model_name": "hpa_reporting_model2", > "null_string": null, > "hbase_mapping": { > "column_family": [ > { > "name": "F1", > "columns": [ > { > "qualifier": "M", > "measure_refs": [ > "_COUNT_", > "CLICKS" > ] > } > ] > } > ] > }, > "aggregation_groups": [ > { > "includes": [ > "REPORT_DATE", > "SEARCH_TYPE", > "HOTEL_ID", > "CHILD_ID", > "COUNTRY", > "DEVICE_TYPE", > "STAY_LENGTH", > "TRUE_RANK_AG", > "ROOM_BUNDLE", > "SITE_ID" > ], > "select_rule": { > "hierarchy_dims": [], > "mandatory_dims": [ > "SITE_ID", > "CHILD_ID" > ], > "joint_dims": [ > [ > "ROOM_BUNDLE", > "TRUE_RANK_AG" > ] > ] > } > } > ], > "notify_list": [], > "status_need_notify": [ > "ERROR", > "DISCARDED", > "SUCCEED" > ], > "partition_date_start": 0, > "partition_date_end": 3153600000000, > "auto_merge_time_ranges": [ > 604800000, > 2419200000 > ], > "retention_range": 0, > "engine_type": 2, > "storage_type": 2, > "override_kylin_properties": {} > } > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <[email protected]> > wrote: > > > Hi Jason, > > > > As Yiming mentioned, the cube design matters for the performance of both > > build and query; please check "Optimize Cube" in the document web page > and > > do optimizaiton as much as possible; > > > > Besides, the cluster's capacity and Hadoop configuration is also an > > important factor; Try to identify the bottleneck and then optimize or add > > capacity. > > > > From 1.5 Kylin ships with two cubing algorithm; the steps "Build > > N-Dimension Cuboid" are the legacy "Layered" cubing algorithm; They will > be > > skipped when Kylin selects to use the new "Fast" cubing algorithm, which > is > > the "Build Cube" step after them. Please click the hadoop link in that > step > > to inspect the MR job's statistics; > > > > Hope this helps to some extend; > > > > > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <[email protected]>: > > > > > Hi Jason, > > > > > > Cube design is the performance key for Kylin, not only query, but also > > cube > > > building process. How to select dimensions, how to define the > > relationship > > > between dimensions, how to select encode method, how to define measure, > > > even how to choose the Hbase key order will have a significant impact > on > > > performance. There are quite a few wonderful documents introducing how > > to > > > do this, http://kylin.apache.org/docs15/ . > > > > > > One more thing, if you could share your cube design, you would get help > > > easier here. > > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <[email protected]>: > > > > > > > I'm setting up a test case for a portion of our dataset, to evaluate > > > Kylin > > > > and I'm not seeing the performance that I would expect. > > > > > > > > The cube building process is taking about 5-6 hours with ~69,000,000 > > > > records and 10 dimensions. I'm not sure if that's the expected build > > > time, > > > > but the other problem is the query performance after building the > cube. > > > > > > > > All queries were tested with a very simple query (e.g. SELECT > > SUM(clicks) > > > > FROM reporting GROUP BY search_type) > > > > > > > > Grouping by 1 or 2 dimensions gives me very responsive queries > (under 2 > > > > seconds), but adding more dimensions drastically increases the query > > time > > > > (over 1 minute and it times out through hbase). I would expect these > > > > queries to have all similar query times since they should query the > > > > respective cuboid, so I'm not sure why the performance would suffer. > I > > > > didn't set up any special rules for the cube, but during the build > > > process > > > > it showed all the N-dimension cubes and the log simply said > 'skipped'. > > > > > > > > Is there something I'm missing in the configuration? > > > > > > > > I have a HDP cluster with 3 nodes and 1 client node on which Kylin is > > > > installed. Do I need to adjust the hadoop configuration. I'm using > most > > > of > > > > the default HDP settings. > > > > > > > > What more information can I provide? > > > > > > > > > > > > > > > > -- > > > With Warm regards > > > > > > Yiming Liu (刘一鸣) > > > > > > > > > > > -- > > Best regards, > > > > Shaofeng Shi > > > -- Best regards, Shaofeng Shi
