Re: Kylin Cube Performance

hongbin ma Thu, 04 Aug 2016 09:04:47 -0700

If you have a limit of 100, Kylin is SUPPOSED to be far more efficient.
However, there is currently an issue that might cause the limit clause to be
overlooked (https://issues.apache.org/jira/browse/KYLIN-1936).


I'm working on KYLIN-1936; the fix will be in 1.5.4.
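
To illustrate why the limit should help: the log reports "exactAggregation is
true", so in principle each scanned row is already a final result row and the
scan could stop as soon as the limit is reached. Below is a toy Python sketch
of that idea. It is not Kylin's code; scan_cuboid, query_without_pushdown and
query_with_pushdown are made-up names for illustration only.

```python
from itertools import islice


def scan_cuboid(rows, stats):
    """Yield pre-aggregated rows one at a time, counting each row read."""
    for row in rows:
        stats["scanned"] += 1
        yield row


def query_without_pushdown(rows, limit):
    """Read the whole cuboid, then cut the result down to `limit` rows."""
    stats = {"scanned": 0}
    result = list(scan_cuboid(rows, stats))[:limit]
    return result, stats["scanned"]


def query_with_pushdown(rows, limit):
    """Stop pulling rows from the scanner once `limit` rows are collected."""
    stats = {"scanned": 0}
    result = list(islice(scan_cuboid(rows, stats), limit))
    return result, stats["scanned"]


if __name__ == "__main__":
    cuboid = range(100_000)  # stand-in for the rows of one cuboid
    for query in (query_without_pushdown, query_with_pushdown):
        result, scanned = query(cuboid, 100)
        print(query.__name__, "returned", len(result), "scanned", scanned)
```

Both functions return the same rows; only the number of rows read differs.
This shortcut is only valid when no post-aggregation is needed, which is why
the exact cuboid match matters.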

On Thu, Aug 4, 2016 at 11:36 PM, Jason Hale <ja...@koddi.com> wrote:

> True, but even if there's a limit of 100, it still has to scan all records?
> Perhaps I'm just used to how Postgres handles that as it only scans the
> necessary records, not the entire set if it's limited. I can rethink the
> way I approach it if that's the case.
>
> On Thu, Aug 4, 2016 at 10:31 AM, hongbin ma <mahong...@apache.org> wrote:
>
> > Hi Jason
> >
> > As Shaofeng explained, it's not reasonable to expect sub-second latency if
> > you're returning tens of millions of records. Your data model is quite
> > simple and you don't have costly measures like distinct count, so Kylin
> > should be performant on normal OLAP queries.
> >
> > Another piece of advice: if the cardinality of the mandatory dimensions
> > (CHILD_ID and SITE_ID) is very high, you might isolate such dimensions into
> > a separate "aggregation group", so that (1) queries not touching these
> > dimensions can be performant and (2) fewer cuboids are calculated. Please
> > refer to http://kylin.apache.org/blog/2016/02/18/new-aggregation-group/
> >
> > On Thu, Aug 4, 2016 at 11:11 PM, ShaoFeng Shi <shaofeng...@apache.org>
> > wrote:
> >
> > > The log is pretty clear: the cuboid is an exact match, but the scan
> > > count is massive:
> > >
> > > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to 992
> > > Total scan count: 12306477
> > >
> > > Please add a WHERE condition to narrow down the result set as much as
> > > possible. It doesn't make sense for an OLAP query to return millions of
> > > records.
> > >
> > > 2016-08-04 13:05 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > >
> > > > Sure, see kylin.log below:
> > > >
> > > > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > > > controller.QueryController:175 : The original query:  SELECT
> > SUM(clicks)
> > > > FROM hpa_reporting2 GROUP BY site_id, child_id, search_type,
> hotel_id,
> > > > report_date
> > > > 2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
> > > > service.QueryService:266 : The corrected query: SELECT SUM(clicks)
> FROM
> > > > hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
> > > > report_date
> > > > LIMIT 50000
> > > > 2016-08-04 00:47:35,908 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:48
> > > > : The project manager's reference is
> > > > org.apache.kylin.metadata.project.ProjectManager@3a3735a5
> > > > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:60
> > > > : Find candidates by table DEFAULT.HPA_REPORTING2 and
> > project=KODDI_DEV :
> > > > org.apache.kylin.query.routing.Candidate@51ed1b3b
> > > > 2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:49
> > > > : Applying rule: class
> > > > org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
> > > > realizations before: [hpa_reporting2_cube_clone(CUBE)], realizations
> > > > after:
> > > > [hpa_reporting2_cube_clone(CUBE)]
> > > > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:49
> > > > : Applying rule: class
> > > > org.apache.kylin.query.routing.rules.RealizationSortRule,
> realizations
> > > > before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
> > > > [hpa_reporting2_cube_clone(CUBE)]
> > > > 2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7]
> > > routing.QueryRouter:72
> > > > : The realizations remaining: [hpa_reporting2_cube_clone(CUBE)] And
> the
> > > > final chosen one is the first one
> > > > 2016-08-04 00:47:35,975 DEBUG [http-bio-7070-exec-7]
> > > > enumerator.OLAPEnumerator:107 : query storage...
> > > > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > > > v2.CubeStorageQuery:239 : exactAggregation is true
> > > > 2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
> > > > v2.CubeStorageQuery:357 : Enable limit 50000
> > > > 2016-08-04 00:47:35,977 DEBUG [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:257 : New scanner for current segment
> > > > hpa_reporting2_cube_clone[19700101000000_20160828000000] will use
> > > > SCAN_FILTER_AGGR_CHECKMEM as endpoint's behavior
> > > > 2016-08-04 00:47:35,979 DEBUG [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:313 : Serialized scanRequestBytes 836 bytes,
> > > > rawScanBytesString 56 bytes
> > > > 2016-08-04 00:47:35,979 INFO  [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:315 : The scan 31b2dd4c for segment
> > > > hpa_reporting2_cube_clone[19700101000000_20160828000000] is as below
> > with
> > > > 1
> > > > separate raw scans, shard part of start/end key is set to 0
> > > > 2016-08-04 00:47:35,980 INFO  [http-bio-7070-exec-7]
> > v2.CubeHBaseRPC:271
> > > :
> > > > Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992
> to
> > > 992
> > > > Start:
> > > > \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > > > x00\x00\x00\x00\x00
> > > > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\
> > > > x00\x00\x00\x00\x00)
> > > > Stop:
> > > >  \x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > > > xFF\xFF\xFF\xFF\xFF\x00
> > > > (\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\
> > > > xFF\xFF\xFF\xFF\xFF\x00),
> > > > No Fuzzy Key
> > > > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:320 : Submitting rpc to 1 shards starting
> from
> > > > shard 2, scan range count 1
> > > > 2016-08-04 00:47:35,981 INFO  [http-bio-7070-exec-7]
> > > > v2.CubeHBaseEndpointRPC:103 : Timeout for ExpectedSizeIterator is:
> > 99000
> > > > 2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
> > > > enumerator.OLAPEnumerator:127 : return TupleIterator...
> > > > 2016-08-04 00:47:52,773 INFO  [pool-6-thread-1]
> > > v2.CubeHBaseEndpointRPC:351
> > > > : <sub-thread for GTScanRequest 31b2dd4c> Endpoint RPC returned from
> > > HTable
> > > > KYLIN_RIK9O18H07 Shard
> > > > \x4B\x59\x4C\x49\x4E\x5F\x52\x49\x4B\x39\x4F\x31\x38\x48\
> > > > x30\x37\x2C\x00\x02\x2C\x31\x34\x37\x30\x31\x35\x35\x33\
> > > > x31\x34\x39\x33\x37\x2E\x61\x33\x61\x35\x34\x37\x39\x61\
> > > > x32\x63\x37\x61\x61\x64\x30\x36\x33\x66\x30\x33\x64\x63\
> > > > x34\x65\x31\x30\x36\x33\x61\x33\x61\x37\x2E
> > > > on host: ip-10-0-0-157.ec2.internal.Total scanned row: 12306477.
> Total
> > > > filtered/aggred row: 0. Time elapsed in EP: 16562(ms). Server CPU
> > usage:
> > > > 0.24348086721950246, server physical mem left: 7.195234304E9, server
> > swap
> > > > mem left:0.0.Etc message: start latency: 15@1,agg done@13760
> ,compress
> > > > done@16562,server stats done@16562,
> > > > debugGitTag:cf4d2940b67d622eacd2ac9a913b221091a35c2e;.Normal
> Complete:
> > > > true.
> > > > 2016-08-04 00:47:54,068 DEBUG [pool-6-thread-1]
> > util.CompressionUtils:67
> > > :
> > > > Original: 46465726 bytes. Decompressed: 150553629 bytes. Time: 1294
> > > > 2016-08-04 00:48:29,303 INFO  [pool-4-thread-1]
> > > > threadpool.DefaultScheduler:106 : Job Fetcher: 0 running, 0 actual
> > > > running,
> > > > 0 ready, 12 others
> > > > 2016-08-04 00:48:31,990 INFO  [http-bio-7070-exec-7]
> > > > service.QueryService:399 : Scan count for each storageContext:
> > 12306477,
> > > > 2016-08-04 00:48:31,991 INFO  [http-bio-7070-exec-7]
> > > > controller.QueryController:197 : Stats of SQL response: isException:
> > > false,
> > > > duration: 56152, total scan count 12306477
> > > > 2016-08-04 00:48:32,000 WARN  [http-bio-7070-exec-7]
> > > > sizeof.ObjectGraphWalker:209 : The configured limit of 1,000 object
> > > > references was reached while attempting to calculate the size of the
> > > object
> > > > graph. Severe performance degradation could occur if the sizing
> > operation
> > > > continues. This can be avoided by setting the CacheManger or Cache
> > > > <sizeOfPolicy> elements maxDepthExceededBehavior to "abort" or adding
> > > stop
> > > > points with @IgnoreSizeOf annotations. If performance degradation is
> > NOT
> > > an
> > > > issue at the configured limit, raise the limit value using the
> > > CacheManager
> > > > or Cache <sizeOfPolicy> elements maxDepth attribute. For more
> > > information,
> > > > see the Ehcache configuration documentation.
> > > > 2016-08-04 00:48:32,091 INFO  [http-bio-7070-exec-7]
> > > > service.QueryService:250 :
> > > > ==========================[QUERY]===============================
> > > > SQL: SELECT SUM(clicks) FROM hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id, report_date
> > > > User: ADMIN
> > > > Success: true
> > > > Duration: 56.152
> > > > Project: koddi_dev
> > > > Realization Names: [hpa_reporting2_cube_clone]
> > > > Cuboid Ids: [992]
> > > > Total scan count: 12306477
> > > > Result row count: 50000
> > > > Accept Partial: true
> > > > Is Partial Result: false
> > > > Hit Exception Cache: false
> > > > Storage cache used: false
> > > > Message: null
> > > > ==========================[QUERY]===============================
> > > >
> > > > On Wed, Aug 3, 2016 at 8:38 PM, ShaoFeng Shi <shaofeng...@apache.org
> >
> > > > wrote:
> > > >
> > > > > Hi Jason, could you please provide the full log from sending the query
> > > > > to getting the result back? The key information is which cuboid is
> > > > > used for the query, whether it is a cuboid exact match or a fuzzy
> > > > > match, how many records are scanned, and how long it took. Thanks.
> > > > >
> > > > > 2016-08-03 23:19 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > >
> > > > > > Yes, it would have to do post-aggregation in that case, but the
> > > > > > strange thing is that query was running fast (about 1 second), while
> > > > > > queries with more dimensions are much slower, such as "SELECT
> > > > > > SUM(clicks) FROM reporting GROUP BY site_id, child_id, report_date,
> > > > > > hotel_id". This query takes about 106 seconds, but it shouldn't need
> > > > > > to do any post-aggregation, so I would think it should return much
> > > > > > quicker than that from the respective cuboid.
> > > > > > Here's the explain plan:
> > > > > > OLAPToEnumerableConverter
> > > > > > OLAPProjectRel(EXPR$0=[$4])
> > > > > > OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> > > > > > OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0], HOTEL_ID=[$2], CLICKS=[$10])
> > > > > > OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
> > > > > >
> > > > > > On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <
> > shaofeng...@apache.org
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > In the cube definition, you defined "SITE_ID" and "CHILD_ID" as
> > > > > > > "Mandatory" dimensions, which means they will not be aggregated
> > > > > > > away in any combination during the cube build phase.
> > > > > > >
> > > > > > > So when you run a query like "SELECT SUM(clicks) FROM reporting
> > > > > > > GROUP BY search_type", Kylin will use the combination "SITE_ID" +
> > > > > > > "CHILD_ID" + "SEARCH_TYPE" to serve it, and there will be
> > > > > > > post-aggregation at runtime. The performance depends heavily on
> > > > > > > the cardinality of "SITE_ID" and "CHILD_ID".
> > > > > > >
> > > > > > >
> > > > > > > 2016-08-02 23:08 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > > >
> > > > > > > > I've looked over the optimization options before, but did not
> > > > > > > > notice the rowkey ordering. I can try this and see if this helps
> > > > > > > > me. This is the only thing I see that I can attempt to optimize
> > > > > > > > further in the design, but I'll provide my cube design below. I
> > > > > > > > only have one measure to keep it simple:
> > > > > > > >
> > > > > > > > {
> > > > > > > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > > > > > > >   "version": "1.5.2",
> > > > > > > >   "name": "hpa_reporting2_cube",
> > > > > > > >   "description": "",
> > > > > > > >   "dimensions": [
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "REPORT_DATE",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "SEARCH_TYPE",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "HOTEL_ID",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "CHILD_ID",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "COUNTRY",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "DEVICE_TYPE",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "STAY_LENGTH",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "TRUE_RANK_AG",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "ROOM_BUNDLE",
> > > > > > > >       "derived": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > > > > > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > > > > > >       "column": "SITE_ID",
> > > > > > > >       "derived": null
> > > > > > > >     }
> > > > > > > >   ],
> > > > > > > >   "measures": [
> > > > > > > >     {
> > > > > > > >       "name": "_COUNT_",
> > > > > > > >       "function": {
> > > > > > > >         "expression": "COUNT",
> > > > > > > >         "parameter": {
> > > > > > > >           "type": "constant",
> > > > > > > >           "value": "1",
> > > > > > > >           "next_parameter": null
> > > > > > > >         },
> > > > > > > >         "returntype": "bigint"
> > > > > > > >       },
> > > > > > > >       "dependent_measure_ref": null
> > > > > > > >     },
> > > > > > > >     {
> > > > > > > >       "name": "CLICKS",
> > > > > > > >       "function": {
> > > > > > > >         "expression": "SUM",
> > > > > > > >         "parameter": {
> > > > > > > >           "type": "column",
> > > > > > > >           "value": "CLICKS",
> > > > > > > >           "next_parameter": null
> > > > > > > >         },
> > > > > > > >         "returntype": "decimal"
> > > > > > > >       },
> > > > > > > >       "dependent_measure_ref": null
> > > > > > > >     }
> > > > > > > >   ],
> > > > > > > >   "rowkey": {
> > > > > > > >     "rowkey_columns": [
> > > > > > > >       {
> > > > > > > >         "column": "REPORT_DATE",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "SEARCH_TYPE",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "HOTEL_ID",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "CHILD_ID",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "COUNTRY",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "DEVICE_TYPE",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "STAY_LENGTH",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "TRUE_RANK_AG",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "ROOM_BUNDLE",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       },
> > > > > > > >       {
> > > > > > > >         "column": "SITE_ID",
> > > > > > > >         "encoding": "dict",
> > > > > > > >         "isShardBy": false
> > > > > > > >       }
> > > > > > > >     ]
> > > > > > > >   },
> > > > > > > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > > > > > > >   "last_modified": 1470070273935,
> > > > > > > >   "model_name": "hpa_reporting_model2",
> > > > > > > >   "null_string": null,
> > > > > > > >   "hbase_mapping": {
> > > > > > > >     "column_family": [
> > > > > > > >       {
> > > > > > > >         "name": "F1",
> > > > > > > >         "columns": [
> > > > > > > >           {
> > > > > > > >             "qualifier": "M",
> > > > > > > >             "measure_refs": [
> > > > > > > >               "_COUNT_",
> > > > > > > >               "CLICKS"
> > > > > > > >             ]
> > > > > > > >           }
> > > > > > > >         ]
> > > > > > > >       }
> > > > > > > >     ]
> > > > > > > >   },
> > > > > > > >   "aggregation_groups": [
> > > > > > > >     {
> > > > > > > >       "includes": [
> > > > > > > >         "REPORT_DATE",
> > > > > > > >         "SEARCH_TYPE",
> > > > > > > >         "HOTEL_ID",
> > > > > > > >         "CHILD_ID",
> > > > > > > >         "COUNTRY",
> > > > > > > >         "DEVICE_TYPE",
> > > > > > > >         "STAY_LENGTH",
> > > > > > > >         "TRUE_RANK_AG",
> > > > > > > >         "ROOM_BUNDLE",
> > > > > > > >         "SITE_ID"
> > > > > > > >       ],
> > > > > > > >       "select_rule": {
> > > > > > > >         "hierarchy_dims": [],
> > > > > > > >         "mandatory_dims": [
> > > > > > > >           "SITE_ID",
> > > > > > > >           "CHILD_ID"
> > > > > > > >         ],
> > > > > > > >         "joint_dims": [
> > > > > > > >           [
> > > > > > > >             "ROOM_BUNDLE",
> > > > > > > >             "TRUE_RANK_AG"
> > > > > > > >           ]
> > > > > > > >         ]
> > > > > > > >       }
> > > > > > > >     }
> > > > > > > >   ],
> > > > > > > >   "notify_list": [],
> > > > > > > >   "status_need_notify": [
> > > > > > > >     "ERROR",
> > > > > > > >     "DISCARDED",
> > > > > > > >     "SUCCEED"
> > > > > > > >   ],
> > > > > > > >   "partition_date_start": 0,
> > > > > > > >   "partition_date_end": 3153600000000,
> > > > > > > >   "auto_merge_time_ranges": [
> > > > > > > >     604800000,
> > > > > > > >     2419200000
> > > > > > > >   ],
> > > > > > > >   "retention_range": 0,
> > > > > > > >   "engine_type": 2,
> > > > > > > >   "storage_type": 2,
> > > > > > > >   "override_kylin_properties": {}
> > > > > > > > }
> > > > > > > >
> > > > > > > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <
> > > > shaofeng...@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > As Yiming mentioned, the cube design matters for the
> > > > > > > > > performance of both build and query; please check "Optimize
> > > > > > > > > Cube" on the documentation web page and do as much
> > > > > > > > > optimization as possible.
> > > > > > > > >
> > > > > > > > > Besides, the cluster's capacity and Hadoop configuration are
> > > > > > > > > also important factors. Try to identify the bottleneck and
> > > > > > > > > then optimize or add capacity.
> > > > > > > > >
> > > > > > > > > From 1.5, Kylin ships with two cubing algorithms. The "Build
> > > > > > > > > N-Dimension Cuboid" steps are the legacy "Layered" cubing
> > > > > > > > > algorithm; they will be skipped when Kylin selects the new
> > > > > > > > > "Fast" cubing algorithm, which is the "Build Cube" step after
> > > > > > > > > them. Please click the Hadoop link in that step to inspect
> > > > > > > > > the MR job's statistics.
> > > > > > > > >
> > > > > > > > > Hope this helps to some extent.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <
> > liuyiming....@gmail.com
> > > >:
> > > > > > > > >
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > Cube design is the key to performance in Kylin, not only
> > > > > > > > > > for queries but also for the cube building process. How
> > > > > > > > > > you select dimensions, define the relationships between
> > > > > > > > > > dimensions, select the encoding method, define measures,
> > > > > > > > > > and even choose the HBase key order will have a
> > > > > > > > > > significant impact on performance. There are quite a few
> > > > > > > > > > wonderful documents introducing how to do this:
> > > > > > > > > > http://kylin.apache.org/docs15/ .
> > > > > > > > > >
> > > > > > > > > > One more thing: if you could share your cube design, it
> > > > > > > > > > would be easier to get help here.
> > > > > > > > > >
> > > > > > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <ja...@koddi.com>:
> > > > > > > > > >
> > > > > > > > > > > I'm setting up a test case for a portion of our dataset
> > > > > > > > > > > to evaluate Kylin, and I'm not seeing the performance
> > > > > > > > > > > that I would expect.
> > > > > > > > > > >
> > > > > > > > > > > The cube building process is taking about 5-6 hours with
> > > > > > > > > > > ~69,000,000 records and 10 dimensions. I'm not sure if
> > > > > > > > > > > that's the expected build time, but the other problem is
> > > > > > > > > > > the query performance after building the cube.
> > > > > > > > > > >
> > > > > > > > > > > All queries were tested with a very simple query (e.g.
> > > > > > > > > > > SELECT SUM(clicks) FROM reporting GROUP BY search_type).
> > > > > > > > > > >
> > > > > > > > > > > Grouping by 1 or 2 dimensions gives me very responsive
> > > > > > > > > > > queries (under 2 seconds), but adding more dimensions
> > > > > > > > > > > drastically increases the query time (over 1 minute and
> > > > > > > > > > > it times out through hbase). I would expect these
> > > > > > > > > > > queries to have all similar query times since they
> > > > > > > > > > > should query the respective cuboid, so I'm not sure why
> > > > > > > > > > > the performance would suffer. I didn't set up any
> > > > > > > > > > > special rules for the cube, but during the build process
> > > > > > > > > > > it showed all the N-dimension cubes and the log simply
> > > > > > > > > > > said 'skipped'.
> > > > > > > > > > >
> > > > > > > > > > > Is there something I'm missing in the configuration?
> > > > > > > > > > >
> > > > > > > > > > > I have a HDP cluster with 3 nodes and 1 client node on
> > > > > > > > > > > which Kylin is installed. Do I need to adjust the hadoop
> > > > > > > > > > > configuration? I'm using most of the default HDP
> > > > > > > > > > > settings.
> > > > > > > > > > >
> > > > > > > > > > > What more information can I provide?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > With Warm regards
> > > > > > > > > >
> > > > > > > > > > Yiming Liu (刘一鸣)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Shaofeng Shi
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Shaofeng Shi
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > > Shaofeng Shi
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > Shaofeng Shi
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
