Re: Kylin Cube Performance

Jason Hale Wed, 03 Aug 2016 22:06:47 -0700

Sure, see kylin.log below:

2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
controller.QueryController:175 : The original query:  SELECT SUM(clicks)
FROM hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
report_date
2016-08-04 00:47:35,839 INFO  [http-bio-7070-exec-7]
service.QueryService:266 : The corrected query: SELECT SUM(clicks) FROM
hpa_reporting2 GROUP BY site_id, child_id, search_type, hotel_id,
report_date
LIMIT 50000
2016-08-04 00:47:35,908 INFO  [http-bio-7070-exec-7] routing.QueryRouter:48
: The project manager's reference is
org.apache.kylin.metadata.project.ProjectManager@3a3735a5
2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7] routing.QueryRouter:60
: Find candidates by table DEFAULT.HPA_REPORTING2 and project=KODDI_DEV :
org.apache.kylin.query.routing.Candidate@51ed1b3b
2016-08-04 00:47:35,909 INFO  [http-bio-7070-exec-7] routing.QueryRouter:49
: Applying rule: class
org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule,
realizations before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
[hpa_reporting2_cube_clone(CUBE)]
2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7] routing.QueryRouter:49
: Applying rule: class
org.apache.kylin.query.routing.rules.RealizationSortRule, realizations
before: [hpa_reporting2_cube_clone(CUBE)], realizations after:
[hpa_reporting2_cube_clone(CUBE)]
2016-08-04 00:47:35,910 INFO  [http-bio-7070-exec-7] routing.QueryRouter:72
: The realizations remaining: [hpa_reporting2_cube_clone(CUBE)] And the
final chosen one is the first one
2016-08-04 00:47:35,975 DEBUG [http-bio-7070-exec-7]
enumerator.OLAPEnumerator:107 : query storage...
2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
v2.CubeStorageQuery:239 : exactAggregation is true
2016-08-04 00:47:35,976 INFO  [http-bio-7070-exec-7]
v2.CubeStorageQuery:357 : Enable limit 50000
2016-08-04 00:47:35,977 DEBUG [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:257 : New scanner for current segment
hpa_reporting2_cube_clone[19700101000000_20160828000000] will use
SCAN_FILTER_AGGR_CHECKMEM as endpoint's behavior
2016-08-04 00:47:35,979 DEBUG [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:313 : Serialized scanRequestBytes 836 bytes,
rawScanBytesString 56 bytes
2016-08-04 00:47:35,979 INFO  [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:315 : The scan 31b2dd4c for segment
hpa_reporting2_cube_clone[19700101000000_20160828000000] is as below with 1
separate raw scans, shard part of start/end key is set to 0
2016-08-04 00:47:35,980 INFO  [http-bio-7070-exec-7] v2.CubeHBaseRPC:271 :
Visiting hbase table KYLIN_RIK9O18H07: cuboid exact match, from 992 to 992
Start:
\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\x00\x00\x00\x00\x00
(\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\x00\x00\x00\x00\x00\x00\x00\x00\x00)
Stop:
 
\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00
(\x00\x00\x00\x00\x00\x00\x00\x00\x03\xE0\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00),
No Fuzzy Key
2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:320 : Submitting rpc to 1 shards starting from
shard 2, scan range count 1
2016-08-04 00:47:35,981 INFO  [http-bio-7070-exec-7]
v2.CubeHBaseEndpointRPC:103 : Timeout for ExpectedSizeIterator is: 99000
2016-08-04 00:47:35,981 DEBUG [http-bio-7070-exec-7]
enumerator.OLAPEnumerator:127 : return TupleIterator...
2016-08-04 00:47:52,773 INFO  [pool-6-thread-1] v2.CubeHBaseEndpointRPC:351
: <sub-thread for GTScanRequest 31b2dd4c> Endpoint RPC returned from HTable
KYLIN_RIK9O18H07 Shard
\x4B\x59\x4C\x49\x4E\x5F\x52\x49\x4B\x39\x4F\x31\x38\x48\x30\x37\x2C\x00\x02\x2C\x31\x34\x37\x30\x31\x35\x35\x33\x31\x34\x39\x33\x37\x2E\x61\x33\x61\x35\x34\x37\x39\x61\x32\x63\x37\x61\x61\x64\x30\x36\x33\x66\x30\x33\x64\x63\x34\x65\x31\x30\x36\x33\x61\x33\x61\x37\x2E
on host: ip-10-0-0-157.ec2.internal.Total scanned row: 12306477. Total
filtered/aggred row: 0. Time elapsed in EP: 16562(ms). Server CPU usage:
0.24348086721950246, server physical mem left: 7.195234304E9, server swap
mem left:0.0.Etc message: start latency: 15@1,agg done@13760,compress
done@16562,server stats done@16562,
debugGitTag:cf4d2940b67d622eacd2ac9a913b221091a35c2e;.Normal Complete: true.
2016-08-04 00:47:54,068 DEBUG [pool-6-thread-1] util.CompressionUtils:67 :
Original: 46465726 bytes. Decompressed: 150553629 bytes. Time: 1294
2016-08-04 00:48:29,303 INFO  [pool-4-thread-1]
threadpool.DefaultScheduler:106 : Job Fetcher: 0 running, 0 actual running,
0 ready, 12 others
2016-08-04 00:48:31,990 INFO  [http-bio-7070-exec-7]
service.QueryService:399 : Scan count for each storageContext: 12306477,
2016-08-04 00:48:31,991 INFO  [http-bio-7070-exec-7]
controller.QueryController:197 : Stats of SQL response: isException: false,
duration: 56152, total scan count 12306477
2016-08-04 00:48:32,000 WARN  [http-bio-7070-exec-7]
sizeof.ObjectGraphWalker:209 : The configured limit of 1,000 object
references was reached while attempting to calculate the size of the object
graph. Severe performance degradation could occur if the sizing operation
continues. This can be avoided by setting the CacheManger or Cache
<sizeOfPolicy> elements maxDepthExceededBehavior to "abort" or adding stop
points with @IgnoreSizeOf annotations. If performance degradation is NOT an
issue at the configured limit, raise the limit value using the CacheManager
or Cache <sizeOfPolicy> elements maxDepth attribute. For more information,
see the Ehcache configuration documentation.
2016-08-04 00:48:32,091 INFO  [http-bio-7070-exec-7]
service.QueryService:250 :
==========================[QUERY]===============================
SQL: SELECT SUM(clicks) FROM hpa_reporting2 GROUP BY site_id, child_id,
search_type, hotel_id, report_date
User: ADMIN
Success: true
Duration: 56.152
Project: koddi_dev
Realization Names: [hpa_reporting2_cube_clone]
Cuboid Ids: [992]
Total scan count: 12306477
Result row count: 50000
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
==========================[QUERY]===============================


On Wed, Aug 3, 2016 at 8:38 PM, ShaoFeng Shi <[email protected]> wrote:

> Hi Jason, could you please provide the full log since sending query to and
> getting result back? The key information is which cuboid is used for the
> query, cuboid exact match or fuzzy match, how many records be scanned and
> how long it tooks; Thanks.
>
> 2016-08-03 23:19 GMT+08:00 Jason Hale <[email protected]>:
>
> > Yes, it would have to do post-aggregation in that case, but the strange
> > thing is that query was running fast (about 1 second), while queries with
> > more dimensions, such as "SELECT SUM(clicks) FROM reporting GROUP BY
> > site_id, child_id, report_date, hotel_id". This query will take about 106
> > seconds, but it shouldn't need to do any post-aggregation so I would
> think
> > it should return much quicker than that from the respective cuboid.
> >
> > Here's the explain plan:
> > OLAPToEnumerableConverter
> > OLAPProjectRel(EXPR$0=[$4])
> > OLAPAggregateRel(group=[{0, 1, 2, 3}], EXPR$0=[SUM($4)])
> > OLAPProjectRel(SITE_ID=[$9], CHILD_ID=[$3], REPORT_DATE=[$0],
> > HOTEL_ID=[$2], CLICKS=[$10])
> > OLAPTableScan(table=[[DEFAULT, HPA_REPORTING2]], fields=[[0, 1, 2, 3, 4,
> 5,
> > 6, 7, 8, 9, 10, 11]])
> >
> > On Tue, Aug 2, 2016 at 7:46 PM, ShaoFeng Shi <[email protected]>
> > wrote:
> >
> > > In the cube definition, you defined "SITE_ID", "CHILD_ID" as
> "Mandatory"
> > > dimension, which means they will not be aggregated in cube build phase
> > for
> > > all combinations.
> > >
> > > So when you run a query like  "SELECT SUM(clicks) FROM reporting GROUP
> BY
> > > search_type", Kylin will use the combination  "SITE_ID" + "CHILD_ID" +
> > > "SEARCH_TYPE" to serve, there will be post-aggregation in runtime; The
> > > performance is much depent on the cardinality of "SITE_ID" and
> > "CHILD_ID".
> > >
> > >
> > > 2016-08-02 23:08 GMT+08:00 Jason Hale <[email protected]>:
> > >
> > > > I've looked over the optimization options before, but did not notice
> > the
> > > > rowkey ordering. I can try this and see if this helps me. This is the
> > > only
> > > > thing I see that I can attempt to optimize further in the design, but
> > > I'll
> > > > provide my cube design below. I only have one measure to keep it
> > simple:
> > > >
> > > > {
> > > >   "uuid": "4090b854-8f0c-4288-bd73-fc50238a6030",
> > > >   "version": "1.5.2",
> > > >   "name": "hpa_reporting2_cube",
> > > >   "description": "",
> > > >   "dimensions": [
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.REPORT_DATE",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "REPORT_DATE",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.SEARCH_TYPE",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "SEARCH_TYPE",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.HOTEL_ID",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "HOTEL_ID",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.CHILD_ID",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "CHILD_ID",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.COUNTRY",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "COUNTRY",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.DEVICE_TYPE",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "DEVICE_TYPE",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.STAY_LENGTH",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "STAY_LENGTH",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.TRUE_RANK_AG",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "TRUE_RANK_AG",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.ROOM_BUNDLE",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "ROOM_BUNDLE",
> > > >       "derived": null
> > > >     },
> > > >     {
> > > >       "name": "DEFAULT.HPA_REPORTING2.SITE_ID",
> > > >       "table": "DEFAULT.HPA_REPORTING2",
> > > >       "column": "SITE_ID",
> > > >       "derived": null
> > > >     }
> > > >   ],
> > > >   "measures": [
> > > >     {
> > > >       "name": "_COUNT_",
> > > >       "function": {
> > > >         "expression": "COUNT",
> > > >         "parameter": {
> > > >           "type": "constant",
> > > >           "value": "1",
> > > >           "next_parameter": null
> > > >         },
> > > >         "returntype": "bigint"
> > > >       },
> > > >       "dependent_measure_ref": null
> > > >     },
> > > >     {
> > > >       "name": "CLICKS",
> > > >       "function": {
> > > >         "expression": "SUM",
> > > >         "parameter": {
> > > >           "type": "column",
> > > >           "value": "CLICKS",
> > > >           "next_parameter": null
> > > >         },
> > > >         "returntype": "decimal"
> > > >       },
> > > >       "dependent_measure_ref": null
> > > >     }
> > > >   ],
> > > >   "rowkey": {
> > > >     "rowkey_columns": [
> > > >       {
> > > >         "column": "REPORT_DATE",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "SEARCH_TYPE",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "HOTEL_ID",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "CHILD_ID",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "COUNTRY",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "DEVICE_TYPE",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "STAY_LENGTH",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "TRUE_RANK_AG",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "ROOM_BUNDLE",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       },
> > > >       {
> > > >         "column": "SITE_ID",
> > > >         "encoding": "dict",
> > > >         "isShardBy": false
> > > >       }
> > > >     ]
> > > >   },
> > > >   "signature": "KixlKWxevr6mO+UlSaR5ig==",
> > > >   "last_modified": 1470070273935,
> > > >   "model_name": "hpa_reporting_model2",
> > > >   "null_string": null,
> > > >   "hbase_mapping": {
> > > >     "column_family": [
> > > >       {
> > > >         "name": "F1",
> > > >         "columns": [
> > > >           {
> > > >             "qualifier": "M",
> > > >             "measure_refs": [
> > > >               "_COUNT_",
> > > >               "CLICKS"
> > > >             ]
> > > >           }
> > > >         ]
> > > >       }
> > > >     ]
> > > >   },
> > > >   "aggregation_groups": [
> > > >     {
> > > >       "includes": [
> > > >         "REPORT_DATE",
> > > >         "SEARCH_TYPE",
> > > >         "HOTEL_ID",
> > > >         "CHILD_ID",
> > > >         "COUNTRY",
> > > >         "DEVICE_TYPE",
> > > >         "STAY_LENGTH",
> > > >         "TRUE_RANK_AG",
> > > >         "ROOM_BUNDLE",
> > > >         "SITE_ID"
> > > >       ],
> > > >       "select_rule": {
> > > >         "hierarchy_dims": [],
> > > >         "mandatory_dims": [
> > > >           "SITE_ID",
> > > >           "CHILD_ID"
> > > >         ],
> > > >         "joint_dims": [
> > > >           [
> > > >             "ROOM_BUNDLE",
> > > >             "TRUE_RANK_AG"
> > > >           ]
> > > >         ]
> > > >       }
> > > >     }
> > > >   ],
> > > >   "notify_list": [],
> > > >   "status_need_notify": [
> > > >     "ERROR",
> > > >     "DISCARDED",
> > > >     "SUCCEED"
> > > >   ],
> > > >   "partition_date_start": 0,
> > > >   "partition_date_end": 3153600000000,
> > > >   "auto_merge_time_ranges": [
> > > >     604800000,
> > > >     2419200000
> > > >   ],
> > > >   "retention_range": 0,
> > > >   "engine_type": 2,
> > > >   "storage_type": 2,
> > > >   "override_kylin_properties": {}
> > > > }
> > > >
> > > > On Mon, Aug 1, 2016 at 8:02 PM, ShaoFeng Shi <[email protected]
> >
> > > > wrote:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > As Yiming mentioned, the cube design matters for the performance of
> > > both
> > > > > build and query; please check "Optimize Cube" in the document web
> > page
> > > > and
> > > > > do optimizaiton as much as possible;
> > > > >
> > > > > Besides, the cluster's capacity and Hadoop configuration is also an
> > > > > important factor; Try to identify the bottleneck and then optimize
> or
> > > add
> > > > > capacity.
> > > > >
> > > > > From 1.5 Kylin ships with two cubing algorithm; the steps "Build
> > > > > N-Dimension Cuboid" are the legacy "Layered" cubing algorithm; They
> > > will
> > > > be
> > > > > skipped when Kylin selects to use the new "Fast" cubing algorithm,
> > > which
> > > > is
> > > > > the "Build Cube" step after them. Please click the hadoop link in
> > that
> > > > step
> > > > > to inspect the MR job's statistics;
> > > > >
> > > > > Hope this helps to some extend;
> > > > >
> > > > >
> > > > >
> > > > > 2016-08-02 8:44 GMT+08:00 Yiming Liu <[email protected]>:
> > > > >
> > > > > > Hi Jason,
> > > > > >
> > > > > > Cube design is the performance key for Kylin, not only query, but
> > > also
> > > > > cube
> > > > > > building process. How to select dimensions, how to define the
> > > > > relationship
> > > > > > between dimensions, how to select encode method, how to define
> > > measure,
> > > > > > even how to choose the Hbase key order will have a significant
> > impact
> > > > on
> > > > > > performance.  There are quite a few wonderful documents
> introducing
> > > how
> > > > > to
> > > > > > do this, http://kylin.apache.org/docs15/ .
> > > > > >
> > > > > > One more thing, if you could share your cube design, you would
> get
> > > help
> > > > > > easier here.
> > > > > >
> > > > > > 2016-08-02 7:20 GMT+08:00 Jason Hale <[email protected]>:
> > > > > >
> > > > > > > I'm setting up a test case for a portion of our dataset, to
> > > evaluate
> > > > > > Kylin
> > > > > > > and I'm not seeing the performance that I would expect.
> > > > > > >
> > > > > > > The cube building process is taking about 5-6 hours with
> > > ~69,000,000
> > > > > > > records and 10 dimensions. I'm not sure if that's the expected
> > > build
> > > > > > time,
> > > > > > > but the other problem is the query performance after building
> the
> > > > cube.
> > > > > > >
> > > > > > > All queries were tested with a very simple query (e.g. SELECT
> > > > > SUM(clicks)
> > > > > > > FROM reporting GROUP BY search_type)
> > > > > > >
> > > > > > > Grouping by 1 or 2 dimensions gives me very responsive queries
> > > > (under 2
> > > > > > > seconds), but adding more dimensions drastically increases the
> > > query
> > > > > time
> > > > > > > (over 1 minute and it times out through hbase). I would expect
> > > these
> > > > > > > queries to have all similar query times since they should query
> > the
> > > > > > > respective cuboid, so I'm not sure why the performance would
> > > suffer.
> > > > I
> > > > > > > didn't set up any special rules for the cube, but during the
> > build
> > > > > > process
> > > > > > > it showed all the N-dimension cubes and the log simply said
> > > > 'skipped'.
> > > > > > >
> > > > > > > Is there something I'm missing in the configuration?
> > > > > > >
> > > > > > > I have a HDP cluster with 3 nodes and 1 client node on which
> > Kylin
> > > is
> > > > > > > installed. Do I need to adjust the hadoop configuration. I'm
> > using
> > > > most
> > > > > > of
> > > > > > > the default HDP settings.
> > > > > > >
> > > > > > > What more information can I provide?
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > With Warm regards
> > > > > >
> > > > > > Yiming Liu (刘一鸣)
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > > Shaofeng Shi
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > Shaofeng Shi
> > >
> >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi
>

Re: Kylin Cube Performance

Reply via email to