[ 
https://issues.apache.org/jira/browse/KYLIN-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zongwei Li updated KYLIN-3672:
------------------------------
    Description: 
Hi, Kylin Team

We found one Kylin performance bug during performance tuning for our BI report 
integrate with Kylin.

 

+Background+

Our BI report show customer usage report to enterprise customers, provide 15 
usage charts in report page.

Each chart need send API request to Kylin with different SQLs. So it means for 
one user, it will trigger 15 API calls(by JDBC) to Kylin.

For our product scale, we need support at least 20 users to review the report 
at same time for each Kylin query node.

So it means each Kylin node should be able to handle 15 * 20 = 300 queries  per 
second.

 

+Performance Report+

To reduce the network impact. We built up Kylin cluster and testing machine in 
the same network with Hadoop system.

We use gatling and Jmeter tools to do several round testing, result as follow.

 
|Thread|Handled Queries (in 60 seconds)|Handled Queries (per second)|Mean 
Response Time
 (ms)|
|1|773|13|77|
|15|3245|54|279|
|25|3844|64|390|
|50|4912|82|612|
|75|5405|90|841|
|100|5436|91|1108|
|150|5434|91|1688|

 

And draw the trend chart as follow:

!TrendChartBeforeFix.png!

 

+Conclusion+

>From the trend, when the thread count reach 75, the handled queries per second 
>reaches peak data 90, and cannot improved by increase the thread count.

Each Kylin query engine can handle 90 queries per second, it means only support 
90/15 = 6 users to review report page at same time.

Even we setup 3 query nodes, can extend to 18 users at same time, this 
performance capacity cannot meet our business requirement.

 

+Analyze+

>From test result, response for one thread is fast, but as the thread increase, 
>throughput of Kylin not increased as we expected.

We have full code review for Kylin query engine, and use Jstack and JProfile to 
do analyze, found the root cause for this performance bottleneck.

This is one regression bug introduced by new feature involved one year before.

With bug fixing, one Kylin node can handle 350+ queries per second. Submit this 
bug for contribute patch to Kylin.

+Jstack Log Analyze+

We use Jstack to capture thread info during performance testing. Already attach 
one of them 'jstackBeforeBugFix.log'.

>From the log, we can found that 

One thread locked at sun.misc.URLClassPath.getNextLoader. TID is 
{color:#FF0000}*0x000000048007a180*{color}

 
{{"Query e9c44a2d-6226-ff3b-f984-ce8489107d79-3425"}} {{#}}{{3425}} {{daemon 
prio=}}{{5}} {{os_prio=}}{{0}} {{tid=}}{{0x000000000472b000}} 
{{nid=}}{{0x1433}} {{waiting }}{{for}} {{monitor entry 
[}}{{0x00007f272e40d000}}{{]}}
 
{{   }}{{java.lang.Thread.State: BLOCKED (on object monitor)}}
 
{{    }}{{at 
sun.misc.URLClassPath.getNextLoader(URLClassPath.java:}}{{469}}{{)}}
 
{{    }}{{- locked <}}{{0x000000048007a180}}{{> (a sun.misc.URLClassPath)}}
 
{{    }}{{at sun.misc.URLClassPath.findResource(URLClassPath.java:}}{{214}}{{)}}
 
{{    }}{{at 
java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{569}}{{)}}
 
{{    }}{{at 
java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{567}}{{)}}
 
{{    }}{{at java.security.AccessController.doPrivileged(Native Method)}}
 
{{    }}{{at 
java.net.URLClassLoader.findResource(URLClassLoader.java:}}{{566}}{{)}}
 
{{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1096}}{{)}}
 
{{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1091}}{{)}}
 
{{    }}{{at 
org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:}}{{1666}}{{)}}
 
{{    }}{{at 
org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:}}{{338}}{{)}}
 
43 threads waiting to lock <{color:#FF0000}*0x000000048007a180*{color}> 
 
{{"Query f1f0bbec-a3f7-04b2-1ac6-fd3e03a0232d-4002"}} {{#}}{{4002}} {{daemon 
prio=}}{{5}} {{os_prio=}}{{0}} {{tid=}}{{0x00007f27e71e7800}} 
{{nid=}}{{0x1676}} {{waiting }}{{for}} {{monitor entry 
[}}{{0x00007f279f503000}}{{]}}
{{   }}{{java.lang.Thread.State: BLOCKED (on object monitor)}}
{{    }}{{at 
sun.misc.URLClassPath.getNextLoader(URLClassPath.java:}}{{469}}{{)}}
{{    }}{{- waiting to lock <}}{{0x000000048007a180}}{{> (a 
sun.misc.URLClassPath)}}
{{    }}{{at sun.misc.URLClassPath.findResource(URLClassPath.java:}}{{214}}{{)}}
{{    }}{{at 
java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{569}}{{)}}
{{    }}{{at 
java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{567}}{{)}}
{{    }}{{at java.security.AccessController.doPrivileged(Native Method)}}
{{    }}{{at 
java.net.URLClassLoader.findResource(URLClassLoader.java:}}{{566}}{{)}}
{{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1096}}{{)}}
{{    }}{{at java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1091}}{{)}}
{{    }}{{at 
org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:}}{{1666}}{{)}}
{{    }}{{at 
org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:}}{{338}}{{)}}
+Kylin Server Info+
|*Role*|*vCPU*|*Memory(GB)*|*Volume(GB)*|
|Query Engine|16 (2.4G)|128|1024|

 

+Kylin Package+

apache-kylin-2.5.0-bin-cdh57.tar.gz (release package)

 

+Query SQL+

SQL with PreparedStatement cache enabled. (New feature in Kylin 2.5.0. If no 
PreparedStatement cache, performance will be more worse). Filter will hit all 6 
segments.

 

+Cube Info+

Segment Number: 6 Total Size: 47 MB

 

Segment: 20180101000000_20181011000000

Start Time: 2018-01-01 00:00:00

End Time: 2018-10-11 00:00:00

Source Count: 351934019

HBase Table: KYLIN_69Q9A850DZ

Region Count: 1

Size: 47 MB

 

Segment: 20181011000000_20181012000000

Start Time: 2018-10-11 00:00:00

End Time: 2018-10-12 00:00:00

Source Count: 7085485

HBase Table: KYLIN_ZCT39S8FUA

Region Count: 1

Size: less than 1 MB

 

 

Segment: 20181012000000_20181013000000

Start Time: 2018-10-12 00:00:00

End Time: 2018-10-13 00:00:00

Source Count: 5534968

HBase Table: KYLIN_RKRRLA958T

Region Count: 1

Size: less than 1 MB

 

Segment: 20181013000000_20181014000000

Start Time: 2018-10-13 00:00:00

End Time: 2018-10-14 00:00:00

Source Count: 242856

HBase Table: KYLIN_Q6DKCONN81

Region Count: 1

Size: less than 1 MB

 

Segment: 20181014000000_20181015000000

Start Time: 2018-10-14 00:00:00

End Time: 2018-10-15 00:00:00

Source Count: 236122

HBase Table: KYLIN_JY4WQD2MJH

Region Count: 1

Size: less than 1 MB

 

Segment: 20181015000000_20181016000000

Start Time: 2018-10-15 00:00:00

End Time: 2018-10-16 00:00:00

Source Count: 6172353

HBase Table: KYLIN_E2ELLINV22

Region Count: 1

Size: less than 1 MB

 

+HBase Region Server+ 

Count: 6

hbase.regionserver.handler.count: 120.

Not blocked in CoProcessor RPC call.

  was:
Hi, Kylin Team

We found one Kylin performance bug during performance tuning for our BI report 
integrate with Kylin.

 

+Background+

Our BI report show customer usage report to enterprise customers, provide 15 
usage charts in report page.

Each chart need send API request to Kylin with different SQLs. So it means for 
one user, it will trigger 15 API calls(by JDBC) to Kylin.

For our product scale, we need support at least 20 users to review the report 
at same time for each Kylin query node.

So it means each Kylin node should be able to handle 15 * 20 = 300 queries  per 
second.

 

+Performance Report+

To reduce the network impact. We built up Kylin cluster and testing machine in 
the same network with Hadoop system.

We use gatling and Jmeter tools to do several round testing, result as follow.

 
|Thread|Handled Queries (in 60 seconds)|Handled Queries (per second)|Mean 
Response Time
(ms)|
|1|773|13|77|
|15|3245|54|279|
|25|3844|64|390|
|50|4912|82|612|
|75|5405|90|841|
|100|5436|91|1108|
|150|5434|91|1688|

 

And draw the trend chart as follow:

!TrendChartBeforeFix.png!

 

+Conclusion+

>From the trend, when the thread count reach 75, the handled queries per second 
>reaches peak data 90, and cannot improved by increase the thread count.

Each Kylin query engine can handle 90 queries per second, it means only support 
90/15 = 6 users to review report page at same time.

Even we setup 3 query nodes, can extend to 18 users at same time, this 
performance capacity cannot meet our business requirement.

 

+Analyze+

>From test result, response for one thread is fast, but as the thread increase, 
>throughput of Kylin not increased as we expected.

We have full code review for Kylin query engine, and use Jstack and JProfile to 
do analyze, found the root cause for this performance bottleneck.

This is one regression bug introduced by new feature involved one year before.

With bug fixing, one Kylin node can handle 350+ queries per second. Submit this 
bug for contribute patch to Kylin.

 

+Kylin Server Info+
|*Role*|*vCPU*|*Memory(GB)*|*Volume(GB)*|
|Query Engine|16 (2.4G)|128|1024|

 

+Kylin Package+

apache-kylin-2.5.0-bin-cdh57.tar.gz (release package)

 

+Query SQL+

SQL with PreparedStatement cache enabled. (New feature in Kylin 2.5.0. If no 
PreparedStatement cache, performance will be more worse). Filter will hit all 6 
segments.

 

+Cube Info+

Segment Number: 6 Total Size: 47 MB

 

Segment: 20180101000000_20181011000000

Start Time: 2018-01-01 00:00:00

End Time: 2018-10-11 00:00:00

Source Count: 351934019

HBase Table: KYLIN_69Q9A850DZ

Region Count: 1

Size: 47 MB

 

Segment: 20181011000000_20181012000000

Start Time: 2018-10-11 00:00:00

End Time: 2018-10-12 00:00:00

Source Count: 7085485

HBase Table: KYLIN_ZCT39S8FUA

Region Count: 1

Size: less than 1 MB

 

 

Segment: 20181012000000_20181013000000

Start Time: 2018-10-12 00:00:00

End Time: 2018-10-13 00:00:00

Source Count: 5534968

HBase Table: KYLIN_RKRRLA958T

Region Count: 1

Size: less than 1 MB

 

Segment: 20181013000000_20181014000000

Start Time: 2018-10-13 00:00:00

End Time: 2018-10-14 00:00:00

Source Count: 242856

HBase Table: KYLIN_Q6DKCONN81

Region Count: 1

Size: less than 1 MB

 

Segment: 20181014000000_20181015000000

Start Time: 2018-10-14 00:00:00

End Time: 2018-10-15 00:00:00

Source Count: 236122

HBase Table: KYLIN_JY4WQD2MJH

Region Count: 1

Size: less than 1 MB

 

Segment: 20181015000000_20181016000000

Start Time: 2018-10-15 00:00:00

End Time: 2018-10-16 00:00:00

Source Count: 6172353

HBase Table: KYLIN_E2ELLINV22

Region Count: 1

Size: less than 1 MB

 

+HBase Region Server+ 

Count: 6

hbase.regionserver.handler.count: 120.

Not blocked in CoProcessor RPC call.


> Performance is poor when multiple queries occur in short period
> ---------------------------------------------------------------
>
>                 Key: KYLIN-3672
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3672
>             Project: Kylin
>          Issue Type: Bug
>          Components: Query Engine
>    Affects Versions: v2.5.0
>         Environment: CentOS 6.7, HBase 1.2.0+cdh5.14.2+456
>            Reporter: Zongwei Li
>            Assignee: Zongwei Li
>            Priority: Critical
>              Labels: patch, performance
>         Attachments: KYLIN-3672.master.001.patch, TrendChartBeforeFix.png, 
> jstackBeforeBugFix.log
>
>
> Hi, Kylin Team
> We found one Kylin performance bug during performance tuning for our BI 
> report integrate with Kylin.
>  
> +Background+
> Our BI report show customer usage report to enterprise customers, provide 15 
> usage charts in report page.
> Each chart need send API request to Kylin with different SQLs. So it means 
> for one user, it will trigger 15 API calls(by JDBC) to Kylin.
> For our product scale, we need support at least 20 users to review the report 
> at same time for each Kylin query node.
> So it means each Kylin node should be able to handle 15 * 20 = 300 queries  
> per second.
>  
> +Performance Report+
> To reduce the network impact. We built up Kylin cluster and testing machine 
> in the same network with Hadoop system.
> We use gatling and Jmeter tools to do several round testing, result as follow.
>  
> |Thread|Handled Queries (in 60 seconds)|Handled Queries (per second)|Mean 
> Response Time
>  (ms)|
> |1|773|13|77|
> |15|3245|54|279|
> |25|3844|64|390|
> |50|4912|82|612|
> |75|5405|90|841|
> |100|5436|91|1108|
> |150|5434|91|1688|
>  
> And draw the trend chart as follow:
> !TrendChartBeforeFix.png!
>  
> +Conclusion+
> From the trend, when the thread count reach 75, the handled queries per 
> second reaches peak data 90, and cannot improved by increase the thread count.
> Each Kylin query engine can handle 90 queries per second, it means only 
> support 90/15 = 6 users to review report page at same time.
> Even we setup 3 query nodes, can extend to 18 users at same time, this 
> performance capacity cannot meet our business requirement.
>  
> +Analyze+
> From test result, response for one thread is fast, but as the thread 
> increase, throughput of Kylin not increased as we expected.
> We have full code review for Kylin query engine, and use Jstack and JProfile 
> to do analyze, found the root cause for this performance bottleneck.
> This is one regression bug introduced by new feature involved one year before.
> With bug fixing, one Kylin node can handle 350+ queries per second. Submit 
> this bug for contribute patch to Kylin.
> +Jstack Log Analyze+
> We use Jstack to capture thread info during performance testing. Already 
> attach one of them 'jstackBeforeBugFix.log'.
> From the log, we can found that 
> One thread locked at sun.misc.URLClassPath.getNextLoader. TID is 
> {color:#FF0000}*0x000000048007a180*{color}
>  
> {{"Query e9c44a2d-6226-ff3b-f984-ce8489107d79-3425"}} {{#}}{{3425}} {{daemon 
> prio=}}{{5}} {{os_prio=}}{{0}} {{tid=}}{{0x000000000472b000}} 
> {{nid=}}{{0x1433}} {{waiting }}{{for}} {{monitor entry 
> [}}{{0x00007f272e40d000}}{{]}}
>  
> {{   }}{{java.lang.Thread.State: BLOCKED (on object monitor)}}
>  
> {{    }}{{at 
> sun.misc.URLClassPath.getNextLoader(URLClassPath.java:}}{{469}}{{)}}
>  
> {{    }}{{- locked <}}{{0x000000048007a180}}{{> (a sun.misc.URLClassPath)}}
>  
> {{    }}{{at 
> sun.misc.URLClassPath.findResource(URLClassPath.java:}}{{214}}{{)}}
>  
> {{    }}{{at 
> java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{569}}{{)}}
>  
> {{    }}{{at 
> java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{567}}{{)}}
>  
> {{    }}{{at java.security.AccessController.doPrivileged(Native Method)}}
>  
> {{    }}{{at 
> java.net.URLClassLoader.findResource(URLClassLoader.java:}}{{566}}{{)}}
>  
> {{    }}{{at 
> java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1096}}{{)}}
>  
> {{    }}{{at 
> java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1091}}{{)}}
>  
> {{    }}{{at 
> org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:}}{{1666}}{{)}}
>  
> {{    }}{{at 
> org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:}}{{338}}{{)}}
>  
> 43 threads waiting to lock <{color:#FF0000}*0x000000048007a180*{color}> 
>  
> {{"Query f1f0bbec-a3f7-04b2-1ac6-fd3e03a0232d-4002"}} {{#}}{{4002}} {{daemon 
> prio=}}{{5}} {{os_prio=}}{{0}} {{tid=}}{{0x00007f27e71e7800}} 
> {{nid=}}{{0x1676}} {{waiting }}{{for}} {{monitor entry 
> [}}{{0x00007f279f503000}}{{]}}
> {{   }}{{java.lang.Thread.State: BLOCKED (on object monitor)}}
> {{    }}{{at 
> sun.misc.URLClassPath.getNextLoader(URLClassPath.java:}}{{469}}{{)}}
> {{    }}{{- waiting to lock <}}{{0x000000048007a180}}{{> (a 
> sun.misc.URLClassPath)}}
> {{    }}{{at 
> sun.misc.URLClassPath.findResource(URLClassPath.java:}}{{214}}{{)}}
> {{    }}{{at 
> java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{569}}{{)}}
> {{    }}{{at 
> java.net.URLClassLoader$}}{{2}}{{.run(URLClassLoader.java:}}{{567}}{{)}}
> {{    }}{{at java.security.AccessController.doPrivileged(Native Method)}}
> {{    }}{{at 
> java.net.URLClassLoader.findResource(URLClassLoader.java:}}{{566}}{{)}}
> {{    }}{{at 
> java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1096}}{{)}}
> {{    }}{{at 
> java.lang.ClassLoader.getResource(ClassLoader.java:}}{{1091}}{{)}}
> {{    }}{{at 
> org.apache.catalina.loader.WebappClassLoaderBase.getResource(WebappClassLoaderBase.java:}}{{1666}}{{)}}
> {{    }}{{at 
> org.apache.kylin.common.KylinConfig.buildSiteOrderedProps(KylinConfig.java:}}{{338}}{{)}}
> +Kylin Server Info+
> |*Role*|*vCPU*|*Memory(GB)*|*Volume(GB)*|
> |Query Engine|16 (2.4G)|128|1024|
>  
> +Kylin Package+
> apache-kylin-2.5.0-bin-cdh57.tar.gz (release package)
>  
> +Query SQL+
> SQL with PreparedStatement cache enabled. (New feature in Kylin 2.5.0. If no 
> PreparedStatement cache, performance will be more worse). Filter will hit all 
> 6 segments.
>  
> +Cube Info+
> Segment Number: 6 Total Size: 47 MB
>  
> Segment: 20180101000000_20181011000000
> Start Time: 2018-01-01 00:00:00
> End Time: 2018-10-11 00:00:00
> Source Count: 351934019
> HBase Table: KYLIN_69Q9A850DZ
> Region Count: 1
> Size: 47 MB
>  
> Segment: 20181011000000_20181012000000
> Start Time: 2018-10-11 00:00:00
> End Time: 2018-10-12 00:00:00
> Source Count: 7085485
> HBase Table: KYLIN_ZCT39S8FUA
> Region Count: 1
> Size: less than 1 MB
>  
>  
> Segment: 20181012000000_20181013000000
> Start Time: 2018-10-12 00:00:00
> End Time: 2018-10-13 00:00:00
> Source Count: 5534968
> HBase Table: KYLIN_RKRRLA958T
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181013000000_20181014000000
> Start Time: 2018-10-13 00:00:00
> End Time: 2018-10-14 00:00:00
> Source Count: 242856
> HBase Table: KYLIN_Q6DKCONN81
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181014000000_20181015000000
> Start Time: 2018-10-14 00:00:00
> End Time: 2018-10-15 00:00:00
> Source Count: 236122
> HBase Table: KYLIN_JY4WQD2MJH
> Region Count: 1
> Size: less than 1 MB
>  
> Segment: 20181015000000_20181016000000
> Start Time: 2018-10-15 00:00:00
> End Time: 2018-10-16 00:00:00
> Source Count: 6172353
> HBase Table: KYLIN_E2ELLINV22
> Region Count: 1
> Size: less than 1 MB
>  
> +HBase Region Server+ 
> Count: 6
> hbase.regionserver.handler.count: 120.
> Not blocked in CoProcessor RPC call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to