[
https://issues.apache.org/jira/browse/KYLIN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245332#comment-17245332
]
ASF GitHub Bot commented on KYLIN-4818:
---------------------------------------
hit-lacus edited a comment on pull request #1485:
URL: https://github.com/apache/kylin/pull/1485#issuecomment-740020477
## CuboidStatisticsJob Profile Flame Graph
#### Tools
Refer to https://www.linkedin.com/pulse/profiling-spark-applications-one-click-michael-spector
#### Prepare env
- Hadoop env : HDP 2.4
- Cube : KylinSales (10000 lines)
- Commit : 2e13c8857700fd4d1c4e4daede6600562c62d494
#### Kylin Conf
```properties
kylin.metadata.url=KYLIN_4818_1@jdbc,url=jdbc:mysql://10.1.3.90:3306/NightlyBuild,username=root,password=R00t@kylin,maxActive=10,maxIdle=10
kylin.env.zookeeper-connect-string=cdh-master:2181
kylin.env.zookeeper-base-path=/kylin/regression_testing/KYLIN-4818-1
kylin.env.hdfs-working-dir=/kylin/regression_testing/KYLIN-4818-1
kylin.source.hive.database-for-flat-table=regression_testing
kylin.query.cache-enabled=false
kylin.job.scheduler.default=100
kylin.server.self-discovery-enabled=true
kylin.spark-conf.auto.prior=false
#kylin.cube.cubeplanner.enabled=false
kylin.engine.spark-conf.spark.executor.memory=6g
kylin.engine.spark-conf.spark.executor.memoryOverhead=1g
kylin.engine.spark-conf.spark.executor.instances=1
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-cmd=/usr/local/bin/spark-submit-flamegraph
kylin.cube.cubeplanner.enabled=true
```
#### Task Statistics Tab of Spark UI
<img width="1417" alt="image"
src="https://user-images.githubusercontent.com/14030549/101375345-3ce3d900-38ea-11eb-9a47-1f6bd16963ce.png">
#### Executor Log
```sh
LogType:stdout
Log Upload Time:Mon Dec 07 16:00:58 +0000 2020
LogLength:2920
Log Contents:
log4j: Trying to find [spark-executor-log4j.properties] using context
classloader sun.misc.Launcher$AppClassLoader@18b4aac2.
log4j: Using URL
[file:/hadoop/yarn/local/usercache/root/appcache/application_1606276600681_1970/container_e09_1606276600681_1970_01_000002/spark-executor-log4j.properties]
for automatic log4j configuration.
log4j: Reading configuration from URL
file:/hadoop/yarn/local/usercache/root/appcache/application_1606276600681_1970/container_e09_1606276600681_1970_01_000002/spark-executor-log4j.properties
log4j: Parsing for [root] with value=[INFO,stderr].
log4j: Level token is [INFO].
log4j: Category root set to INFO
log4j: Parsing appender named "stderr".
log4j: Parsing layout options for "stderr".
log4j: Setting property [conversionPattern] to [%d{ISO8601} %-5p [%t] %c{2}
: %m%n].
log4j: End of parsing for "stderr".
log4j: Setting property [target] to [System.err].
log4j: Parsed "stderr" options.
log4j: Finished configuring.
CuboidStatisticsJob-Init1-1607355948764
CuboidStatisticsJob-Init2-1607355948998
CuboidStatisticsJob-statisticsWithinPartition1-1607355949009
[10002313,10000349,0,2012-12-14,88750,Consumer Electronics,Vehicle
Electronics & GPS,Radar & Laser Detectors,1,2,FR,US,France,United
States,Others,0,ANALYST,Beijing]
[10004376,10000927,1,2012-08-28,175750,Home & Garden,Bedding,Blankets &
Throws,0,5,IT,FR,Italy,France,Others,0,ANALYST,Beijing]
[10006710,10000005,2,2012-02-16,148324,Phones,Mobile
Accessories,CaseCoverSkins,0,1,JP,CN,Japan,China,ABIN,15,ADMIN,Shanghai]
[10003717,10000209,3,2013-10-19,37831,Collectibles,Advertising,Merchandise &
Memorabilia,4,3,GB,FR,United Kingdom,France,FP-non GTC,0,ANALYST,Beijing]
[10006076,10000154,4,2012-10-22,140746,eBay Motors,Parts &
Accessories,Vintage Car & Truck
Parts,0,4,JP,FR,Japan,France,Others,100,ADMIN,Shanghai]
Stats
i :5001
meter1 :159
meter2 :279412
CuboidStatisticsJob-statisticsWithinPartition2-1607356229905
CuboidStatisticsJob-Init1-1607356230853
CuboidStatisticsJob-Init2-1607356231101
CuboidStatisticsJob-statisticsWithinPartition1-1607356231101
[10009393,10000949,5009,2012-09-06,51582,ClothinShoes & Accessories,Kids'
ClothinShoes & Accs,Girls' Clothing (Sizes 4 & Up),2,4,US,DE,United
States,Germany,FP-GTC,0,ADMIN,Shanghai]
[10002759,10000199,5010,2012-01-18,20865,ClothinShoes & Accessories,Men's
Clothing,Athletic Apparel,3,3,CN,FR,China,France,FP-GTC,0,ADMIN,Shanghai]
[10004825,10000098,5011,2013-04-25,20485,Home &
Garden,Furniture,Other,2,3,JP,JP,Japan,Japan,ABIN,0,ADMIN,Shanghai]
[10005962,10000244,5012,2013-12-01,145970,Toys & Hobbies,Models &
Kits,Automotive,5,4,JP,DE,Japan,Germany,FP-non GTC,0,ANALYST,Beijing]
[10004074,10000541,5013,2013-09-04,24541,Sports MeCards & Fan Shop,Fan
Apparel & Souvenirs,College-NCAA,2,2,FR,US,France,United
States,Auction,0,ADMIN,Shanghai]
Stats
i :4987
meter1 :93
meter2 :292977
CuboidStatisticsJob-statisticsWithinPartition2-1607356524809
End of LogType:stdout
```
### Flame graph
<img width="1196" alt="image"
src="https://user-images.githubusercontent.com/14030549/101375552-7f0d1a80-38ea-11eb-9c4b-29c04531899a.png">
<img width="1188" alt="image"
src="https://user-images.githubusercontent.com/14030549/101375669-a368f700-38ea-11eb-9f4d-ae6b5f57fece.png">
### Summary
From the Spark UI, there are two tasks for CuboidStatisticsJob. The first one has
`5001` input records and takes about 4.7 minutes, which means each row costs
about **56.39** (4.7 * 60000 / 5001) milliseconds.
From the executor log, `meter2` is much larger than `meter1` in both tasks
(`279412` vs. `159` in the first, `292977` vs. `93` in the second).
The flame graphs above indicate that `Long#toString` takes too much of the time.
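The per-row figure can also be cross-checked against the executor log. Below is a minimal sketch, assuming the numeric suffix on the `statisticsWithinPartition1` / `statisticsWithinPartition2` marker lines is an epoch-millisecond timestamp and that `i` is the number of rows processed in that partition (neither is confirmed in this comment). `per_row_cost_ms` is a hypothetical helper written for illustration, not part of Kylin.

```python
import re

# Marker and counter lines copied from the executor log above.
# Assumption: the numeric suffix is an epoch-millisecond timestamp.
LOG = """\
CuboidStatisticsJob-statisticsWithinPartition1-1607355949009
i :5001
CuboidStatisticsJob-statisticsWithinPartition2-1607356229905
CuboidStatisticsJob-statisticsWithinPartition1-1607356231101
i :4987
CuboidStatisticsJob-statisticsWithinPartition2-1607356524809
"""

MARKER = re.compile(r"statisticsWithinPartition([12])-(\d+)$")
ROWS = re.compile(r"^i\s*:(\d+)$")


def per_row_cost_ms(log_text):
    """Return one (rows, elapsed_ms, ms_per_row) tuple per task."""
    results = []
    start = rows = None
    for line in log_text.splitlines():
        line = line.strip()
        marker = MARKER.search(line)
        if marker:
            phase, ts = marker.group(1), int(marker.group(2))
            if phase == "1":
                # Start of a partition: remember the timestamp.
                start, rows = ts, None
            elif start is not None and rows:
                # End of a partition: compute elapsed time per row.
                elapsed = ts - start
                results.append((rows, elapsed, elapsed / rows))
                start = None
            continue
        counter = ROWS.match(line)
        if counter:
            rows = int(counter.group(1))
    return results


for rows, elapsed, cost in per_row_cost_ms(LOG):
    print(f"{rows} rows in {elapsed} ms -> {cost:.2f} ms/row")
```

Under those assumptions this gives about 56.17 ms per row for the first task and 58.89 ms per row for the second, close to the estimate taken from the Spark UI.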
> Calculate cuboid statistics in Kylin 4
> --------------------------------------
>
> Key: KYLIN-4818
> URL: https://issues.apache.org/jira/browse/KYLIN-4818
> Project: Kylin
> Issue Type: Sub-task
> Components: Spark Engine
> Reporter: Xiaoxiang Yu
> Assignee: Xiaoxiang Yu
> Priority: Major
> Fix For: v4.0.0-beta
>
>
> Referring to SparkFactDistinct.java in Kylin 3, I will try to use Spark to
> calculate (estimate) the row count and size of each cuboid candidate. The row
> count and size of each cuboid are the input for cube planner phase one and phase two.