[
https://issues.apache.org/jira/browse/HUDI-9044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936351#comment-17936351
]
Lin Liu commented on HUDI-9044:
-------------------------------
h3. HBase HFile Writer, Spark 3.5, Hudi 1.1.0-SNAPSHOT
* Script
{code}
./bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 30g \
  --executor-memory 13g \
  --num-executors 1 \
  --executor-cores 8 \
  --jars s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/hudi-spark3.5-bundle_2.12-1.1.0-SNAPSHOT-hbase.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.kryoserializer.buffer=256m \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.rdd.compress=true \
  --conf spark.memory.storageFraction=0.8 \
  --conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
  --conf spark.ui.proxyBase="" \
  --conf 'spark.eventLog.enabled=true' \
  --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
  --conf spark.hadoop.yarn.timeline-service.enabled=false \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --class io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark \
  s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/lake-plumber-1.0-SNAPSHOT.jar \
  --col_num 50 \
  --record_num 10000
{code}
* 50 columns, 10k rows: *392 ms*
* Raw output:
{code}
Generate 10000 records into the memory. Each record has 50 columns.
406.658 ms/op
Iteration 1: 388.718 ms/op
Iteration 2: 390.299 ms/op
Iteration 3: 391.731 ms/op
Iteration 4: 392.857 ms/op
Iteration 5: 394.294 ms/op

Result "io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark.writeHFileRecords":
  391.580 ±(99.9%) 8.358 ms/op [Average]
  (min, avg, max) = (388.718, 391.580, 394.294), stdev = 2.171
  CI (99.9%): [383.221, 399.938] (assumes normal distribution)

Benchmark                                     (benchmarkRoot)           (colNum)  (confFile)  (recordNum)  (runId)            Mode  Cnt  Score    Error    Units
SparkHFileWritingBenchmark.writeHFileRecords  file:///tmp/fs-benchmark  50                    10000        run-1742256784432  avgt  5    392.619  ± 9.546  ms/op
{code}
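As a sanity check, JMH's summary statistics can be reproduced from the five iteration times above. JMH reports a Student-t confidence interval with n−1 degrees of freedom; this is a minimal sketch rather than JMH's own code, and the t critical value is hardcoded (it corresponds to scipy.stats.t.ppf(0.9995, 4)) to avoid a scipy dependency:

```python
import math
import statistics

# Five measured iterations from the run above (ms/op).
times = [388.718, 390.299, 391.731, 392.857, 394.294]

mean = statistics.mean(times)    # reported as 391.580
stdev = statistics.stdev(times)  # sample stdev, reported as 2.171

# Two-sided 99.9% Student-t critical value for 4 degrees of freedom.
T_999_DF4 = 8.6103

# Half-width of the confidence interval, reported as "±(99.9%) 8.358".
half_width = T_999_DF4 * stdev / math.sqrt(len(times))
print(f"{mean:.3f} +/-(99.9%) {half_width:.3f} ms/op")
```

The recomputed values match the `Result` line above to rounding, which confirms the five listed iterations are exactly the ones behind the summary.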
* Script
{code}
./bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 30g \
  --executor-memory 13g \
  --num-executors 1 \
  --executor-cores 8 \
  --jars s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/hudi-spark3.5-bundle_2.12-1.1.0-SNAPSHOT-hbase.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.kryoserializer.buffer=256m \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.rdd.compress=true \
  --conf spark.memory.storageFraction=0.8 \
  --conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
  --conf spark.ui.proxyBase="" \
  --conf 'spark.eventLog.enabled=true' \
  --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
  --conf spark.hadoop.yarn.timeline-service.enabled=false \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --class io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark \
  s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/lake-plumber-1.0-SNAPSHOT.jar \
  --col_num 50 \
  --record_num 100000
{code}
* 50 columns, 100k rows: *2090 ms*
* Raw output:
{code}
Generate 100000 records into the memory. Each record has 50 columns.
2181.332 ms/op
Iteration 1: 2080.036 ms/op
Iteration 2: 2107.951 ms/op
Iteration 3: 2104.332 ms/op
Iteration 4: 2072.902 ms/op
Iteration 5: 2086.728 ms/op

Result "io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark.writeHFileRecords":
  2090.390 ±(99.9%) 58.689 ms/op [Average]
  (min, avg, max) = (2072.902, 2090.390, 2107.951), stdev = 15.241
  CI (99.9%): [2031.701, 2149.079] (assumes normal distribution)

# Run complete. Total time: 00:01:11

REMEMBER: The numbers below are just data. To gain reusable insights, you need to
follow up on why the numbers are the way they are. Use profilers (see -prof, -lprof),
design factorial experiments, perform baseline and negative tests that provide
experimental control, make sure the benchmarking environment is safe on JVM/OS/HW
level, ask for reviews from the domain experts. Do not assume the numbers tell you
what you want them to tell.

Benchmark                                     (benchmarkRoot)           (colNum)  (confFile)  (recordNum)  (runId)            Mode  Cnt  Score     Error     Units
SparkHFileWritingBenchmark.writeHFileRecords  file:///tmp/fs-benchmark  50                    100000       run-1742257039549  avgt  5    2090.390  ± 58.689  ms/op
{code}
* Script
{code}
./bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 30g \
  --executor-memory 13g \
  --num-executors 1 \
  --executor-cores 8 \
  --jars s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/hudi-spark3.5-bundle_2.12-1.1.0-SNAPSHOT-hbase.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.kryoserializer.buffer=256m \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.rdd.compress=true \
  --conf spark.memory.storageFraction=0.8 \
  --conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
  --conf spark.ui.proxyBase="" \
  --conf 'spark.eventLog.enabled=true' \
  --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
  --conf spark.hadoop.yarn.timeline-service.enabled=false \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --class io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark \
  s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/lake-plumber-1.0-SNAPSHOT.jar \
  --col_num 50 \
  --record_num 1000000
{code}
* 50 columns, 1m rows: *19640 ms*
* Raw output (standard JMH "REMEMBER" note omitted; it is identical to the one above):
{code}
Generate 1000000 records into the memory. Each record has 50 columns.
20017.346 ms/op
Iteration 1: 19578.926 ms/op
Iteration 2: 19600.253 ms/op
Iteration 3: 19725.853 ms/op
Iteration 4: 19677.523 ms/op
Iteration 5: 19619.000 ms/op

Result "io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark.writeHFileRecords":
  19640.311 ±(99.9%) 232.072 ms/op [Average]
  (min, avg, max) = (19578.926, 19640.311, 19725.853), stdev = 60.268
  CI (99.9%): [19408.239, 19872.383] (assumes normal distribution)

# Run complete. Total time: 00:02:16

Benchmark                                     (benchmarkRoot)           (colNum)  (confFile)  (recordNum)  (runId)            Mode  Cnt  Score      Error      Units
SparkHFileWritingBenchmark.writeHFileRecords  file:///tmp/fs-benchmark  50                    1000000      run-1742257232820  avgt  5    19640.311  ± 232.072  ms/op
{code}
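For context, the average scores for the three HBase-writer runs translate into per-row throughput as follows (plain arithmetic on the numbers reported above). Throughput roughly doubles from 10k to 100k rows and then nearly plateaus, suggesting per-invocation fixed cost dominates only the smallest run:

```python
# (record_num, avg ms/op) for the HBase HFile writer runs above.
runs = [(10_000, 391.580), (100_000, 2090.390), (1_000_000, 19640.311)]

for records, ms in runs:
    # Convert average ms per operation into rows written per second.
    rows_per_sec = records / (ms / 1000.0)
    print(f"{records:>9,} rows: {rows_per_sec:,.0f} rows/s")
```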
h3. Native HFile Writer, Spark 3.5, Hudi 1.1.0-SNAPSHOT
* Script
{code}
./bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 30g \
  --executor-memory 13g \
  --num-executors 1 \
  --executor-cores 8 \
  --jars s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/hudi-spark3.5-bundle_2.12-1.1.0-SNAPSHOT-native.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.kryoserializer.buffer=256m \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.rdd.compress=true \
  --conf spark.memory.storageFraction=0.8 \
  --conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
  --conf spark.ui.proxyBase="" \
  --conf 'spark.eventLog.enabled=true' \
  --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
  --conf spark.hadoop.yarn.timeline-service.enabled=false \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --class io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark \
  s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/lake-plumber-1.0-SNAPSHOT.jar \
  --col_num 50 \
  --record_num 10000
{code}
* 50 columns, 10k rows: *388 ms*
* Raw output (standard JMH "REMEMBER" note omitted; it is identical to the one above):
{code}
Generate 10000 records into the memory. Each record has 50 columns.
411.100 ms/op
Iteration 1: 389.102 ms/op
Iteration 2: 387.694 ms/op
Iteration 3: 392.891 ms/op
Iteration 4: 386.010 ms/op
Iteration 5: 385.840 ms/op

Result "io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark.writeHFileRecords":
  388.307 ±(99.9%) 11.124 ms/op [Average]
  (min, avg, max) = (385.840, 388.307, 392.891), stdev = 2.889
  CI (99.9%): [377.184, 399.431] (assumes normal distribution)

# Run complete. Total time: 00:01:08

Benchmark                                     (benchmarkRoot)           (colNum)  (confFile)  (recordNum)  (runId)            Mode  Cnt  Score    Error     Units
SparkHFileWritingBenchmark.writeHFileRecords  file:///tmp/fs-benchmark  50                    10000        run-1742257607937  avgt  5    388.307  ± 11.124  ms/op
{code}
* Script
{code}
./bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 30g \
  --executor-memory 13g \
  --num-executors 1 \
  --executor-cores 8 \
  --jars s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/hudi-spark3.5-bundle_2.12-1.1.0-SNAPSHOT-native.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.kryoserializer.buffer=256m \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.rdd.compress=true \
  --conf spark.memory.storageFraction=0.8 \
  --conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
  --conf spark.ui.proxyBase="" \
  --conf 'spark.eventLog.enabled=true' \
  --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
  --conf spark.hadoop.yarn.timeline-service.enabled=false \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --class io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark \
  s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/lake-plumber-1.0-SNAPSHOT.jar \
  --col_num 50 \
  --record_num 100000
{code}
* 50 columns, 100k rows: *2109 ms*
* Raw output (standard JMH "REMEMBER" note omitted; it is identical to the one above):
{code}
Generate 100000 records into the memory. Each record has 50 columns.
2194.148 ms/op
Iteration 1: 2082.425 ms/op
Iteration 2: 2116.891 ms/op
Iteration 3: 2123.349 ms/op
Iteration 4: 2107.630 ms/op
Iteration 5: 2117.169 ms/op

Result "io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark.writeHFileRecords":
  2109.493 ±(99.9%) 62.144 ms/op [Average]
  (min, avg, max) = (2082.425, 2109.493, 2123.349), stdev = 16.139
  CI (99.9%): [2047.349, 2171.637] (assumes normal distribution)

# Run complete. Total time: 00:01:12

Benchmark                                     (benchmarkRoot)           (colNum)  (confFile)  (recordNum)  (runId)            Mode  Cnt  Score     Error     Units
SparkHFileWritingBenchmark.writeHFileRecords  file:///tmp/fs-benchmark  50                    100000       run-1742257732601  avgt  5    2109.493  ± 62.144  ms/op
{code}
* Script
{code}
./bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 30g \
  --executor-memory 13g \
  --num-executors 1 \
  --executor-cores 8 \
  --jars s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/hudi-spark3.5-bundle_2.12-1.1.0-SNAPSHOT-native.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=file:/home/hadoop/warn.log4j.properties" \
  --conf spark.kryoserializer.buffer=256m \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.rdd.compress=true \
  --conf spark.memory.storageFraction=0.8 \
  --conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC" \
  --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
  --conf spark.ui.proxyBase="" \
  --conf 'spark.eventLog.enabled=true' \
  --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
  --conf spark.hadoop.yarn.timeline-service.enabled=false \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --conf spark.sql.catalogImplementation=in-memory \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --class io.bytearray.benchmarks.createhandle.SparkHFileWritingBenchmark \
  s3a://performance-benchmark-datasets-us-west-2/jenkins/benchmarks/input/sample_tables/test_case1/lake-plumber-1.0-SNAPSHOT.jar \
  --col_num 50 \
  --record_num 1000000
{code}
* 50 columns, 1m rows: *19441 ms*
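Putting the two writers side by side (plain arithmetic on the averages reported above; the native 1M figure is the rounded summary value, since the detailed output for that run is not included here), the deltas are about ±1%, well within the reported 99.9% error bars at every size, so the two writers are effectively on par for this workload:

```python
# (rows, HBase writer avg ms/op, native writer avg ms/op) from the runs above.
results = [
    (10_000, 391.580, 388.307),
    (100_000, 2090.390, 2109.493),
    (1_000_000, 19640.311, 19441.0),  # native 1M: rounded summary figure
]

for rows, hbase_ms, native_ms in results:
    # Negative means the native writer was faster than the HBase writer.
    delta_pct = (native_ms - hbase_ms) / hbase_ms * 100.0
    print(f"{rows:>9,} rows: native vs HBase {delta_pct:+.2f}%")
```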
> Performance test
> ----------------
>
> Key: HUDI-9044
> URL: https://issues.apache.org/jira/browse/HUDI-9044
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Lin Liu
> Assignee: Lin Liu
> Priority: Major
> Original Estimate: 5h
> Remaining Estimate: 5h
>
> Compare the performance of the old and new HFileWriter.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)