Sam Huang created HUDI-943:
------------------------------
Summary: Slow performance observed when inserting data into Hudi
table
Key: HUDI-943
URL: https://issues.apache.org/jira/browse/HUDI-943
Project: Apache Hudi
Issue Type: Test
Components: Performance
Reporter: Sam Huang
I am using Datasource Writer API to write 5000 records into Hudi copy-on-write
table, each with 8 columns and the total size is less than 1Mb. Please refer to
the below code.
{code:java}
Dataset<Row> ds1 = spark.read().json(jsc.parallelize(records, 2));
DataFrameWriter<Row> writer = ds1.write().format("org.apache.hudi")
.option("hoodie.insert.shuffle.parallelism", 2)
.option("hoodie.upsert.shuffle.parallelism", 2)
.option(DataSourceWriteOptions.OPERATION_OPT_KEY(),
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL())
.option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), tableType)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), recordKey)
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), partitionPath)
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), precombineKey)
.option(HoodieWriteConfig.TABLE_NAME, hudiTableName)
.option(HoodieCompactionConfig.INLINE_COMPACT_PROP, hudiWriteCompactEnabled)
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), hudiTableName)
.option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), hiveDatabase)
.option(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), hiveServerUrl)
.option(DataSourceWriteOptions.HIVE_USER_OPT_KEY(), hiveUser)
.option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), hivePassword)
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), hiveSyncEnabled)
.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), partitionPath)
.mode(SaveMode.Append);
writer.save(basePath);
{code}
At the beginning, it only takes 3~4 seconds to finish the insert operation, but
it gets longer and longer, say 30 seconds after 5 minutes.
>From the below spark logs, it shows that most of time is spent on
>HoodieSparkSqlWriter count task.
{noformat}
2020-05-25 16:36:37,851 | INFO | [dag-scheduler-event-loop] | Adding task set
185.0 with 1 tasks |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2020-05-25 16:36:37,851 | INFO | [dispatcher-event-loop-0] | Starting task 0.0
in stage 185.0 (TID 190, node-ana-corepOlf, executor 2, partition 0,
NODE_LOCAL, 7651 bytes) |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2020-05-25 16:36:37,858 | INFO | [dispatcher-event-loop-1] | Added
broadcast_124_piece0 in memory on node-ana-corepOlf:36554 (size: 138.1 KB,
free: 29.2 GB) |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2020-05-25 16:36:37,887 | INFO | [dispatcher-event-loop-1] | Asked to send map
output locations for shuffle 53 to 10.155.114.97:32461 |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2020-05-25 16:37:11,098 | INFO | [dispatcher-event-loop-0] | Added rdd_381_0
in memory on node-ana-corepOlf:36554 (size: 387.0 B, free: 29.2 GB) |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2020-05-25 16:37:11,111 | INFO | [task-result-getter-2] | Finished task 0.0 in
stage 185.0 (TID 190) in 33260 ms on node-ana-corepOlf (executor 2) (1/1) |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2020-05-25 16:37:11,111 | INFO | [task-result-getter-2] | Removed TaskSet
185.0, whose tasks have all completed, from pool |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2020-05-25 16:37:11,112 | INFO | [dag-scheduler-event-loop] | ResultStage 185
(count at HoodieSparkSqlWriter.scala:254) finished in 33.308 s |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
2020-05-25 16:37:11,113 | INFO | [Driver] | Job 70 finished: count at
HoodieSparkSqlWriter.scala:254, took 33.438673 s |
org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
{noformat}
I tried to tune the parameter hoodie.insert.shuffle.parallelism to 20, but did
not help. And the CPU/Heap usages are all normal.
Below is the setting for application.
Executor instance: 2
Executor memory: 55g
Executor cores: 4
Driver memory: 4g
--
This message was sent by Atlassian Jira
(v8.3.4#803005)