FeiZou opened a new issue #3418:
URL: https://github.com/apache/hudi/issues/3418
**Describe the problem you faced**
Hi there, I'm migrating a table from an S3 data lake to a Hudi data lake using
Spark. The source table is around `600 GB` and `8 billion rows`; each
partition contains around `1.5 GB` of data and `20 million rows`. The target Hudi
table is non-partitioned and currently around `260 GB`. With `30 executors,
150 total cores, 32 GB memory per executor`, it takes more than `3 hours`
to upsert a single partition into the Hudi table. If I reduce the executor count
to `15`, the job fails with a `No Space Left On Device` error during the
upsert. (We are using EC2 instances with `300 GB` EBS volumes as Spark workers.)
The Hudi and Spark configs I'm currently using are below:
```scala
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> "hudi_recordings",
  HoodieWriteConfig.WRITE_PAYLOAD_CLASS ->
    "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload",
  DataSourceWriteOptions.OPERATION_OPT_KEY ->
    DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP -> "5",
  HoodieCompactionConfig.MIN_COMMITS_TO_KEEP_PROP -> "10",
  HoodieCompactionConfig.MAX_COMMITS_TO_KEEP_PROP -> "15",
  HoodieIndexConfig.BLOOM_INDEX_FILTER_TYPE ->
    BloomFilterTypeCode.DYNAMIC_V0.name(),
  DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY ->
    "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "sid",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "date_updated"
)

val df = spark.read.json(basePath)
  .repartition(500)
  .persist(StorageLevel.MEMORY_AND_DISK)

df.write.format("org.apache.hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save(output_path)
```
I have tried reducing `hoodie.keep.min.commits` and
`hoodie.cleaner.commits.retained`, hoping to shrink the data size.
I also reduced `UPSERT_PARALLELISM` from `1500` to `500`.
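For reference, the same tuning can also be expressed as plain string config keys in the options map (a sketch based on the Hudi key names for these settings, not something I have verified separately):

```scala
// Sketch: the tuning above expressed as raw string config keys,
// equivalent to the typed constants used in hudiOptions.
val tuningOptions = Map(
  "hoodie.upsert.shuffle.parallelism" -> "500", // reduced from 1500
  "hoodie.cleaner.commits.retained"   -> "5",
  "hoodie.keep.min.commits"           -> "10",
  "hoodie.keep.max.commits"           -> "15"
)
```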
**To Reproduce**
Steps to reproduce the behavior:
1. Create a new Hudi table
2. Load `1.5 GB` of data (around `20 million` rows)
3. Use the Hudi and Spark configs provided above
4. Run the Spark job with `30 executors, 150 total cores, 32 GB memory per
executor`
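The resource setup above corresponds to a submit command along these lines (the class and jar names are hypothetical placeholders; 150 total cores across 30 executors works out to 5 cores per executor):

```shell
# Sketch of a spark-submit matching the stated resources.
# The --class and jar names are placeholders, not from the actual job.
spark-submit \
  --num-executors 30 \
  --executor-cores 5 \
  --executor-memory 32g \
  --class com.example.HudiUpsertJob \
  hudi-upsert-job.jar
```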
**Expected behavior**
The upsert of a single partition takes around `3 hours`.
**Environment Description**
* Hudi version : 0.7.0
* Spark version : 2.4.4
* Hive version : 2.3.5
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Additional context**
Please let me know which additional Spark logs would be helpful and I can
provide them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]