FeiZou opened a new issue #3418:
URL: https://github.com/apache/hudi/issues/3418
**Describe the problem you faced**
Hi there, I'm migrating a table from an S3 data lake to a Hudi data lake using
Spark. The source table is around `600 GB` and `8 billion rows`; each
partition contains around `1.5 GB` of data and `20 million rows`. The target Hudi
table is non-partitioned and currently around `260 GB`. With `30 executors,
150 total cores, 32 GB memory per executor`, it takes more than `3 hours`
to upsert a single partition into the Hudi table. If I reduce the executor count
to `15`, the job fails with a `No Space Left On Device` error during the
upsert. (We are using EC2 instances with `300 GB` EBS volumes as Spark workers.)
The Hudi and Spark configs I'm currently using are below:
```scala
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> "hudi_recordings",
  HoodieWriteConfig.WRITE_PAYLOAD_CLASS ->
    "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload",
  DataSourceWriteOptions.OPERATION_OPT_KEY ->
    DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
  HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP -> "5",
  HoodieCompactionConfig.MIN_COMMITS_TO_KEEP_PROP -> "10",
  HoodieCompactionConfig.MAX_COMMITS_TO_KEEP_PROP -> "15",
  HoodieIndexConfig.BLOOM_INDEX_FILTER_TYPE ->
    BloomFilterTypeCode.DYNAMIC_V0.name(),
  DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY ->
    "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "sid",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "date_updated"
)

val df = spark.read.json(basePath)
  .repartition(500)
  .persist(StorageLevel.MEMORY_AND_DISK)

df.write.format("org.apache.hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save(output_path)
```
I have tried reducing `hoodie.keep.min.commits` and
`hoodie.cleaner.commits.retained`, hoping to shrink the data size.
I also reduced `UPSERT_PARALLELISM` from `1500` to `500`.
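For reference, the same tuning can also be expressed as plain string config keys in the options map (a sketch based on the Hudi key names for these settings, not something I have verified separately):

```scala
// Sketch: the tuning above expressed as raw string config keys,
// equivalent to the typed constants used in hudiOptions.
val tuningOptions = Map(
  "hoodie.upsert.shuffle.parallelism" -> "500", // reduced from 1500
  "hoodie.cleaner.commits.retained"   -> "5",
  "hoodie.keep.min.commits"           -> "10",
  "hoodie.keep.max.commits"           -> "15"
)
```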
**To Reproduce**
Steps to reproduce the behavior:
1. Create a new Hudi table
2. Load `1.5 GB` of data (around `20 million` rows)
3. Use the Hudi and Spark configs provided above
4. Run the Spark job with `30 executors, 150 total cores, 32 GB memory per
executor`
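The resource setup above corresponds to a submit command along these lines (the class and jar names are hypothetical placeholders; 150 total cores across 30 executors works out to 5 cores per executor):

```shell
# Sketch of a spark-submit matching the stated resources.
# The --class and jar names are placeholders, not from the actual job.
spark-submit \
  --num-executors 30 \
  --executor-cores 5 \
  --executor-memory 32g \
  --class com.example.HudiUpsertJob \
  hudi-upsert-job.jar
```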
**Expected behavior**
The upsert of a single partition takes around `3 hours`.
**Environment Description**
* Hudi version : 0.7.0
* Spark version : 2.4.4
* Hive version : 2.3.5
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Additional context**
Please let me know which additional Spark logs would be helpful and I can
provide them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]