[GitHub] [hudi] vinothchandar opened a new issue #4135: [SUPPORT] Zordering clustering on a moderate size dataset taking large amounts of time.

GitBox Fri, 26 Nov 2021 19:44:41 -0800


vinothchandar opened a new issue #4135:
URL: https://github.com/apache/hudi/issues/4135



   **Describe the problem you faced**
   
   I am experimenting with z-ordering on a 50GB+ dataset locally to understand 
how it behaves end to end. I noticed a large number of stages, and the job is 
pretty slow because of that. I want to confirm whether this is expected. 
   
   
![image](https://user-images.githubusercontent.com/1179324/143666883-da6c64f2-9c1c-49fb-ae44-9f9a941f7116.png)
   
   
![image](https://user-images.githubusercontent.com/1179324/143666898-9b9350f6-fa76-4949-a358-bd064a60e7dc.png)
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Use any 50GB+ dataset. I am using the Amazon reviews dataset: 
https://s3.amazonaws.com/amazon-reviews-pds/readme.html 
   2. Run a bulk_insert with inline clustering enabled 
   
   ```
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.spark.sql.SaveMode._

   // inputPath and outputPath are assumed to be defined earlier in the spark-shell session
   val df = spark.read.parquet(inputPath)
   // Cluster inline after every commit, with layout optimization on the sort columns
   val commonOpts = Map("hoodie.bulk_insert.shuffle.parallelism" -> "10",
                        "hoodie.clustering.inline" -> "true",
                        "hoodie.clustering.inline.max.commits" -> "1",
                        "hoodie.layout.optimize.enable" -> "true",
                        "hoodie.clustering.plan.strategy.sort.columns" -> "product_id,customer_id,review_date")
   df.write.format("hudi").
     option(PRECOMBINE_FIELD.key(), "review_id").
     option(RECORDKEY_FIELD.key(), "review_id").
     option("hoodie.table.name", "amazon_reviews_hudi").
     option(OPERATION.key(), "bulk_insert").
     option(BULK_INSERT_SORT_MODE.key(), "NONE").
     options(commonOpts).
     mode(Overwrite).
     save(outputPath)
   ```
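   
   For readers unfamiliar with z-ordering, the idea behind sorting on several 
columns at once can be sketched as bit interleaving (a Morton key). This is 
only an illustration under simplifying assumptions: `ZOrderSketch` and 
`interleave` are hypothetical names, the inputs are small non-negative 
integers, and this is not Hudi's implementation. The sort columns above are 
not integers, so a real implementation first has to map values to comparable 
bit strings; that step is omitted here.
   
   ```scala
   // Illustrative sketch only, not Hudi code.
   object ZOrderSketch {
     // Interleave the low 16 bits of x and y into one Z-order (Morton) key.
     def interleave(x: Int, y: Int): Long = {
       var z = 0L
       for (i <- 0 until 16) {
         z |= ((x >> i) & 1L) << (2 * i)     // bit i of x goes to even position 2i
         z |= ((y >> i) & 1L) << (2 * i + 1) // bit i of y goes to odd position 2i + 1
       }
       z
     }
   }
   ```
   
   Sorting rows by such a key tends to keep rows that are close in both 
columns near each other, which is what makes the layout useful for filters on 
either column.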
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.10-SNAPSHOT
   
   * Spark version : Apache Spark 3.0 
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : Local filesystem
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   
   


