rohit-m-99 opened a new issue #3821:
URL: https://github.com/apache/hudi/issues/3821


   **Describe the problem you faced**
   Currently running Hudi 0.9.0 in production without a specific partition 
field. We are running on 6 workers, each with 7 cores and 28 GB of RAM. The 
files are stored in S3. 
   
   We run 50 `runs`, each with about `4000` records. We then combine the runs 
into one dataframe and write around 200k records at once using the `upsert` 
operation. Each record has around 280 columns. We see the majority of the time 
being spent in `GettingSmallFiles from partitions`.
   
   ![image (3)](https://user-images.githubusercontent.com/84733594/137831795-8f912112-3ae9-4412-afd7-1c1f688beb46.png)
   ![image (2)](https://user-images.githubusercontent.com/84733594/137831797-9922178f-94f5-4f73-9948-c4ae2988d21a.png)
   ![image (1)](https://user-images.githubusercontent.com/84733594/137831799-504a1153-8236-4cde-8bd8-220ec1a16753.png)
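   For reference, a minimal sketch of what such a non-partitioned upsert write looks like. The table name, record key field, and S3 path below are hypothetical placeholders (the issue does not state them); the `hoodie.*` option keys and the `NonpartitionedKeyGenerator` class are the standard Hudi ones for a table with no partition field:

```python
# Hypothetical Hudi write options for a non-partitioned upsert.
# Table name, record key field, and path are placeholders, not from the issue.
hudi_options = {
    "hoodie.table.name": "my_table",                    # hypothetical
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "uuid",  # hypothetical
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

# combined_df is the union of the 50 per-run dataframes (~200k records total):
# combined_df.write.format("hudi").options(**hudi_options) \
#     .mode("append").save("s3://bucket/table-path")    # hypothetical path
```

   With no partition field, the whole table is a single partition path, so the small-file lookup during the upsert scans every file group in it — which is consistent with most of the write time showing up under `GettingSmallFiles from partitions`.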
   
   * Hudi version : spark_hudi_0.9.0-SNAPSHOT
   
   * Spark version : 3.0.3
   
   * Hadoop version : 3.2.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : K8S


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
