Rap70r commented on issue #3697:
URL: https://github.com/apache/hudi/issues/3697#issuecomment-924937292


   Hello @xushiyan, thank you for getting back to me.
   Just a clarification: the data size above (1714 MB, 1.4 million records) is the usual incremental data size we expect on each upsert cycle. The total size of the entire dataset sitting on S3 for this particular Hudi collection is 6.2 GB, with approximately 60 million records.
   We used to have around 230 partitions, but the time taken by "UpsertPartitioner" increases significantly as each partition grows to over 100 MB. Considering this data size, what do you recommend as an ideal partition count?
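For context, a hedged back-of-envelope calculation on the numbers above (this assumes the 6.2 GB total is spread evenly across partitions, which real skewed data will not be):

```python
# Rough average partition sizes for the scenarios discussed in this thread.
total_gb = 6.2            # total collection size on S3 (from this comment)
partitions_now = 230      # current partition count
partitions_proposed = 5000  # the "5K" option floated below

avg_mb_now = total_gb * 1024 / partitions_now            # ~27.6 MB/partition
avg_mb_proposed = total_gb * 1024 / partitions_proposed  # ~1.3 MB/partition
```

At 5K partitions the average partition would be far smaller than typical Parquet file-size targets (around 100-128 MB), which is worth weighing against any planner speedup.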
   Also, would you recommend increasing the number of partitions to something like 5K and keeping the same instance type? Wouldn't that allow smaller instance types to handle small partitions faster? Or should we reduce the number of partitions and use a larger instance type?
   For your second point, we use 25 task instances of type c5.xlarge (4 vCores, 8 GiB memory). With the above configs, we get around 20 executors. What instance type/size would you recommend for data of this size? I was under the impression that C5 instances are generally recommended for this type of work.
   And for your third point, we are using a parallelism of 3000 (hoodie.upsert.shuffle.parallelism). Should we increase that?
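For reference, this is a minimal sketch of how that option is passed on a PySpark upsert; the table name and path are hypothetical placeholders, and the values mirror the ones mentioned in this thread:

```python
# Hedged sketch of Hudi upsert write options (PySpark style).
hudi_options = {
    "hoodie.table.name": "my_table",                 # hypothetical name
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.upsert.shuffle.parallelism": "3000",     # value from this thread
}

# Typical usage (df and path assumed to exist):
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path")
```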
   And finally, is there a way we can increase the number of files under each partition? Would that help?
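On that last point, the file count per partition in Hudi is usually a byproduct of its file-sizing configs rather than a directly set number. A hedged sketch of the relevant knobs (the byte values below are illustrative defaults, not tuning advice):

```python
# Hudi file-sizing options that influence how many files land in a partition.
file_sizing_options = {
    # Target max size of a data file; a smaller target yields more files.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Files below this size are treated as "small" and receive new
    # records during upserts instead of new files being created.
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
}
```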
   
   Thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

