sumosha commented on issue #12412: URL: https://github.com/apache/hudi/issues/12412#issuecomment-2518205611
@ad1happy2go It does appear there is spill even in the faster commit (which would explain why that job stays consistent at around 15 minutes).

<img width="1692" alt="faster_commit_spill_executors" src="https://github.com/user-attachments/assets/c72dfa84-f8e7-4f25-888d-5b36690086e6">

I haven't been able to recreate the disk spill in my stress testing, so I assume it comes down to the size difference in the underlying table and the files being written (production is around 100 GB now; I started with a fresh table in testing and haven't built it up to a comparable size yet).

I was planning to experiment with the setting mentioned in your guide, `hoodie.memory.merge.fraction`. Does this seem like the right track? I'm also wondering whether a larger instance size is warranted as the table grows (perhaps fewer instances to keep a comparable core count).

We are currently on the default collector in EMR (Parallel) in production. I have switched to G1 (this is JDK 17) in stress testing, though I didn't see much change in the overall commit times. We'll move forward with G1 since it's recommended anyway.
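For reference, here is a minimal sketch of how the two knobs discussed above could be wired together: raising `hoodie.memory.merge.fraction` as a Hudi write option and enabling G1 via Spark executor JVM options. The fraction value, table name, path, and DataFrame are all hypothetical placeholders, not recommendations.

```python
# Hypothetical PySpark fragment tying together the two tunings discussed:
# a larger merge-memory fraction for Hudi and G1GC on the executor JVMs.
# All values and names below are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-merge-fraction-sketch")
    # Switch executor/driver JVMs to G1 (Parallel is the EMR default).
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)

(
    df.write.format("hudi")  # `df` is an existing DataFrame (hypothetical)
    .option("hoodie.table.name", "my_table")  # hypothetical table name
    # Raise the fraction of available memory Hudi may use for the merge
    # during upserts; 0.75 here is an arbitrary example value.
    .option("hoodie.memory.merge.fraction", "0.75")
    .mode("append")
    .save("s3://bucket/path/to/table")  # hypothetical location
)
```

Whether raising the fraction helps depends on where the spill actually occurs; it only governs the merge step, so executor sizing may still matter independently.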
