rubenssoto commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-663906344


   Hi @bvaradar, thank you for your answer.
   
   I tried increasing spark.yarn.executor.memoryOverhead to 2GB with the foreachBatch option inside writeStream, and it worked. With 4 nodes of 4 cores and 32GB each, it took 52 minutes. Is that a good time for this hardware configuration? I think it could be better, but I'm very happy.
   What's the difference between Spark Streaming with and without foreachBatch? Am I losing anything important? I tried it because I saw in the Delta Lake docs that they use foreachBatch for merges in Spark Streaming.
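   
   In case it helps, this is roughly what my foreachBatch looks like (a minimal sketch: the rate source, table name, and S3 paths are placeholders standing in for my real job):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("hudi-foreachbatch-sketch").getOrCreate()

// Stand-in for the real input: the rate source emits (timestamp, value) rows.
val input = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "100")
  .load()
  .withColumn("dt", to_date(col("timestamp"))) // derived partition column

input.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch is written to Hudi as an ordinary batch job.
    batchDF.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "my_table")                         // placeholder
      .option("hoodie.datasource.write.recordkey.field", "value")
      .option("hoodie.datasource.write.precombine.field", "timestamp")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .mode("append")
      .save("s3://my-bucket/hudi/my_table")                            // placeholder
  }
  .option("checkpointLocation", "s3://my-bucket/checkpoints/my_table") // placeholder
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()
  .awaitTermination()
```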
   
   <img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 27" 
src="https://user-images.githubusercontent.com/36298331/88466311-6b17cd80-cea1-11ea-9dbd-97753a2e6978.png";>
   <img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 53" 
src="https://user-images.githubusercontent.com/36298331/88466313-6eab5480-cea1-11ea-8cb9-0e9a5c30b6c4.png";>
   <img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 40" 
src="https://user-images.githubusercontent.com/36298331/88466316-70751800-cea1-11ea-8ec6-23bd69e51b17.png";>
   
   
   Some jobs took more time. Do you know why some jobs created so many tasks? I think the writes could be more efficient with fewer tasks.
   Now I will try to do the same thing with the write operation "upsert", because my dataset could have some duplicate values and I don't know which files they are in.
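   
   The change I plan to test is just switching the operation inside the same foreachBatch body. A sketch, assuming my real columns are id (the dedup key) and ts (the "latest wins" field); hoodie.upsert.shuffle.parallelism is the knob I believe drives the task counts I asked about above:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: same per-batch write as above, with the operation
// switched to "upsert" so rows sharing the same record key collapse to one
// record, using the precombine field to pick which version wins.
def upsertBatch(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "my_table")                     // placeholder
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")     // assumed dedup key column
    .option("hoodie.datasource.write.precombine.field", "ts")    // assumed ordering column
    .option("hoodie.datasource.write.partitionpath.field", "dt") // assumed partition column
    .option("hoodie.upsert.shuffle.parallelism", "64")           // below Hudi's large default, to cut task count
    .mode("append")
    .save("s3://my-bucket/hudi/my_table")                        // placeholder
}
```

   Passing `upsertBatch _` to `.foreachBatch` would then replace the inline lambda in the sketch above.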
   
   

