rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-663906344
Hi @bvaradar, thank you for your answer. I tried increasing `spark.yarn.executor.memoryOverhead` to 2 GB with the `foreachBatch` option inside `writeStream`, and it worked. With 4 nodes of 4 cores and 32 GB each, the job took 52 minutes. Is that a good time for this hardware configuration? I think it could be better, but I'm very happy.

What is the difference between Spark streaming with and without `foreachBatch`? Am I losing anything important? I tried it because I saw in the Delta Lake docs that they use `foreachBatch` for merges in Spark streaming.

<img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 27" src="https://user-images.githubusercontent.com/36298331/88466311-6b17cd80-cea1-11ea-9dbd-97753a2e6978.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 53" src="https://user-images.githubusercontent.com/36298331/88466313-6eab5480-cea1-11ea-8cb9-0e9a5c30b6c4.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 40" src="https://user-images.githubusercontent.com/36298331/88466316-70751800-cea1-11ea-8ec6-23bd69e51b17.png">

Some jobs took more time. Do you know why some jobs created so many tasks? I think it could be more efficient if they wrote with fewer tasks.

Now I will try the same thing with the write operation "upsert", because my dataset could contain some duplicate values and I don't know which files they are in.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
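For anyone landing on this thread, a minimal sketch of the `foreachBatch` + Hudi "upsert" pattern discussed above (the table name, source path, checkpoint location, schema, and key/precombine columns are all assumptions for illustration, not the reporter's actual setup; this needs a running Spark cluster with the Hudi bundle on the classpath):

```python
# Sketch: streaming upsert into a Hudi table via foreachBatch.
# All names/paths below are placeholders, not from the original issue.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("hudi-streaming-upsert").getOrCreate()

def write_to_hudi(batch_df: DataFrame, batch_id: int) -> None:
    # Inside foreachBatch, each micro-batch is written with the *batch*
    # Hudi datasource, so operation "upsert" can deduplicate by record key.
    (batch_df.write.format("hudi")
        .option("hoodie.table.name", "my_table")                  # assumed
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "id")  # assumed key
        .option("hoodie.datasource.write.precombine.field", "ts") # assumed ordering field
        # Task count per write is driven by the shuffle parallelism;
        # lowering it can reduce the "lots of tasks" effect seen above.
        .option("hoodie.upsert.shuffle.parallelism", "16")        # assumed value
        .mode("append")
        .save("s3://my-bucket/hudi/my_table"))                    # assumed path

query = (spark.readStream.format("parquet")
         .schema("id STRING, ts LONG, payload STRING")            # assumed schema
         .load("s3://my-bucket/raw/")                             # assumed source
         .writeStream
         .foreachBatch(write_to_hudi)
         .option("checkpointLocation", "s3://my-bucket/checkpoints/my_table")
         .start())
```

Without `foreachBatch`, `writeStream.format("hudi")` uses the streaming sink directly; `foreachBatch` mainly buys you access to the full batch writer API (and its options) per micro-batch, which is why the Delta Lake docs use the same pattern for merges.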