rubenssoto commented on issue #1878: URL: https://github.com/apache/hudi/issues/1878#issuecomment-663906344
Hi @bvaradar, thank you for your answer. I tried increasing `spark.yarn.executor.memoryOverhead` to 2 GB with the `foreachBatch` option inside `writeStream`, and it worked. With 4 nodes of 4 cores and 32 GB each, the job took 52 minutes. Is that a good time for this hardware configuration? I think it could be better, but I'm very happy.

What is the difference between Spark streaming with and without `foreachBatch`? Am I losing anything important? I tried it because I saw in the Delta Lake docs that they use `foreachBatch` for merges in Spark streaming.

<img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 27" src="https://user-images.githubusercontent.com/36298331/88466311-6b17cd80-cea1-11ea-9dbd-97753a2e6978.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 53" src="https://user-images.githubusercontent.com/36298331/88466313-6eab5480-cea1-11ea-8cb9-0e9a5c30b6c4.png">
<img width="1680" alt="Captura de Tela 2020-07-25 às 18 04 40" src="https://user-images.githubusercontent.com/36298331/88466316-70751800-cea1-11ea-8ec6-23bd69e51b17.png">

Some jobs took more time. Do you know why some jobs created so many tasks? I think it could be more efficient if they wrote with fewer tasks.

Now I will try the same thing with the write operation "upsert", because my dataset could contain some duplicate values and I don't know which files they are in.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
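For anyone landing on this thread, a minimal sketch of the `foreachBatch` + Hudi "upsert" pattern discussed above (the table name, source path, checkpoint location, schema, and key/precombine columns are all assumptions for illustration, not the reporter's actual setup; this needs a running Spark cluster with the Hudi bundle on the classpath):

```python
# Sketch: streaming upsert into a Hudi table via foreachBatch.
# All names/paths below are placeholders, not from the original issue.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("hudi-streaming-upsert").getOrCreate()

def write_to_hudi(batch_df: DataFrame, batch_id: int) -> None:
    # Inside foreachBatch, each micro-batch is written with the *batch*
    # Hudi datasource, so operation "upsert" can deduplicate by record key.
    (batch_df.write.format("hudi")
        .option("hoodie.table.name", "my_table")                  # assumed
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "id")  # assumed key
        .option("hoodie.datasource.write.precombine.field", "ts") # assumed ordering field
        # Task count per write is driven by the shuffle parallelism;
        # lowering it can reduce the "lots of tasks" effect seen above.
        .option("hoodie.upsert.shuffle.parallelism", "16")        # assumed value
        .mode("append")
        .save("s3://my-bucket/hudi/my_table"))                    # assumed path

query = (spark.readStream.format("parquet")
         .schema("id STRING, ts LONG, payload STRING")            # assumed schema
         .load("s3://my-bucket/raw/")                             # assumed source
         .writeStream
         .foreachBatch(write_to_hudi)
         .option("checkpointLocation", "s3://my-bucket/checkpoints/my_table")
         .start())
```

Without `foreachBatch`, `writeStream.format("hudi")` uses the streaming sink directly; `foreachBatch` mainly buys you access to the full batch writer API (and its options) per micro-batch, which is why the Delta Lake docs use the same pattern for merges.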