tgravescs commented on pull request #29413:
URL: https://github.com/apache/spark/pull/29413#issuecomment-673474159


   > We can think of making it per queue basis any such approach based on the 
user requirement like I have already done in this PR using the 
conf(spark.set.optmized.event.queue) for VariableLinkedBlockingQueue at the 
start of application.
   
   I don't get your point here. It can already be configured on a per-queue basis (set "spark.scheduler.listenerbus.eventqueue.$name.capacity"), and if the user is going to set spark.set.optmized.event.queue anyway, why not just set spark.scheduler.listenerbus.eventqueue.$name.capacity to something larger that matches whatever the driver memory is set at?
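   For example (just a sketch; the "appStatus" queue name and the 30000 value are illustrative, not a recommendation), the existing per-queue capacity can be set like any other conf:

```scala
import org.apache.spark.SparkConf

// Bump only the appStatus queue; other queues keep the default
// spark.scheduler.listenerbus.eventqueue.capacity (10000).
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.appStatus.capacity", "30000")
```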
   I'm not necessarily against a change here, and I get the issues with dropping events, but this just feels like extra code to do what the user can already do. If there is a specific use case I'm missing, please explain it in more detail.
   
   So the executor management queue is definitely a problem, and if we keep seeing issues, I personally think we should integrate the dynamic allocation manager into the scheduler and get away from using messages. For the app status queues, I hadn't thought about things like Zeppelin consuming those messages, but that makes sense; in an ordinary application, dropping them shouldn't affect the run itself.
   
   The only time I've seen issues with these queues is generally when dynamic allocation is enabled, you are running a huge number of tasks, dynamic allocation is requesting lots of executors, and task times are low. Basically at the startup point, when we request executors, we get messages from them starting up plus task start and end messages, all at the same time. I'd be curious to hear what conditions are causing this for you. Part of that is to make sure we haven't regressed and something isn't taking a long time to process events. It's not ideal, but we have fixed certain components that did this and it has helped in the past.
   
   I don't really see a solution to this problem unless you can change the capacity dynamically and reliably get the memory of the driver. The problem is that (depending on the GC and tunings being used) the heap will grow to use memory up to a certain point even if it didn't necessarily need to, so I don't think getting those metrics is really reliable. I know from the past that things like the runtime stats don't increment all the time; a certain amount has to be allocated first. The other problem is that if the heap really is heavily used you can't change anything dynamically, so it goes back to the user having to increase the heap size, and at that point they might as well increase the queue size. So even this is not ideal.
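   Just to illustrate what I mean about the runtime stats (rough sketch; the per-event size and the 10% figure are made up, not a proposal): totalMemory() only reflects what the JVM has grown the heap to so far, so anything derived from it looks very different before and after the heap expands, and none of it helps once the heap is genuinely full:

```scala
// Rough sketch: deriving a queue capacity from Runtime heap stats.
// The constants are illustrative only.
val rt        = Runtime.getRuntime
val maxHeap   = rt.maxMemory()    // -Xmx, fixed for the life of the JVM
val committed = rt.totalMemory()  // what the JVM has allocated so far; grows over time
val used      = committed - rt.freeMemory()

// Assume ~1 KB per buffered event and spend at most 10% of the unused max heap.
val assumedEventSize = 1024L
val derivedCapacity  = ((maxHeap - used) / 10 / assumedEventSize).toInt

// Early on `used` is small so this looks generous; once the heap really is
// full, the only fix is a bigger -Xmx, at which point the user could just
// as easily have raised the queue capacity conf directly.
```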
   
   Another thing we can do is, if certain events are critical, look at putting them into their own queue as well. Perhaps another question: is your driver just running out of CPU to process these fast enough?

