tgravescs commented on pull request #29413: URL: https://github.com/apache/spark/pull/29413#issuecomment-673474159
> We can think of making it per queue basis any such approach based on the user requirement like I have already done in this PR using the conf (spark.set.optmized.event.queue) for VariableLinkedBlockingQueue at the start of application.

I don't get your point here. It can already be configured on a per-queue basis (set `spark.scheduler.listenerbus.eventqueue.$name.capacity`), and if the user sets `spark.set.optmized.event.queue`, why not just set `spark.scheduler.listenerbus.eventqueue.$name.capacity` large enough to match whatever the driver memory is set at? I'm not necessarily against a change here, and I get the issues with dropping events, but this just feels like extra code to do what the user can already do. If I'm missing something and you have a specific use case in mind, please explain in more detail.

The executor management queue is definitely a problem, and if we keep seeing issues there, I personally think we should integrate the dynamic allocation manager into the scheduler and get away from using messages.

As for the app status queues, I hadn't thought about things like Zeppelin consuming those messages, but that makes sense; in a normal app they shouldn't affect the application run. The only time I've seen issues with these queues is generally when dynamic allocation is enabled, you are running a huge number of tasks, dynamic allocation is requesting lots of executors, and task times are short. Basically at the startup point, when we request executors, we get messages from them starting plus task start and end messages all arriving at the same time. I'd be curious to hear what conditions are causing this for you. Part of it is making sure we haven't regressed and something isn't taking a long time to process events; it's not ideal, but we have fixed certain components that did this and it has helped in the past.

I don't really see a solution to this problem unless you can both change the capacity dynamically and reliably measure the driver's memory. The problem is that, depending on the GC and tunings being used, the heap will grow up to a certain point even if it didn't strictly need to, so I don't think those metrics are really reliable. I know in the past things like the runtime stats don't increment all the time; a certain amount has to be allocated first. The other problem is that if the heap really is heavily used, you can't grow it dynamically, so it goes back to the user to increase the heap size, at which point they might as well just increase the queue size. So even this is not ideal.

Another thing we could do, if there are certain critical events, is look at putting them into their own queue as well.

Perhaps another question is: is your driver just running out of CPU to process these events fast enough?
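For reference, a minimal sketch of what I mean by just raising the existing per-queue capacity confs at application start; the queue names follow Spark's documented `spark.scheduler.listenerbus.eventqueue.$name.capacity` pattern, and the sizes shown are illustrative only, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

object ListenerQueueCapacityExample {
  def main(args: Array[String]): Unit = {
    // Sketch: raise the listener bus queue capacities when the application starts,
    // instead of introducing a new queue implementation. Sizes are examples only
    // and should be chosen to fit the driver heap.
    val spark = SparkSession.builder()
      .appName("listener-queue-capacity-example")
      // Default capacity applied to every listener event queue.
      .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
      // Per-queue overrides for the queues discussed above.
      .config("spark.scheduler.listenerbus.eventqueue.appStatus.capacity", "40000")
      .config("spark.scheduler.listenerbus.eventqueue.executorManagement.capacity", "40000")
      .getOrCreate()

    // ... run the job ...
    spark.stop()
  }
}
```

The same keys can equivalently be passed as `--conf` flags to `spark-submit`. They only take effect if set before the SparkContext (and thus the listener bus) is created, so they cannot be changed later at runtime, which is part of the sizing problem discussed above.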
