I think I understand your scenario, at least I hope I made some progress...
The challenge is to coordinate river execution across independent nodes, especially during a critical cluster restart phase. Because river instances are autonomous by design, the result is unbalanced: the ES RiversRouter does not care how many river instances are "allowed" on a node, so it happily starts 10 rivers on a single node that were previously spread over 5 nodes.

To address the affinity problem, I hope to make progress on a "gatherer" framework that can distribute jobs among many nodes: https://github.com/jprante/elasticsearch-gatherer

Each gatherer executes jobs in a thread pool. To avoid piling jobs onto a single gatherer node (e.g. in a critical cluster restart phase), I plan to let a reschedule thread check the cluster state for available gatherer nodes and their capacities (which can be the job queue length or the system load). Jobs, once started, are owned by gatherer nodes and added to a waiting queue. The queue can be examined, and jobs can also be reassigned to another gatherer with more available capacity. That is the basic idea.

Sorry, I know the gatherer project does not help quickly to improve the current river installations. Just my 2c to let you know about one of my spare time activities ;)

Jörg
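To make the rescheduling idea a bit more concrete, here is a minimal sketch in Java. It is not code from the gatherer project: the names GathererNode and Rescheduler, and the choice of a fixed queue length limit as the "capacity", are assumptions made up for illustration. A real reschedule thread would read capacities from the cluster state instead of an in-memory list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a gatherer node: a name, a capacity limit,
// and the queue of jobs waiting on that node.
final class GathererNode {
    final String name;
    final int capacity; // assumed metric: max number of queued jobs
    final Deque<String> waiting = new ArrayDeque<>();

    GathererNode(String name, int capacity) {
        this.name = name;
        this.capacity = capacity;
    }

    // Free slots; negative when the node holds more jobs than it should.
    int free() {
        return capacity - waiting.size();
    }
}

// Hypothetical rescheduler: one pass over all known gatherer nodes.
final class Rescheduler {
    // Moves waiting jobs away from nodes that are over capacity, always to
    // the node with the most free slots. Returns a log of the moves made.
    static List<String> rebalance(List<GathererNode> nodes) {
        List<String> moves = new ArrayList<>();
        for (GathererNode source : nodes) {
            while (source.free() < 0) {
                GathererNode target = nodes.stream()
                        .filter(n -> n != source && n.free() > 0)
                        .max(Comparator.comparingInt(GathererNode::free))
                        .orElse(null);
                if (target == null) {
                    break; // no spare capacity anywhere, leave the rest queued
                }
                String job = source.waiting.pollLast();
                target.waiting.addLast(job);
                moves.add(job + ": " + source.name + " -> " + target.name);
            }
        }
        return moves;
    }
}
```

With one node holding 10 queued jobs and four empty nodes, each with a capacity of 2, a single pass moves 8 jobs and leaves 2 on every node, which is the restart scenario described above. Running the pass periodically would be the job of the reschedule thread.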
