Hi, I would like to take a look at the Zk timeout configs, a lot of it is currently hardcoded. My plan is to have it based on a system property and go for some consistency across the board.
There is one confusing bit of code I ran into: the ZkCmdExecutor [0]. this needs the zk timeout only to compute the number of retries (and for those having trouble doing the math this gives you 8 retries for 30000ms timeout). First q: does this really need to be based on the timeout? I am thinking a fixed number of retries would be sufficient, happy to be shown otherwise. Second question: retryDelay has an interesting comment attached to it: "retryDelay = 1500L; // 1 second would match timeout, so 500 ms over for padding" if this is the client timeout (which is currently 30s in some cases or 45s in others) this no longer applies. Example: for 30s timeout it will retry 7 times with a linear backoff delay over 21sec, so it's not waiting enough for a timeout and causing 7x more pain to the server if overloaded. If my math is wrong, please correct, I may not have had enough coffee yet :) Bonus question. It feels like a lot of the 'ensureExists' methods do not really fit with an 'executor' style class (which is in reality a 'retry' style class). Perhaps something like ZkMaintenanceUtils would be a better place? Any suggestions on what can be done with this class? I don't have enough experience with the Zk integration to say what is best. I think "leave the code as is" is also a very good option, it's not really blocking the Zk timeout change I am lookin at. best, alex [0] https://github.com/apache/solr/blob/9c911e7337cd1026accc1a825e26906039982328/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkCmdExecutor.java#L47