Zk timeouts question on ZkCmdExecutor (SOLR-14667)

Alex Deparvu Mon, 07 Aug 2023 11:12:42 -0700

Hi,

I would like to take a look at the Zk timeout configs, a lot of it is
currently hardcoded.
My plan is to have it based on a system property and go for some
consistency across the board.

There is one confusing bit of code I ran into: the ZkCmdExecutor [0]. this
needs the zk timeout only to compute the number of retries (and for those
having trouble doing the math this gives you 8 retries for 30000ms timeout).
First q: does this really need to be based on the timeout? I am thinking a
fixed number of retries would be sufficient, happy to be shown otherwise.

Second question: retryDelay has an interesting comment attached to it:
"retryDelay = 1500L; // 1 second would match timeout, so 500 ms over for
padding"
if this is the client timeout (which is currently 30s in some cases or 45s
in others) this no longer applies.
Example: for 30s timeout it will retry 7 times with a linear backoff delay
over 21sec, so it's not waiting enough for a timeout and causing 7x more
pain to the server if overloaded.
If my math is wrong, please correct, I may not have had enough coffee yet :)

Bonus question. It feels like a lot of the 'ensureExists' methods do not
really fit with an 'executor' style class (which is in reality a 'retry'
style class). Perhaps something like ZkMaintenanceUtils would be a better
place?

Any suggestions on what can be done with this class? I don't have enough
experience with the Zk integration to say what is best. I think "leave the
code as is" is also a very good option, it's not really blocking the Zk
timeout change I am lookin at.

best,
alex

[0]
https://github.com/apache/solr/blob/9c911e7337cd1026accc1a825e26906039982328/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ZkCmdExecutor.java#L47

Zk timeouts question on ZkCmdExecutor (SOLR-14667)

Reply via email to