[ https://issues.apache.org/jira/browse/SOLR-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253820#comment-17253820 ]
Noble Paul commented on SOLR-15052:
-----------------------------------

{quote}Then the {{R5}} update is also going to read the directory listing and execute.{quote}

R5 would have gotten a callback and would have updated the per-replica states anyway. So all we are doing is an extra {{stat}} read, which is extremely cheap.

{quote}With 500 children znodes, getChildren took on my laptop about 10-15ms while getData on a single file with an equivalent amount of text took longer at ~20ms. This came as a surprise to me.{quote}

Reads are not such a big deal. Even plain writes are not a big deal. But CAS writes are, and we want to minimize contention while doing them.

{quote}The multi operation (delete znode, create znode) took about 40ms while the CAS of the text file was faster at 30ms.{quote}

A single CAS is not slow in itself. It is as the number of parallel writes grows that performance degrades dramatically: with thousands of replicas trying to update the same znode via CAS, performance becomes unacceptably low. The {{multi}} approach on individual znodes, by contrast, performs the same whether we have 2 replicas or 20,000.

{quote}The implementation in the PR could easily avoid systematically re-reading the znode children list by attempting the multi operation on the cached PerReplicaStates of the DocCollection{quote}

It already uses the cached data. Yes, it does an extra version check, but that is cheap.

> Reducing overseer bottlenecks using per-replica states
> ------------------------------------------------------
>
>                 Key: SOLR-15052
>                 URL: https://issues.apache.org/jira/browse/SOLR-15052
>             Project: Solr
>          Issue Type: Improvement
>  Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: per-replica-states-gcp.pdf
>
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> This work has the same goal as SOLR-13951: to reduce overseer bottlenecks by keeping replica state updates from going through the overseer into state.json. However, the approach taken here is different from SOLR-13951, and this work supersedes it.
> The design proposed is here: https://docs.google.com/document/d/1xdxpzUNmTZbk0vTMZqfen9R3ArdHokLITdiISBxCFUg/edit
> Briefly,
> # Every replica's state lives in a separate znode nested under state.json, with a name that encodes the replica name, state, and leadership status.
> # An additional children watcher is set on state.json for state changes.
> # On a state change, a ZK multi op deletes the previous znode and creates a new znode with the new state.
> Differences between this and SOLR-13951:
> # In SOLR-13951, we planned to leverage shard terms for per-shard states.
> # As a consequence, the code changes required for SOLR-13951 were massive (we needed a shard-state-provider abstraction introduced throughout the codebase).
> # This approach is a drastically simpler change and design.
> Credits for this design and the PR are due to [~noble.paul]. [~markrmil...@gmail.com], [~noble.paul] and I have collaborated on this effort. The reference branch takes a conceptually similar (but not identical) approach.
> I shall attach a PR and performance benchmarks shortly.
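The contention argument in the comment above can be illustrated with a toy in-memory simulation (not Solr or ZooKeeper code; all names here are made up for illustration). When every replica CAS-updates one shared node, each round of concurrent writers lets only one CAS succeed, so failed attempts grow roughly quadratically with the number of replicas; when each replica writes its own node, there is no shared version to race on and no retries at all:

```python
class ZnodeStore:
    """Toy in-memory stand-in for ZooKeeper data nodes with version counters."""

    def __init__(self):
        self.data = {}  # path -> (value, version); missing path reads as version -1

    def read(self, path):
        return self.data.get(path, (None, -1))

    def cas_write(self, path, value, expected_version):
        """Compare-and-set: succeeds only if the stored version matches."""
        _, version = self.read(path)
        if version != expected_version:
            return False
        self.data[path] = (value, version + 1)
        return True


def update_shared(store, n_replicas):
    """Every replica CAS-updates the single shared node; count failed attempts."""
    retries = 0
    pending = list(range(n_replicas))
    while pending:
        # One contention round: all still-pending replicas read the current
        # version before any of them writes...
        snapshot = {r: store.read("/state.json")[1] for r in pending}
        survivors = []
        for r in pending:
            # ...then each tries to CAS; only the first succeeds per round.
            if not store.cas_write("/state.json", f"replica-{r}", snapshot[r]):
                retries += 1
                survivors.append(r)
        pending = survivors
    return retries


def update_per_replica(store, n_replicas):
    """Each replica writes its own node; no shared version, so no retries."""
    retries = 0
    for r in range(n_replicas):
        path = f"/state.json/r{r}"
        while not store.cas_write(path, "active", store.read(path)[1]):
            retries += 1
    return retries
```

Under this (pessimistic) interleaving, 100 replicas on a shared node pay 100*99/2 = 4950 failed CAS attempts, while 100 per-replica writes pay none; the real degradation depends on the interleaving, but the shape of the curve is the point of the comment.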
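The design sketched in the issue description (state encoded in the znode *name*, changed via an atomic delete+create multi) can also be shown as a minimal sketch. The encoding below ("name:version:state" with an optional ":L" leader suffix) and the `FakeZk` helper are hypothetical stand-ins invented for this illustration, not the actual Solr implementation; a real client would issue the equivalent of ZooKeeper's `multi(Op.delete(old), Op.create(new))`:

```python
from typing import NamedTuple, Optional


class ReplicaState(NamedTuple):
    name: str
    version: int
    state: str
    leader: bool

    def encode(self) -> str:
        # Hypothetical encoding: e.g. "core_node3:2:A:L" (A = active, L = leader).
        suffix = ":L" if self.leader else ""
        return f"{self.name}:{self.version}:{self.state}{suffix}"

    @staticmethod
    def decode(znode_name: str) -> "ReplicaState":
        parts = znode_name.split(":")
        return ReplicaState(parts[0], int(parts[1]), parts[2], len(parts) == 4)


class FakeZk:
    """In-memory stand-in for the children of a collection's state.json znode."""

    def __init__(self):
        self.children = set()

    def multi(self, delete_name: Optional[str], create_name: str) -> None:
        # Mirrors a ZooKeeper multi op: validate first so that either both
        # sub-operations apply or neither does.
        if delete_name is not None and delete_name not in self.children:
            raise KeyError(f"NONODE: {delete_name}")
        if delete_name is not None:
            self.children.discard(delete_name)
        self.children.add(create_name)


def switch_state(zk: FakeZk, old: ReplicaState, new_state: str,
                 leader: bool = False) -> ReplicaState:
    """Replace a replica's state znode in a single atomic delete+create."""
    new = ReplicaState(old.name, old.version + 1, new_state, leader)
    zk.multi(old.encode(), new.encode())
    return new
```

Because the state is carried entirely in the child name, observers need only a children watch on state.json (plus the cheap {{stat}}/version check the comment mentions) rather than re-reading and CAS-rewriting one large shared document.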