Hi Mike, Thanks for your thoughtful responses and the structured manner of your composing this mail (with clear labels as to your intent) is a refreshing change. I think we all should learn from this in future communication.
On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]> wrote: > Replying to the top post in this thread because there has been a lot of > discussion and I don't want to look like I'm continuing any of those > particular threads. > > I finally had time to sit down and think about this with the attention it > deserves and am generally happy with how the conversation has shaped the > current proposal. > > GOOD: I think using system properties to define node roles is fine and I > like that data is the default role when not defined. I think it is > important to hold on to the guarantee that an active overseer will land on > an overseer node role. > CHANGE REQUEST: I would like to see a migration path for folks using the > current OVERSEER role. I am not sure that something can be done > automatically since they need to now specify new properties at startup. > Maybe we need to include loud warnings or support both approaches for a > time? > As of this SIP, we're not planning to modify the OVERSEER role (which currently stands for preferred overseer). We can take a stab at refactoring it later. CHANGE REQUEST: I do not like that if all of the overseer nodes fail, then > it is implied the overseer will go to one of the data nodes. The specific > wording in the SIP - "When one or more such nodes are live, Solr guarantees > that one of those nodes become the overseer." implies to me that failover > could go from overseer1 to overseer2 to overseerN to random node. I feel > like we need to have some recording that there were dedicated overseer > nodes and stop the cascading failure instead of churning through our data > nodes. > When I wrote "When one or more such nodes [with OVERSEER role] are live, Solr guarantees that one of those nodes becomes the overseer.", I meant to somewhat capture the current behaviour as the OVERSEER role performs today. Do you see any inconsistency with this statement vs. what it does today? > > CLARIFICATION: I am slightly confused by the proposed scope of > "coordinator" roles from a split query/indexing standpoint. I understand > that these are used as examples, but would like stronger language that new > roles should also go through their own SIP discussions. > We are waiting for this roles SIP to conclude before starting a discussion around the coordinator role. At this point, I don't know if it should go via a SIP or not, but I agree in principle that it should go via thorough community vetting for design and implementation. I would personally prefer design discussions to happen in JIRA instead of a mailing list. > > CLARIFICATION: I do not like that we are storing node liveness in two > different places now. We have the live nodes and we have the node roles > stored in two different places in zookeeper and it feels like this would > lead to race conditions or split brain or other hard to diagnose bugs when > those two lists don't agree with each other. This also feels like it > contradicts the "single source of truth" idea later stated in the proposal. > I see Gus's arguments for decoupling these and am not strongly opposed, I > just get a lurking feeling about it. Even if we don't do this, I would like > this called out explicitly in the alternative approaches section as > something that we considered and rejected, with details why, > > GOOD: The API looks pretty clear. I would like an additional call out here > that all operations are GET because nodes cannot be changed at runtime. > +1, all these calls are GET and do not change the state of the system in any way. CLARIFICATION: How does this interact with the previous OVERSEER preference > role? > Except for deprecating the existing ADDROLE/REMOVEROLE runtime APIs, it does nothing to deal with current day OVERSEER role. We can change that definition/behaviour at a later point, if we think that's useful. > CHANGE REQUEST: An additional API to get the list of available roles for a > cluster. I _think_ this could be based on the version that the cluster is > running? Would be useful to be able to interrogate a cluster in the > future... we're seeing OOM issues on queries, can we add some query nodes? > When were they introduced? I don't know what path this API should exist at. > +1, this is a great suggestion. Will add to the proposal. > CLARIFICATION: Can we list the APIs to clearly show which parts are string > literals and which parts are meant to be substituted by the operator? > *GET **/api/cluster/roles/data *would become *GET > **/api/cluster/roles/${rolename} > *in our SIP/documentation. > CHANGE REQUEST: I think *GET /api/cluster/roles/nodes/node1* should be *GET > /api/cluster/roles/${nodename}* dropping the intermediate "nodes" > CHANGE REQUEST: The ZK structure also might not need that intermediate > "nodes" node. > > CLARIFICATION: Should listing roles require some permissions? Maybe this > requirement is too fundamental to the operation of a cluster and everybody > would have to be able to do it. > I have no opinion or preference on this. I don't see any security vulnerabilities in exposing the roles APIs to everyone. > CLARIFICATION: How do we expect SolrJ (and other clients) to treat roles? > Implementation detail that the servers will figure out? Or strict guidance > where the client needs to check where specific roles are before sending any > further communication to the server? > I think this would be dependent on the roles. For the coordinator role, I expect us to propose a way for the SolrJ client to be able to route the queries via coordinator nodes. Right now, I don't forsee the need for a strict guidance, since the internal representation of the roles in ZK and the roles API should suffice in supporting any modifications to SolrJ for future roles. > CLARIFICATION: What happens when a node gets a request that it can't > fulfil? An overseer node gets a query or an update. A data node gets a > collection creation request. Do they forward it on to an appropriate node, > or do they reject it? Should this be configurable? If not, then it seems > like lazy or poorly configured clients will defeat this isolation system > quite easily. > I think that would depend on the functionality exposed by the role. As an example, the coordinator nodes can, initially, support only query requests and in that case indexing requests to that server can either be forwarded or dropped (we can decide this collectively when we discuss the coordinator role and corresponding functionality). > > GOOD: Testing the API is very important, yes. > CLARIFICATION: What does testing for how nodes behave when roles are added > mean? I thought we established that they are not dynamic. > Yes, roles are not dynamic. Here's a preview for a test (more can be done in this test, and this test is out of date as it was written with a previous iteration of this SIP proposed design). https://github.com/apache/solr/blob/104cf95f983a9e86a2b85c81b211030beaa68134/solr/core/src/test/org/apache/solr/cloud/NodeRolesTest.java This is part of a draft WIP to implement this SIP: https://github.com/apache/solr/pull/403/files (out of date at the moment). Thanks, > Mike > > On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya < > [email protected]> wrote: > >> Hi, >> >> Here's an SIP for introducing the concept of node roles: >> https://issues.apache.org/jira/browse/SOLR-15694 >> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles >> >> We also wish to add first class support for Query nodes that are used to >> process user queries by forwarding to data nodes, merging/aggregating them >> and presenting to users. This concept exists as first class citizens in >> most other search engines. This is a chance for Solr to catch up. >> https://issues.apache.org/jira/browse/SOLR-15715 >> >> Regards, >> Ishan / Noble / Hitesh >> >
