If we go with no negative node roles and the overseer node role is not strict (i.e. it’s a "preferred overseer"), then one would need to define a second node role "no_overseer" to explicitly exclude a node from ever becoming overseer (which I think is a useful feature until we switch the cluster default to not using the overseer). The implementation of these two node roles would obviously be coupled (and what if a node has both defined?).
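To make that coupling concrete, here is a minimal sketch. It assumes two hypothetical role names, "overseer" (preferred) and "no_overseer" (excluded), and a hypothetical `resolve` helper; none of this is existing Solr code. It only shows that the two roles cannot be interpreted independently and that the "both defined" case needs an explicit decision (here, an error).

```java
import java.util.Set;

public class OverseerRoleSketch {
    // Hypothetical result of interpreting the two coupled roles for one node.
    enum Eligibility { PREFERRED, ALLOWED, NEVER }

    // Hypothetical role names taken from the discussion; not an existing Solr API.
    static final String OVERSEER = "overseer";
    static final String NO_OVERSEER = "no_overseer";

    // Decide overseer eligibility from the set of roles declared on a node.
    static Eligibility resolve(Set<String> roles) {
        boolean preferred = roles.contains(OVERSEER);
        boolean excluded = roles.contains(NO_OVERSEER);
        if (preferred && excluded) {
            // One possible answer to "what if a node has both defined?": refuse to start.
            throw new IllegalArgumentException(
                "node declares both '" + OVERSEER + "' and '" + NO_OVERSEER + "'");
        }
        if (excluded) {
            return Eligibility.NEVER;
        }
        // Without either role the node can still be elected, just not preferentially.
        return preferred ? Eligibility.PREFERRED : Eligibility.ALLOWED;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Set.of("data")));
        System.out.println(resolve(Set.of("data", OVERSEER)));
        System.out.println(resolve(Set.of("data", NO_OVERSEER)));
    }
}
```

Rejecting the conflicting combination is just one choice; silently letting one role win would work too, but either way the two role implementations have to know about each other.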
I prefer strict node roles. Maybe we could have node roles with [optional] parameters to let the node role implementation decide? The overseer node role, for example, could have one of 3 values defined for each node: "preferred" (default, equivalent to the existing overseer role), "accepted" (equivalent to currently not defining the overseer role) and "no_way" (does not exist today).

This could be useful in other contexts. A node role "data" could be "fast" or "slow" depending on the type of local persistent storage, for example…

Ilan

On Fri 3 Dec 2021 at 16:10, Gus Heck <gus.h...@gmail.com> wrote:

> I really don't think we should have types of roles. Not negative/positive and not strict/non-strict. You have a role or you don't. What that means is up to the code implementing the role.
>
> Roles should be free to configure a preference order (binary, or n-ary or whatever, strict or loose), prohibit behavior, or enable behavior. In this SIP I feel we should focus on How to identify what node has what role, How to designate what roles a node has via config/params, and the API's for interacting with roles.
>
> We should for example be able to support roles such as
>
> PREFERRED_OVERSEER
> DATA
> NO_ROUTED_ALIAS (just an example, not something I mean to suggest)
>
> Details about role implementation should probably be discussed in a thread about that role. Obviously we should think about the name carefully to leave options open should we want to enhance things later so maybe
>
> OVERSEER_PREF or just OVERSEER
>
> would be better since it merely reads that the node implements some sort of preference or config regarding overseer...
> but all this can be decided on a per role basis
>
> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <noble.p...@gmail.com> wrote:
>
>> Negative roles have a place
>>
>> Example is overseer
>>
>> There are 3 possible choices for that role
>>
>> a) preferred: always be in front of the election queue
>> b) on: not preferred, but can be an overseer if no preferred overseer nodes are available
>> c) off: never become an overseer
>>
>> Today we only have options 'a' and 'b'. In a future ticket, we may implement C
>>
>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <md...@mdrob.com> wrote:
>>
>>> Negative roles add a lot of complexity, I would really want to stay away from them. That’s why I want strict roles up front. It’s maybe ok to push this decision out, but it also seems like the sort of thing we should consider at the start.
>>>
>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <noble.p...@gmail.com> wrote:
>>>
>>>> Yes. Negative roles is not a bad idea. If I start a node for machine learning purposes, I wouldn't want that node to ever participate in overseer election
>>>>
>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>>>>
>>>>> If we have non strict roles (like overseer), then it does make sense to have negative roles.
>>>>> That way I can define which are the two nodes that I'd prefer the overseer to run on, and a few other nodes on which it should definitely never run for various reasons. And in case these "!overseer" are the only nodes left in the cluster, let the cluster fail the same way it would if there were no data nodes available.
>>>>>
>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <houstonput...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> With the Strict/Loose option and sensible defaults, users cannot trip themselves up by default, but the option is there for people to tinker and have an iron grip over their cluster.
>>>>> >>
>>>>> >> +1 to sensible defaults so users don't trip themselves. The option to tinker for tighter grip can be tackled later, either on a per role basis or as a generic concept later.
>>>>> >
>>>>> > +1 - Can definitely be added later if we so desire, not needed for this SIP
>>>>> >
>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>>> >>
>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <gus.h...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> I think the key is to let the roles have full control of the implications of having/not having that role. No need for even a strict/loose designation. The question of do you have the role is yes/no with no logic to guess if the role is implied or not, The question of will it come up with the role is "have_explicit ? use_defaults : use_defaults.
>>>>> >>>
>>>>> >>> Once you figure out who has a role (or not) what that means is up to the role code.
>>>>> >>>
>>>>> >>> Corollary: we don't have to change the way overseer works in this SIP. We can rework it or not as we see fit separately.
>>>>> >>
>>>>> >> +1
>>>>> >>
>>>>> >>> Only thing we need to do is find a wording that makes the above clear on first read through the SIP :)
>>>>> >>>
>>>>> >>> -Gus
>>>>> >>>
>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <houstonput...@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>> This doesn't really address my concern around what happens if all of our existing OVERSEER candidates are down. When at least one of them is up, the overseer will go there, and that is good and expected. But what happens if all of the overseer eligible nodes are down. Your comment, and the old system, would imply that the overseer election goes to some other unrelated, untagged node. I disagree with this implementation choice.
>>>>> >>>>> This sounds like something role specific to determine, but I would like to see us be more strict about it. I don't want cores leaking out of my data roles, I don't want query processing to leak out of my "query" nodes or whatever. Overseer shouldn't be special in this regard.
>>>>> >>>>
>>>>> >>>> I'm very strongly in favor of not letting users design a system in which the cluster can be "live" without an overseer. I understand that the overseer can be taxing to the cluster, but honestly what is the point of having an untaxed cluster that doesn't have an overseer? I can see arguments for the other roles to be stricter about this, but there are also a lot of users who wouldn't want those to be strict either (like "query" nodes).
>>>>> >>>>
>>>>> >>>> Maybe we just put in stronger guarantees that if a non-overseer role node HAS to be selected to become overseer, it will try to migrate the overseer job to a node with the overseer role whenever one becomes live.
>>>>> >>>>
>>>>> >>>> So maybe we don't have special rules per role, but instead roles can either be defined as "Strict" or "Loose" (better names likely exist), and the roles come with a default (Overseer -> Loose, Data -> Strict, Query -> Loose, etc.). And it is up to each role to define how to behave when running in LOOSE mode and a non-role node is used then a role node comes online (like the overseer example given above).
>>>>> >>>>
>>>>> >>>> With the Strict/Loose option and sensible defaults, users cannot trip themselves up by default, but the option is there for people to tinker and have an iron grip over their cluster.
>>>>> >>>>
>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <md...@mdrob.com> wrote:
>>>>> >>>>>
>>>>> >>>>> Noble wrote:
>>>>> >>>>> > We are not modifying the way the "overseer role" works today.
>>>>> >>>>> > We are just changing the definition and standardizing the configuration & discoverability
>>>>> >>>>> Ishan wrote:
>>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER role (which currently stands for preferred overseer). We can take a stab at refactoring it later.
>>>>> >>>>>
>>>>> >>>>> Grouping these two comments together, since I think they are saying the same thing. I think this is part of my confusion. We have an old system that doesn't work the way we want the new system to work. There may be people already using the old system. What path do we offer for folks using the old system to migrate to the new system? What happens if somebody accidentally tries to use both systems at the same time?
>>>>> >>>>>
>>>>> >>>>> Ishan wrote:
>>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER role] are live, Solr guarantees that one of those nodes becomes the overseer.", I meant to somewhat capture the current behaviour as the OVERSEER role performs today. Do you see any inconsistency with this statement vs. what it does today?
>>>>> >>>>>
>>>>> >>>>> This doesn't really address my concern around what happens if all of our existing OVERSEER candidates are down. When at least one of them is up, the overseer will go there, and that is good and expected. But what happens if all of the overseer eligible nodes are down. Your comment, and the old system, would imply that the overseer election goes to some other unrelated, untagged node. I disagree with this implementation choice. This sounds like something role specific to determine, but I would like to see us be more strict about it. I don't want cores leaking out of my data roles, I don't want query processing to leak out of my "query" nodes or whatever. Overseer shouldn't be special in this regard.
>>>>> >>>>>
>>>>> >>>>> Noble wrote:
>>>>> >>>>> > If we do that how do we know if xyz is a role or a node in the following request?
>>>>> >>>>>
>>>>> >>>>> You're absolutely correct, thanks for pointing this out. Let's leave it as is.
>>>>> >>>>>
>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <md...@mdrob.com> wrote:
>>>>> >>>>>>>
>>>>> >>>>>>> Replying to the top post in this thread because there has been a lot of discussion and I don't want to look like I'm continuing any of those particular threads.
>>>>> >>>>>>>
>>>>> >>>>>>> I finally had time to sit down and think about this with the attention it deserves and am generally happy with how the conversation has shaped the current proposal.
>>>>> >>>>>>>
>>>>> >>>>>>> GOOD: I think using system properties to define node roles is fine and I like that data is the default role when not defined. I think it is important to hold on to the guarantee that an active overseer will land on an overseer node role.
>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for folks using the current OVERSEER role. I am not sure that something can be done automatically since they need to now specify new properties at startup. Maybe we need to include loud warnings or support both approaches for a time?
>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes fail, then it is implied the overseer will go to one of the data nodes. The specific wording in the SIP - "When one or more such nodes are live, Solr guarantees that one of those nodes become the overseer." implies to me that failover could go from overseer1 to overseer2 to overseerN to random node.
>>>>> >>>>>>> I feel like we need to have some recording that there were dedicated overseer nodes and stop the cascading failure instead of churning through our data nodes.
>>>>> >>>>>>>
>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of "coordinator" roles from a split query/indexing standpoint. I understand that these are used as examples, but would like stronger language that new roles should also go through their own SIP discussions.
>>>>> >>>>>>>
>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node liveness in two different places now. We have the live nodes and we have the node roles stored in two different places in zookeeper and it feels like this would lead to race conditions or split brain or other hard to diagnose bugs when those two lists don't agree with each other. This also feels like it contradicts the "single source of truth" idea later stated in the proposal. I see Gus's arguments for decoupling these and am not strongly opposed, I just get a lurking feeling about it. Even if we don't do this, I would like this called out explicitly in the alternative approaches section as something that we considered and rejected, with details why,
>>>>> >>>>>>>
>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an additional call out here that all operations are GET because nodes cannot be changed at runtime.
>>>>> >>>>>>> CLARIFICATION: How does this interact with the previous OVERSEER preference role?
>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of available roles for a cluster. I _think_ this could be based on the version that the cluster is running? Would be useful to be able to interrogate a cluster in the future... we're seeing OOM issues on queries, can we add some query nodes? When were they introduced?
>>>>> >>>>>>> I don't know what path this API should exist at.
>>>>> >>>>>>
>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP document. Not sure if there's a better path that we could go for.
>>>>> >>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which parts are string literals and which parts are meant to be substituted by the operator? GET /api/cluster/roles/data would become GET /api/cluster/roles/${rolename} in our SIP/documentation.
>>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1 should be GET /api/cluster/roles/${nodename} dropping the intermediate "nodes"
>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that intermediate "nodes" node.
>>>>> >>>>>>>
>>>>> >>>>>>> CLARIFICATION: Should listing roles require some permissions? Maybe this requirement is too fundamental to the operation of a cluster and everybody would have to be able to do it.
>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat roles? Implementation detail that the servers will figure out? Or strict guidance where the client needs to check where specific roles are before sending any further communication to the server?
>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that it can't fulfil? An overseer node gets a query or an update. A data node gets a collection creation request. Do they forward it on to an appropriate node, or do they reject it? Should this be configurable? If not, then it seems like lazy or poorly configured clients will defeat this isolation system quite easily.
>>>>> >>>>>>>
>>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when roles are added mean? I thought we established that they are not dynamic.
>>>>> >>>>>>>
>>>>> >>>>>>> Thanks,
>>>>> >>>>>>> Mike
>>>>> >>>>>>>
>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> Hi,
>>>>> >>>>>>>>
>>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>> >>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>> >>>>>>>>
>>>>> >>>>>>>> We also wish to add first class support for Query nodes that are used to process user queries by forwarding to data nodes, merging/aggregating them and presenting to users. This concept exists as first class citizens in most other search engines. This is a chance for Solr to catch up.
>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>> >>>>>>>>
>>>>> >>>>>>>> Regards,
>>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>>> >>>
>>>>> >>> --
>>>>> >>> http://www.needhamsoftware.com (work)
>>>>> >>> http://www.the111shift.com (play)
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
>>>>> For additional commands, e-mail: dev-h...@solr.apache.org
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)