I recommend the following format for the role spec:

roles=<role-name>:<role-value>
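
To make the format concrete, here is a minimal sketch of how such a spec string could be parsed. It is Python rather than Solr code, and it assumes the role names, allowed values and defaults listed further down in this message, with data defaulting to on as the first example implies. ROLE_SPECS and parse_roles are hypothetical names, not existing Solr APIs.

```python
# Hypothetical sketch: role names, allowed values and defaults are taken from
# the proposal in this message; none of this is Solr's actual implementation.
ROLE_SPECS = {
    "data": {"values": ("on", "off"), "default": "on"},
    "overseer": {"values": ("allowed", "disallowed", "preferred"), "default": "allowed"},
    "coordinator": {"values": ("on", "off"), "default": "off"},
}


def parse_roles(spec=None):
    """Parse 'name:value,name:value' into a dict, filling in defaults."""
    # Start from the defaults so that omitted roles still get a value.
    roles = {name: info["default"] for name, info in ROLE_SPECS.items()}
    if not spec:
        return roles
    for item in spec.split(","):
        name, sep, value = item.strip().partition(":")
        if not sep:
            raise ValueError(f"expected <role-name>:<role-value>, got {item!r}")
        if name not in ROLE_SPECS:
            raise ValueError(f"unknown role {name!r}")
        if value not in ROLE_SPECS[name]["values"]:
            raise ValueError(f"role {name!r} does not allow value {value!r}")
        roles[name] = value
    return roles
```

With this sketch, parse_roles("coordinator:on") and parse_roles("coordinator:on,data:on") give the same result, and an unknown role name or value fails fast with a ValueError.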
Each role will have an enum of allowed values and a default value:

- role name: *data*
  - values: [*on*, *off*]
  - default: *on*
- role name: *overseer*
  - values: [*allowed*, *disallowed*, *preferred*]
  - default: *allowed*
- role name: *coordinator*
  - values: [*on*, *off*]
  - default: *off*

Examples:

roles=data:on,overseer:allowed
(This is redundant because it uses all the default values. If a node is started without any roles value, this is the default behavior.)

roles=data:off,overseer:preferred
(do not allow data, join the overseer election at the head of the queue)

roles=coordinator:on,data:on
(act as coordinator but allow data; this is the same as roles=coordinator:on)

roles=coordinator:on,data:off
(act as coordinator, disallow data)

On Sun, Dec 5, 2021 at 11:01 AM Ilan Ginzburg <ilans...@gmail.com> wrote:

> If we go with no negative node roles and overseer node role is not strict (i.e. it’s a "preferred overseer"), then one would need to define a second node role "no_overseer" to explicitly exclude a node from ever becoming overseer (which I think is a useful feature until we switch the cluster default to not using the overseer), plus the implementation of these two node roles will obviously be coupled (and what if a node has both defined?).
>
> I prefer strict node roles.
> Maybe we could have node roles with [optional] parameters to let the node role implementation decide?
> The overseer node role for example could have one of 3 values defined for each node: “preferred” (default, equivalent to the existing overseer role), "accepted" (equivalent to currently not defining the overseer role) and "no_way" (does not exist today).
>
> This could be useful in other contexts. A node role “data” could be “fast” or “slow” depending on type of local persistent storage for example…
>
> Ilan
>
> On Fri 3 Dec 2021 at 16:10, Gus Heck <gus.h...@gmail.com> wrote:
>
>> I really don't think we should have types of roles. Not negative/positive and not strict/non-strict.
>> You have a role or you don't. What that means is up to the code implementing the role.
>>
>> Roles should be free to configure a preference order (binary, or n-ary or whatever, strict or loose), prohibit behavior, or enable behavior. In this SIP I feel we should focus on how to identify what node has what role, how to designate what roles a node has via config/params, and the APIs for interacting with roles.
>>
>> We should for example be able to support roles such as
>>
>> PREFERRED_OVERSEER
>> DATA
>> NO_ROUTED_ALIAS (just an example, not something I mean to suggest)
>>
>> Details about role implementation should probably be discussed in a thread about that role. Obviously we should think about the name carefully to leave options open should we want to enhance things later, so maybe
>>
>> OVERSEER_PREF or just OVERSEER
>>
>> would be better since it merely reads that the node implements some sort of preference or config regarding overseer... but all this can be decided on a per role basis
>>
>> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <noble.p...@gmail.com> wrote:
>>
>>> Negative roles have a place
>>>
>>> Example is overseer
>>>
>>> There are 3 possible choices for that role
>>>
>>> a) preferred: always be in front of the election queue
>>> b) on: not preferred, but can be an overseer if no preferred overseer nodes are available
>>> c) off: never become an overseer
>>>
>>> Today we only have options 'a' and 'b'. In a future ticket, we may implement 'c'
>>>
>>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <md...@mdrob.com> wrote:
>>>
>>>> Negative roles add a lot of complexity, I would really want to stay away from them. That’s why I want strict roles up front. It’s maybe ok to push this decision out, but it also seems like the sort of thing we should consider at the start.
>>>>
>>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <noble.p...@gmail.com> wrote:
>>>>
>>>>> Yes. Negative roles is not a bad idea.
>>>>> If I start a node for machine learning purposes, I wouldn't want that node to ever participate in overseer election
>>>>>
>>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>>>>>
>>>>>> If we have non-strict roles (like overseer), then it does make sense to have negative roles.
>>>>>> That way I can define which are the two nodes that I'd prefer the overseer to run on, and a few other nodes on which it should definitely never run for various reasons. And in case these "!overseer" are the only nodes left in the cluster, let the cluster fail the same way it would if there were no data nodes available.
>>>>>>
>>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <houstonput...@gmail.com> wrote:
>>>>>> >>>
>>>>>> >>> With the Strict/Loose option and sensible defaults, users cannot trip themselves up by default, but the option is there for people to tinker and have an iron grip over their cluster.
>>>>>> >>
>>>>>> >> +1 to sensible defaults so users don't trip themselves. The option to tinker for tighter grip can be tackled later, either on a per role basis or as a generic concept later.
>>>>>> >
>>>>>> > +1 - Can definitely be added later if we so desire, not needed for this SIP
>>>>>> >
>>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>>>> >>
>>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <gus.h...@gmail.com> wrote:
>>>>>> >>>
>>>>>> >>> I think the key is to let the roles have full control of the implications of having/not having that role. No need for even a strict/loose designation. The question of do you have the role is yes/no with no logic to guess if the role is implied or not. The question of will it come up with the role is "have_explicit ? use_defaults : use_defaults".
>>>>>> >>>
>>>>>> >>> Once you figure out who has a role (or not) what that means is up to the role code.
>>>>>> >>>
>>>>>> >>> Corollary: we don't have to change the way overseer works in this SIP. We can rework it or not as we see fit separately.
>>>>>> >>
>>>>>> >> +1
>>>>>> >>
>>>>>> >>>
>>>>>> >>> Only thing we need to do is find a wording that makes the above clear on first read through the SIP :)
>>>>>> >>>
>>>>>> >>> -Gus
>>>>>> >>>
>>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <houstonput...@gmail.com> wrote:
>>>>>> >>>>>
>>>>>> >>>>> This doesn't really address my concern around what happens if all of our existing OVERSEER candidates are down. When at least one of them is up, the overseer will go there, and that is good and expected. But what happens if all of the overseer eligible nodes are down? Your comment, and the old system, would imply that the overseer election goes to some other unrelated, untagged node. I disagree with this implementation choice. This sounds like something role specific to determine, but I would like to see us be more strict about it. I don't want cores leaking out of my data roles, I don't want query processing to leak out of my "query" nodes or whatever. Overseer shouldn't be special in this regard.
>>>>>> >>>>
>>>>>> >>>> I'm very strongly in favor of not letting users design a system in which the cluster can be "live" without an overseer. I understand that the overseer can be taxing to the cluster, but honestly what is the point of having an untaxed cluster that doesn't have an overseer? I can see arguments for the other roles to be stricter about this, but there are also a lot of users who wouldn't want those to be strict either (like "query" nodes).
>>>>>> >>>>
>>>>>> >>>> Maybe we just put in stronger guarantees that if a non-overseer role node HAS to be selected to become overseer, it will try to migrate the overseer job to a node with the overseer role whenever one becomes live.
>>>>>> >>>>
>>>>>> >>>> So maybe we don't have special rules per role, but instead roles can either be defined as "Strict" or "Loose" (better names likely exist), and the roles come with a default (Overseer -> Loose, Data -> Strict, Query -> Loose, etc.). And it is up to each role to define how to behave when running in LOOSE mode and a non-role node is used then a role node comes online (like the overseer example given above).
>>>>>> >>>>
>>>>>> >>>> With the Strict/Loose option and sensible defaults, users cannot trip themselves up by default, but the option is there for people to tinker and have an iron grip over their cluster.
>>>>>> >>>>
>>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <md...@mdrob.com> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Noble wrote:
>>>>>> >>>>> > We are not modifying the way the "overseer role" works today. We are just changing the definition and standardizing the configuration & discoverability
>>>>>> >>>>> Ishan wrote:
>>>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER role (which currently stands for preferred overseer). We can take a stab at refactoring it later.
>>>>>> >>>>>
>>>>>> >>>>> Grouping these two comments together, since I think they are saying the same thing. I think this is part of my confusion. We have an old system that doesn't work the way we want the new system to work. There may be people already using the old system. What path do we offer for folks using the old system to migrate to the new system? What happens if somebody accidentally tries to use both systems at the same time?
>>>>>> >>>>>
>>>>>> >>>>> Ishan wrote:
>>>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER role] are live, Solr guarantees that one of those nodes becomes the overseer.", I meant to somewhat capture the current behaviour as the OVERSEER role performs today. Do you see any inconsistency with this statement vs. what it does today?
>>>>>> >>>>>
>>>>>> >>>>> This doesn't really address my concern around what happens if all of our existing OVERSEER candidates are down. When at least one of them is up, the overseer will go there, and that is good and expected. But what happens if all of the overseer eligible nodes are down? Your comment, and the old system, would imply that the overseer election goes to some other unrelated, untagged node. I disagree with this implementation choice. This sounds like something role specific to determine, but I would like to see us be more strict about it. I don't want cores leaking out of my data roles, I don't want query processing to leak out of my "query" nodes or whatever. Overseer shouldn't be special in this regard.
>>>>>> >>>>>
>>>>>> >>>>> Noble wrote:
>>>>>> >>>>> > If we do that how do we know if xyz is a role or a node in the following request?
>>>>>> >>>>>
>>>>>> >>>>> You're absolutely correct, thanks for pointing this out. Let's leave it as is.
>>>>>> >>>>>
>>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <md...@mdrob.com> wrote:
>>>>>> >>>>>>>
>>>>>> >>>>>>> Replying to the top post in this thread because there has been a lot of discussion and I don't want to look like I'm continuing any of those particular threads.
>>>>>> >>>>>>>
>>>>>> >>>>>>> I finally had time to sit down and think about this with the attention it deserves and am generally happy with how the conversation has shaped the current proposal.
>>>>>> >>>>>>>
>>>>>> >>>>>>> GOOD: I think using system properties to define node roles is fine and I like that data is the default role when not defined. I think it is important to hold on to the guarantee that an active overseer will land on an overseer node role.
>>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for folks using the current OVERSEER role. I am not sure that something can be done automatically since they need to now specify new properties at startup. Maybe we need to include loud warnings or support both approaches for a time?
>>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes fail, then it is implied the overseer will go to one of the data nodes. The specific wording in the SIP - "When one or more such nodes are live, Solr guarantees that one of those nodes become the overseer." implies to me that failover could go from overseer1 to overseer2 to overseerN to random node. I feel like we need to have some recording that there were dedicated overseer nodes and stop the cascading failure instead of churning through our data nodes.
>>>>>> >>>>>>>
>>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of "coordinator" roles from a split query/indexing standpoint. I understand that these are used as examples, but would like stronger language that new roles should also go through their own SIP discussions.
>>>>>> >>>>>>>
>>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node liveness in two different places now.
>>>>>> >>>>>>> We have the live nodes and we have the node roles stored in two different places in zookeeper and it feels like this would lead to race conditions or split brain or other hard to diagnose bugs when those two lists don't agree with each other. This also feels like it contradicts the "single source of truth" idea later stated in the proposal. I see Gus's arguments for decoupling these and am not strongly opposed, I just get a lurking feeling about it. Even if we don't do this, I would like this called out explicitly in the alternative approaches section as something that we considered and rejected, with details why.
>>>>>> >>>>>>>
>>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an additional call out here that all operations are GET because nodes cannot be changed at runtime.
>>>>>> >>>>>>> CLARIFICATION: How does this interact with the previous OVERSEER preference role?
>>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of available roles for a cluster. I _think_ this could be based on the version that the cluster is running? Would be useful to be able to interrogate a cluster in the future... we're seeing OOM issues on queries, can we add some query nodes? When were they introduced? I don't know what path this API should exist at.
>>>>>> >>>>>>
>>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP document. Not sure if there's a better path that we could go for.
>>>>>> >>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which parts are string literals and which parts are meant to be substituted by the operator? GET /api/cluster/roles/data would become GET /api/cluster/roles/${rolename} in our SIP/documentation.
>>>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1 should be GET /api/cluster/roles/${nodename} dropping the intermediate "nodes"
>>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that intermediate "nodes" node.
>>>>>> >>>>>>>
>>>>>> >>>>>>> CLARIFICATION: Should listing roles require some permissions? Maybe this requirement is too fundamental to the operation of a cluster and everybody would have to be able to do it.
>>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat roles? Implementation detail that the servers will figure out? Or strict guidance where the client needs to check where specific roles are before sending any further communication to the server?
>>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that it can't fulfil? An overseer node gets a query or an update. A data node gets a collection creation request. Do they forward it on to an appropriate node, or do they reject it? Should this be configurable? If not, then it seems like lazy or poorly configured clients will defeat this isolation system quite easily.
>>>>>> >>>>>>>
>>>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when roles are added mean? I thought we established that they are not dynamic.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Thanks,
>>>>>> >>>>>>> Mike
>>>>>> >>>>>>>
>>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Hi,
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>> >>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> We also wish to add first class support for Query nodes that are used to process user queries by forwarding to data nodes, merging/aggregating them and presenting to users. This concept exists as first class citizens in most other search engines. This is a chance for Solr to catch up.
>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Regards,
>>>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>>>> >>>
>>>>>> >>> --
>>>>>> >>> http://www.needhamsoftware.com (work)
>>>>>> >>> http://www.the111shift.com (play)

--
-----------------------------------------------------
Noble Paul