I really don't think we should have types of roles. Not negative/positive and not strict/non-strict. You have a role or you don't. What that means is up to the code implementing the role.
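To make "you have a role or you don't" concrete, here is a minimal sketch in Java. Everything in it is hypothetical: the names NodeRoles and RoleBehavior, the comma-separated input format, and the upper-casing of role names are assumptions for illustration only, not anything the SIP defines. The DATA default follows the thread's description that data is the role a node gets when none is defined.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: a node's roles are a plain set of names, with no strict/loose or +/- typing. */
final class NodeRoles {
  /** Assumed default when nothing is configured (the thread describes data as the default role). */
  static final String DEFAULT_ROLE = "DATA";

  private final Set<String> roles;

  private NodeRoles(Set<String> roles) {
    this.roles = Collections.unmodifiableSet(roles);
  }

  /** Parses a comma-separated role list; this input format is an assumption, not the SIP's. */
  static NodeRoles parse(String configured) {
    Set<String> parsed = new LinkedHashSet<>();
    if (configured != null) {
      for (String token : configured.split(",")) {
        String name = normalize(token);
        if (!name.isEmpty()) {
          parsed.add(name);
        }
      }
    }
    if (parsed.isEmpty()) {
      parsed.add(DEFAULT_ROLE); // nothing explicit configured: fall back to the default
    }
    return new NodeRoles(parsed);
  }

  /** The only question the framework answers: does the node have the role, yes or no. */
  boolean hasRole(String name) {
    return roles.contains(normalize(name));
  }

  Set<String> names() {
    return roles;
  }

  private static String normalize(String raw) {
    return raw.trim().toUpperCase(Locale.ROOT);
  }
}

/** Hypothetical hook: the role's own code decides what having (or lacking) the role implies. */
interface RoleBehavior {
  String describe(boolean nodeHasRole);
}

final class RoleDemo {
  public static void main(String[] args) {
    NodeRoles node = NodeRoles.parse("PREFERRED_OVERSEER, DATA");
    // The preference semantics live in the role's code, not in the role framework.
    RoleBehavior overseerPref = has -> has ? "front of the election queue" : "no preference";
    System.out.println(overseerPref.describe(node.hasRole("PREFERRED_OVERSEER")));
  }
}
```

The point of the split is that NodeRoles never interprets a role name; any preference, prohibition, or enablement is expressed by whatever implements the role.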
Roles should be free to configure a preference order (binary, n-ary, or whatever; strict or loose), prohibit behavior, or enable behavior. In this SIP I feel we should focus on how to identify which node has which role, how to designate what roles a node has via config/params, and the APIs for interacting with roles. We should, for example, be able to support roles such as PREFERRED_OVERSEER, DATA, NO_ROUTED_ALIAS (just an example, not something I mean to suggest).

Details about role implementation should probably be discussed in a thread about that role. Obviously we should think about the name carefully to leave options open should we want to enhance things later, so maybe OVERSEER_PREF or just OVERSEER would be better, since it merely reads that the node implements some sort of preference or config regarding overseer... but all this can be decided on a per-role basis.

On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <noble.p...@gmail.com> wrote:

> Negative roles have a place
>
> Example is overseer
>
> There are 3 possible choices for that role
>
> a) preferred: always be in front of the election queue
> b) on: not preferred, but can be an overseer if no preferred overseer nodes are available
> c) off: never become an overseer
>
> Today we only have options 'a' and 'b' . In a future ticket, we may implement C
>
> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <md...@mdrob.com> wrote:
>
>> Negative roles add a lot of complexity, I would really want to stay away from them. That’s why I want strict roles up front. It’s maybe ok to push this decision out, but it also seems like the sort of thing we should consider at the start.
>>
>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <noble.p...@gmail.com> wrote:
>>
>>> Yes. Negative roles is not a bad idea.
>>> If I start a node for machine learning purposes, I wouldn't want that node to ever participate in overseer election
>>>
>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>>>
>>>> If we have non strict roles (like overseer), then it does make sense to have negative roles.
>>>> That way I can define which are the two nodes that I'd prefer the overseer to run on, and a few other nodes on which it should definitely never run for various reasons. And in case these "!overseer" are the only nodes left in the cluster, let the cluster fail the same way it would if there were no data nodes available.
>>>>
>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <houstonput...@gmail.com> wrote:
>>>> >>>
>>>> >>> With the Strict/Loose option and sensible defaults, users cannot trip themselves up by default, but the option is there for people to tinker and have an iron grip over their cluster.
>>>> >>
>>>> >>
>>>> >> +1 to sensible defaults so users don't trip themselves. The option to tinker for tighter grip can be tackled later, either on a per role basis or as a generic concept later.
>>>> >
>>>> >
>>>> > +1 - Can definitely be added later if we so desire, not needed for this SIP
>>>> >
>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <gus.h...@gmail.com> wrote:
>>>> >>>
>>>> >>> I think the key is to let the roles have full control of the implications of having/not having that role. No need for even a strict/loose designation. The question of do you have the role is yes/no with no logic to guess if the role is implied or not, The question of will it come up with the role is "have_explicit ? use_defaults : use_defaults.
>>>> >>>
>>>> >>> Once you figure out who has a role (or not) what that means is up to the role code.
>>>> >>>
>>>> >>> Corollary: we don't have to change the way overseer works in this SIP. We can rework it or not as we see fit separately.
>>>> >>
>>>> >>
>>>> >> +1
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>> Only thing we need to do is find a wording that makes the above clear on first read through the SIP :)
>>>> >>>
>>>> >>> -Gus
>>>> >>>
>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <houstonput...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>> This doesn't really address my concern around what happens if all of our existing OVERSEER candidates are down. When at least one of them is up, the overseer will go there, and that is good and expected. But what happens if all of the overseer eligible nodes are down. Your comment, and the old system, would imply that the overseer election goes to some other unrelated, untagged node. I disagree with this implementation choice. This sounds like something role specific to determine, but I would like to see us be more strict about it. I don't want cores leaking out of my data roles, I don't want query processing to leak out of my "query" nodes or whatever. Overseer shouldn't be special in this regard.
>>>> >>>>
>>>> >>>>
>>>> >>>> I'm very strongly in favor of not letting users design a system in which the cluster can be "live" without an overseer. I understand that the overseer can be taxing to the cluster, but honestly what is the point of having an untaxed cluster that doesn't have an overseer? I can see arguments for the other roles to be stricter about this, but there are also a lot of users who wouldn't want those to be strict either (like "query" nodes).
>>>> >>>>
>>>> >>>> Maybe we just put in stronger guarantees that if a non-overseer role node HAS to be selected to become overseer, it will try to migrate the overseer job to a node with the overseer role whenever one becomes live.
>>>> >>>>
>>>> >>>> So maybe we don't have special rules per role, but instead roles can either be defined as "Strict" or "Loose" (better names likely exist), and the roles come with a default (Overseer -> Loose, Data -> Strict, Query -> Loose, etc.). And it is up to each role to define how to behave when running in LOOSE mode and a non-role node is used then a role node comes online (like the overseer example given above).
>>>> >>>>
>>>> >>>> With the Strict/Loose option and sensible defaults, users cannot trip themselves up by default, but the option is there for people to tinker and have an iron grip over their cluster.
>>>> >>>>
>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <md...@mdrob.com> wrote:
>>>> >>>>>
>>>> >>>>> Noble wrote:
>>>> >>>>> > We are not modifying the way the "overseer role" works today. We are just changing the definition and standardizing the configuration & discoverability
>>>> >>>>> Ishan wrote:
>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER role (which currently stands for preferred overseer). We can take a stab at refactoring it later.
>>>> >>>>>
>>>> >>>>> Grouping these two comments together, since I think they are saying the same thing. I think this is part of my confusion. We have an old system that doesn't work the way we want the new system to work. There may be people already using the old system. What path do we offer for folks using the old system to migrate to the new system? What happens if somebody accidentally tries to use both systems at the same time?
>>>> >>>>>
>>>> >>>>> Ishan wrote:
>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER role] are live, Solr guarantees that one of those nodes becomes the overseer.", I meant to somewhat capture the current behaviour as the OVERSEER role performs today. Do you see any inconsistency with this statement vs. what it does today?
>>>> >>>>>
>>>> >>>>> This doesn't really address my concern around what happens if all of our existing OVERSEER candidates are down. When at least one of them is up, the overseer will go there, and that is good and expected. But what happens if all of the overseer eligible nodes are down. Your comment, and the old system, would imply that the overseer election goes to some other unrelated, untagged node. I disagree with this implementation choice. This sounds like something role specific to determine, but I would like to see us be more strict about it. I don't want cores leaking out of my data roles, I don't want query processing to leak out of my "query" nodes or whatever. Overseer shouldn't be special in this regard.
>>>> >>>>>
>>>> >>>>> Noble wrote:
>>>> >>>>> > If we do that how do we know if xyz is a role or a node in the following request?
>>>> >>>>>
>>>> >>>>> You're absolutely correct, thanks for pointing this out. Let's leave it as is.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <md...@mdrob.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Replying to the top post in this thread because there has been a lot of discussion and I don't want to look like I'm continuing any of those particular threads.
>>>> >>>>>>>
>>>> >>>>>>> I finally had time to sit down and think about this with the attention it deserves and am generally happy with how the conversation has shaped the current proposal.
>>>> >>>>>>>
>>>> >>>>>>> GOOD: I think using system properties to define node roles is fine and I like that data is the default role when not defined. I think it is important to hold on to the guarantee that an active overseer will land on an overseer node role.
>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for folks using the current OVERSEER role. I am not sure that something can be done automatically since they need to now specify new properties at startup. Maybe we need to include loud warnings or support both approaches for a time?
>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes fail, then it is implied the overseer will go to one of the data nodes. The specific wording in the SIP - "When one or more such nodes are live, Solr guarantees that one of those nodes become the overseer." implies to me that failover could go from overseer1 to overseer2 to overseerN to random node. I feel like we need to have some recording that there were dedicated overseer nodes and stop the cascading failure instead of churning through our data nodes.
>>>> >>>>>>>
>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of "coordinator" roles from a split query/indexing standpoint. I understand that these are used as examples, but would like stronger language that new roles should also go through their own SIP discussions.
>>>> >>>>>>>
>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node liveness in two different places now. We have the live nodes and we have the node roles stored in two different places in zookeeper and it feels like this would lead to race conditions or split brain or other hard to diagnose bugs when those two lists don't agree with each other. This also feels like it contradicts the "single source of truth" idea later stated in the proposal. I see Gus's arguments for decoupling these and am not strongly opposed, I just get a lurking feeling about it.
>>>> >>>>>>> Even if we don't do this, I would like this called out explicitly in the alternative approaches section as something that we considered and rejected, with details why,
>>>> >>>>>>>
>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an additional call out here that all operations are GET because nodes cannot be changed at runtime.
>>>> >>>>>>> CLARIFICATION: How does this interact with the previous OVERSEER preference role?
>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of available roles for a cluster. I _think_ this could be based on the version that the cluster is running? Would be useful to be able to interrogate a cluster in the future... we're seeing OOM issues on queries, can we add some query nodes? When were they introduced? I don't know what path this API should exist at.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP document. Not sure if there's a better path that we could go for.
>>>> >>>>>>
>>>> >>>>>>>
>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which parts are string literals and which parts are meant to be substituted by the operator? GET /api/cluster/roles/data would become GET /api/cluster/roles/${rolename} in our SIP/documentation.
>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1 should be GET /api/cluster/roles/${nodename} dropping the intermediate "nodes"
>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that intermediate "nodes" node.
>>>> >>>>>>>
>>>> >>>>>>> CLARIFICATION: Should listing roles require some permissions? Maybe this requirement is too fundamental to the operation of a cluster and everybody would have to be able to do it.
>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to treat roles? Implementation detail that the servers will figure out?
>>>> >>>>>>> Or strict guidance where the client needs to check where specific roles are before sending any further communication to the server?
>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that it can't fulfil? An overseer node gets a query or an update. A data node gets a collection creation request. Do they forward it on to an appropriate node, or do they reject it? Should this be configurable? If not, then it seems like lazy or poorly configured clients will defeat this isolation system quite easily.
>>>> >>>>>>>
>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when roles are added mean? I thought we established that they are not dynamic.
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> Thanks,
>>>> >>>>>>> Mike
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi,
>>>> >>>>>>>>
>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>> >>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>> >>>>>>>>
>>>> >>>>>>>> We also wish to add first class support for Query nodes that are used to process user queries by forwarding to data nodes, merging/aggregating them and presenting to users. This concept exists as first class citizens in most other search engines. This is a chance for Solr to catch up.
>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>> >>>>>>>>
>>>> >>>>>>>> Regards,
>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> http://www.needhamsoftware.com (work)
>>>> >>> http://www.the111shift.com (play)

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)