Re: First class support for node roles

Ilan Ginzburg Sat, 04 Dec 2021 16:01:40 -0800

If we go with no negative node roles and overseer node role is not strict
(i.e. it’s a "preferred overseer"), then one would need to define a second
node role "no_overseer" to explicitly exclude a node from ever becoming
overseer (which I think is a useful feature until we switch the cluster
default to not using the overseer), plus the implementation of these two
node roles will obviously be coupled (and what if a node has both defined?).
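
To make the coupling concrete, here is a minimal sketch in Java of how the two roles would have to be resolved together. Everything in it is hypothetical: the class, the Eligibility enum, and the role strings "overseer" and "no_overseer" are illustration only, not an existing Solr API. It assumes a node declaring both roles is rejected, which is only one possible answer to the question above.

```java
import java.util.Set;

// Hypothetical sketch: the role names and this class are illustrative only.
class OverseerRoleConflictSketch {

    enum Eligibility { PREFERRED, ALLOWED, EXCLUDED }

    // Resolves one node's overseer eligibility from its declared role names.
    // Assumption: a node declaring both roles is a configuration error.
    static Eligibility resolve(Set<String> roles) {
        boolean preferred = roles.contains("overseer");
        boolean excluded = roles.contains("no_overseer");
        if (preferred && excluded) {
            throw new IllegalArgumentException(
                "Node declares both 'overseer' and 'no_overseer'");
        }
        if (excluded) {
            return Eligibility.EXCLUDED;
        }
        if (preferred) {
            return Eligibility.PREFERRED;
        }
        // Neither role declared: the node may still become overseer.
        return Eligibility.ALLOWED;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Set.of("data", "overseer")));
        System.out.println(resolve(Set.of("data")));
        System.out.println(resolve(Set.of("data", "no_overseer")));
    }
}
```

Neither role can be interpreted without checking for the other, which is the coupling mentioned above.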


I prefer strict node roles.
Maybe we could have node roles with [optional] parameters to let the node
role implementation decide?
The overseer node role for example could have one of 3 values defined for
each node: “preferred” (default, equivalent to the existing overseer role),
"accepted" (equivalent to currently not defining the overseer role) and
"no_way" (does not exist today).

This could be useful in other contexts. A node role “data” could be “fast”
or “slow” depending on type of local persistent storage for example…
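
The three-valued overseer parameter described above could be sketched roughly as follows, in Java. All names here (OverseerMode, parse, pickOverseer) are hypothetical and not part of Solr. The sketch assumes that a node which does not define the overseer role at all is treated as "accepted", and that the first live node in iteration order wins within each tier.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a node role carrying an optional parameter.
class OverseerModeSketch {

    enum OverseerMode {
        PREFERRED, ACCEPTED, NO_WAY;

        // A role given without a parameter defaults to PREFERRED.
        static OverseerMode parse(String param) {
            if (param == null || param.isEmpty()) {
                return PREFERRED;
            }
            switch (param) {
                case "preferred": return PREFERRED;
                case "accepted": return ACCEPTED;
                case "no_way": return NO_WAY;
                default:
                    throw new IllegalArgumentException("Unknown overseer mode: " + param);
            }
        }
    }

    // Picks an overseer among live nodes: a PREFERRED node if one exists,
    // else the first ACCEPTED node, and never a NO_WAY node.
    static Optional<String> pickOverseer(Map<String, OverseerMode> liveNodes) {
        String fallback = null;
        for (Map.Entry<String, OverseerMode> e : liveNodes.entrySet()) {
            if (e.getValue() == OverseerMode.PREFERRED) {
                return Optional.of(e.getKey());
            }
            if (e.getValue() == OverseerMode.ACCEPTED && fallback == null) {
                fallback = e.getKey();
            }
        }
        return Optional.ofNullable(fallback);
    }

    public static void main(String[] args) {
        Map<String, OverseerMode> live = new LinkedHashMap<>();
        live.put("node1", OverseerMode.parse("accepted"));
        live.put("node2", OverseerMode.parse("no_way"));
        live.put("node3", OverseerMode.parse(null));
        System.out.println(pickOverseer(live));
    }
}
```

If only "no_way" nodes are live, pickOverseer returns an empty Optional, i.e. the cluster runs without an overseer rather than placing it on an excluded node.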

Ilan

On Fri 3 Dec 2021 at 16:10, Gus Heck <gus.h...@gmail.com> wrote:

> I really don't think we should have types of roles. Not negative/positive
> and not strict/non-strict. You have a role or you don't. What that means is
> up to the code implementing the role.
>
> Roles should be free to configure a preference order (binary, or n-ary or
> whatever, strict or loose), prohibit behavior, or enable behavior. In this
> SIP I feel we should focus on How to identify what node has what role, How
> to designate what roles a node has via config/params, and the APIs for
> interacting with roles.
>
> We should for example be able to support roles such as
>
> PREFERRED_OVERSEER
> DATA
> NO_ROUTED_ALIAS  (just an example, not something I mean to suggest)
>
> Details about role implementation should probably be discussed in a thread
> about that role.  Obviously we should think about the name carefully to
> leave options open should we want to enhance things later so maybe
>
> OVERSEER_PREF  or just  OVERSEER
>
> would be better since it merely reads that the node implements some sort
> of preference or config regarding overseer... but all this can be decided
> on a per role basis
>
> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <noble.p...@gmail.com> wrote:
>
>> Negative roles have a place
>>
>> Example is overseer
>>
>> There are 3 possible choices for that role
>>
>> a) preferred: always be in front of the election queue
>> b) on: not preferred, but can be an overseer if no preferred overseer
>> nodes are available
>> c) off: never become an overseer
>>
>> Today we only have options 'a' and 'b'. In a future ticket, we may
>> implement 'c'.
>>
>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <md...@mdrob.com> wrote:
>>
>>> Negative roles add a lot of complexity, I would really want to stay away
>>> from them. That’s why I want strict roles up front. It’s maybe ok to push
>>> this decision out, but it also seems like the sort of thing we should
>>> consider at the start.
>>>
>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <noble.p...@gmail.com> wrote:
>>>
>>>> Yes. Negative roles are not a bad idea. If I start a node for
>>>> machine learning purposes, I wouldn't want that node to ever participate in
>>>> overseer election
>>>>
>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>>>>
>>>>> If we have non strict roles (like overseer), then it does make sense
>>>>> to have negative roles.
>>>>> That way I can define which are the two nodes that I'd prefer the
>>>>> overseer to run on, and a few other nodes on which it should
>>>>> definitely never run for various reasons. And in case these
>>>>> "!overseer" are the only nodes left in the cluster, let the cluster
>>>>> fail the same way it would if there were no data nodes available.
>>>>>
>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <houstonput...@gmail.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> With the Strict/Loose option and sensible defaults, users cannot
>>>>> trip themselves up by default, but the option is there for people to
>>>>> tinker and have an iron grip over their cluster.
>>>>> >>
>>>>> >>
>>>>> >> +1 to sensible defaults so users don't trip themselves up. The option
>>>>> to tinker for a tighter grip can be tackled later, either on a per-role
>>>>> basis or as a generic concept.
>>>>> >
>>>>> >
>>>>> > +1 - Can definitely be added later if we so desire, not needed for
>>>>> this SIP
>>>>> >
>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <
>>>>> ichattopadhy...@gmail.com> wrote:
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <gus.h...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> I think the key is to let the roles have full control of the
>>>>> implications of having/not having that role. No need for even a
>>>>> strict/loose designation. The question of do you have the role is yes/no
>>>>> with no logic to guess if the role is implied or not. The question of will
>>>>> it come up with the role is "have_explicit ? use_explicit : use_defaults".
>>>>> >>>
>>>>> >>> Once you figure out who has a role (or not) what that means is up
>>>>> to the role code.
>>>>> >>>
>>>>> >>> Corollary: we don't have to change the way overseer works in this
>>>>> SIP. We can rework it or not as we see fit separately.
>>>>> >>
>>>>> >>
>>>>> >> +1
>>>>> >>
>>>>> >>>
>>>>> >>>
>>>>> >>> Only thing we need to do is find a wording that makes the above
>>>>> clear on first read through the SIP :)
>>>>> >>>
>>>>> >>> -Gus
>>>>> >>>
>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <
>>>>> houstonput...@gmail.com> wrote:
>>>>> >>>>>
>>>>> >>>>> This doesn't really address my concern around what happens if
>>>>> all of our existing OVERSEER candidates are down. When at least one of
>>>>> them is up, the overseer will go there, and that is good and expected. But what
>>>>> happens if all of the overseer-eligible nodes are down? Your comment, and
>>>>> the old system, would imply that the overseer election goes to some other
>>>>> unrelated, untagged node. I disagree with this implementation choice. This
>>>>> sounds like something role specific to determine, but I would like to see
>>>>> us be more strict about it. I don't want cores leaking out of my data
>>>>> roles, I don't want query processing to leak out of my "query" nodes or
>>>>> whatever. Overseer shouldn't be special in this regard.
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> I'm very strongly in favor of not letting users design a system
>>>>> in which the cluster can be "live" without an overseer. I understand that
>>>>> the overseer can be taxing to the cluster, but honestly what is the point
>>>>> of having an untaxed cluster that doesn't have an overseer? I can see
>>>>> arguments for the other roles to be stricter about this, but there are
>>>>> also a lot of users who wouldn't want those to be strict either (like "query"
>>>>> nodes).
>>>>> >>>>
>>>>> >>>> Maybe we just put in stronger guarantees that if a non-overseer
>>>>> role node HAS to be selected to become overseer, it will try to migrate
>>>>> the overseer job to a node with the overseer role whenever one becomes live.
>>>>> >>>>
>>>>> >>>> So maybe we don't have special rules per role, but instead roles
>>>>> can either be defined as "Strict" or "Loose" (better names likely exist),
>>>>> and the roles come with a default (Overseer -> Loose, Data -> Strict,
>>>>> Query -> Loose, etc.). And it is up to each role to define how to behave when
>>>>> running in LOOSE mode and a non-role node is used and then a role node comes
>>>>> online (like the overseer example given above).
>>>>> >>>>
>>>>> >>>> With the Strict/Loose option and sensible defaults, users cannot
>>>>> trip themselves up by default, but the option is there for people to
>>>>> tinker and have an iron grip over their cluster.
>>>>> >>>>
>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <md...@mdrob.com> wrote:
>>>>> >>>>>
>>>>> >>>>> Noble wrote:
>>>>> >>>>> > We are not modifying the way the "overseer role" works today.
>>>>> We are just changing the definition and standardizing the configuration &
>>>>> discoverability
>>>>> >>>>> Ishan wrote:
>>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER role
>>>>> (which currently stands for preferred overseer). We can take a stab at
>>>>> refactoring it later.
>>>>> >>>>>
>>>>> >>>>> Grouping these two comments together, since I think they are
>>>>> saying the same thing. I think this is part of my confusion. We have an
>>>>> old system that doesn't work the way we want the new system to work. There may
>>>>> be people already using the old system. What path do we offer for folks
>>>>> using the old system to migrate to the new system? What happens if
>>>>> somebody accidentally tries to use both systems at the same time?
>>>>> >>>>>
>>>>> >>>>> Ishan wrote:
>>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER role]
>>>>> are live, Solr guarantees that one of those nodes becomes the overseer.",
>>>>> I meant to somewhat capture the current behaviour as the OVERSEER role
>>>>> performs today. Do you see any inconsistency with this statement vs. what
>>>>> it does today?
>>>>> >>>>>
>>>>> >>>>> This doesn't really address my concern around what happens if
>>>>> all of our existing OVERSEER candidates are down. When at least one of
>>>>> them is up, the overseer will go there, and that is good and expected. But what
>>>>> happens if all of the overseer-eligible nodes are down? Your comment, and
>>>>> the old system, would imply that the overseer election goes to some other
>>>>> unrelated, untagged node. I disagree with this implementation choice. This
>>>>> sounds like something role specific to determine, but I would like to see
>>>>> us be more strict about it. I don't want cores leaking out of my data
>>>>> roles, I don't want query processing to leak out of my "query" nodes or
>>>>> whatever. Overseer shouldn't be special in this regard.
>>>>> >>>>>
>>>>> >>>>> Noble wrote:
>>>>> >>>>> > If we do that how do we know if xyz is a role or a node in the
>>>>> following request?
>>>>> >>>>>
>>>>> >>>>> You're absolutely correct, thanks for pointing this out. Let's
>>>>> leave it as is.
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>>
>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
>>>>> ichattopadhy...@gmail.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <md...@mdrob.com>
>>>>> wrote:
>>>>> >>>>>>>
>>>>> >>>>>>> Replying to the top post in this thread because there has been
>>>>> a lot of discussion and I don't want to look like I'm continuing any of
>>>>> those particular threads.
>>>>> >>>>>>>
>>>>> >>>>>>> I finally had time to sit down and think about this with the
>>>>> attention it deserves and am generally happy with how the conversation has
>>>>> shaped the current proposal.
>>>>> >>>>>>>
>>>>> >>>>>>> GOOD: I think using system properties to define node roles is
>>>>> fine and I like that data is the default role when not defined. I think it
>>>>> is important to hold on to the guarantee that an active overseer will land
>>>>> on an overseer node role.
>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for folks
>>>>> using the current OVERSEER role. I am not sure that something can be done
>>>>> automatically since they need to now specify new properties at startup.
>>>>> Maybe we need to include loud warnings or support both approaches for a
>>>>> time?
>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer
>>>>> nodes fail, then it is implied the overseer will go to one of the data
>>>>> nodes. The specific wording in the SIP - "When one or more such nodes are
>>>>> live, Solr guarantees that one of those nodes become the overseer."
>>>>> implies to me that failover could go from overseer1 to overseer2 to overseerN to
>>>>> a random node. I feel like we need to have some record that there were
>>>>> dedicated overseer nodes and stop the cascading failure instead of
>>>>> churning through our data nodes.
>>>>> >>>>>>>
>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of
>>>>> "coordinator" roles from a split query/indexing standpoint. I understand
>>>>> that these are used as examples, but would like stronger language that new
>>>>> roles should also go through their own SIP discussions.
>>>>> >>>>>>>
>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node liveness
>>>>> in two different places now. We have the live nodes and we have the node
>>>>> roles stored in two different places in zookeeper and it feels like this
>>>>> would lead to race conditions or split brain or other hard-to-diagnose
>>>>> bugs when those two lists don't agree with each other. This also feels like it
>>>>> contradicts the "single source of truth" idea later stated in the
>>>>> proposal. I see Gus's arguments for decoupling these and am not strongly opposed, I
>>>>> just get a lurking feeling about it. Even if we don't do this, I would
>>>>> like this called out explicitly in the alternative approaches section as
>>>>> something that we considered and rejected, with details why.
>>>>> >>>>>>>
>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an additional
>>>>> call out here that all operations are GET because nodes cannot be changed
>>>>> at runtime.
>>>>> >>>>>>> CLARIFICATION: How does this interact with the previous
>>>>> OVERSEER preference role?
>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of available
>>>>> roles for a cluster. I _think_ this could be based on the version that the
>>>>> cluster is running? Would be useful to be able to interrogate a cluster in
>>>>> the future... we're seeing OOM issues on queries, can we add some query
>>>>> nodes? When were they introduced? I don't know what path this API should
>>>>> exist at.
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP
>>>>> document. Not sure if there's a better path that we could go for.
>>>>> >>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which
>>>>> parts are string literals and which parts are meant to be substituted by
>>>>> the operator? GET /api/cluster/roles/data would become GET
>>>>> /api/cluster/roles/${rolename} in our SIP/documentation.
>>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1
>>>>> should be GET /api/cluster/roles/${nodename} dropping the intermediate
>>>>> "nodes"
>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that
>>>>> intermediate "nodes" node.
>>>>> >>>>>>>
>>>>> >>>>>>> CLARIFICATION: Should listing roles require some permissions?
>>>>> Maybe this requirement is too fundamental to the operation of a cluster
>>>>> and everybody would have to be able to do it.
>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to
>>>>> treat roles? Implementation detail that the servers will figure out? Or
>>>>> strict guidance where the client needs to check where specific roles are
>>>>> before sending any further communication to the server?
>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that it
>>>>> can't fulfil? An overseer node gets a query or an update. A data node gets
>>>>> a collection creation request. Do they forward it on to an appropriate
>>>>> node, or do they reject it? Should this be configurable? If not, then it
>>>>> seems like lazy or poorly configured clients will defeat this isolation
>>>>> system quite easily.
>>>>> >>>>>>>
>>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when
>>>>> roles are added mean? I thought we established that they are not dynamic.
>>>>> >>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> Thanks,
>>>>> >>>>>>> Mike
>>>>> >>>>>>>
>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
>>>>> ichattopadhy...@gmail.com> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> Hi,
>>>>> >>>>>>>>
>>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>> >>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>> >>>>>>>>
>>>>> >>>>>>>> We also wish to add first class support for Query nodes that
>>>>> are used to process user queries by forwarding to data nodes,
>>>>> merging/aggregating them and presenting to users. This concept exists as
>>>>> first class citizens in most other search engines. This is a chance for
>>>>> Solr to catch up.
>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>> >>>>>>>>
>>>>> >>>>>>>> Regards,
>>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> http://www.needhamsoftware.com (work)
>>>>> >>> http://www.the111shift.com (play)
>>>>>
>>>>>
>>>>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
