Re: First class support for node roles

Gus Heck Fri, 03 Dec 2021 07:10:24 -0800

I really don't think we should have types of roles. Not negative/positive
and not strict/non-strict. You have a role or you don't. What that means is
up to the code implementing the role.


Roles should be free to configure a preference order (binary, n-ary, or
whatever, strict or loose), prohibit behavior, or enable behavior. In this
SIP I feel we should focus on how to identify which node has which role, how
to designate which roles a node has via config/params, and the APIs for
interacting with roles.

We should for example be able to support roles such as

PREFERRED_OVERSEER
DATA
NO_ROUTED_ALIAS  (just an example, not something I mean to suggest)
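
A minimal sketch of the "you have a role or you don't" idea, in Java. The
class name NodeRoles and the comma-separated input format are assumptions
made for illustration only; they are not taken from the SIP. The point is
that membership is a plain yes/no lookup, with no positive/negative or
strict/loose typing attached to the role name itself.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical holder for the roles configured on one node.
final class NodeRoles {
    private final Set<String> roles;

    private NodeRoles(Set<String> roles) {
        this.roles = Collections.unmodifiableSet(roles);
    }

    // Parses an assumed comma-separated list such as "DATA,PREFERRED_OVERSEER".
    // Blank entries are ignored; a null input yields an empty role set.
    static NodeRoles parse(String csv) {
        Set<String> parsed = new LinkedHashSet<>();
        if (csv != null) {
            for (String token : csv.split(",")) {
                String name = token.trim();
                if (!name.isEmpty()) {
                    parsed.add(name);
                }
            }
        }
        return new NodeRoles(parsed);
    }

    // The only question asked here: does the node have the role, yes or no?
    boolean has(String role) {
        return roles.contains(role);
    }

    Set<String> all() {
        return roles;
    }
}
```

What having (or not having) a given role means is deliberately left out of
this class; that would live in the code implementing the role.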

Details about role implementation should probably be discussed in a thread
about that role. Obviously we should think about the name carefully to
leave options open should we want to enhance things later, so maybe

OVERSEER_PREF  or just  OVERSEER

would be better since it merely reads that the node implements some sort
of preference or config regarding overseer... but all this can be decided
on a per-role basis.
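
To illustrate "what that means is up to the code implementing the role",
here is a hedged Java sketch in which the role code itself turns plain
yes/no membership into a loose preference order. The class
OverseerPreference, the method order, and the map of node name to role
names are all hypothetical; this is not how Solr's overseer election is
implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical role implementation: it alone decides what OVERSEER means.
final class OverseerPreference {
    static final String ROLE = "OVERSEER";

    // Returns node names with OVERSEER holders first, then the remaining
    // nodes, each group sorted by name. This is a loose preference:
    // nodes without the role stay in the list as fallbacks.
    static List<String> order(Map<String, Set<String>> rolesByNode) {
        List<String> nodes = new ArrayList<>(rolesByNode.keySet());
        nodes.sort(Comparator
            // false sorts before true, so role holders come first
            .comparing((String n) -> !rolesByNode.get(n).contains(ROLE))
            .thenComparing(Comparator.naturalOrder()));
        return nodes;
    }
}
```

A different role could read the same yes/no membership and prohibit
behavior outright instead of ordering candidates; nothing in the role
name or the membership lookup would need to change.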

On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <[email protected]> wrote:

> Negative roles have a place
>
> Example is overseer
>
> There are 3 possible choices for that role
>
> a) preferred: always be in front of the election queue
> b) on: not preferred, but can be an overseer if no preferred overseer
> nodes are available
> c) off: never become an overseer
>
> Today we only have options 'a' and 'b' . In a future ticket, we may
> implement C
>
> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <[email protected]> wrote:
>
>> Negative roles add a lot of complexity, I would really want to stay away
>> from them. That’s why I want strict roles up front. It’s maybe ok to push
>> this decision out, but it also seems like the sort of thing we should
>> consider at the start.
>>
>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <[email protected]> wrote:
>>
>>> Yes. Negative roles is not a bad idea. If I start a node for
>>> machine learning purposes, I wouldn't want that node to ever participate in
>>> overseer election
>>>
>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <[email protected]> wrote:
>>>
>>>> If we have non strict roles (like overseer), then it does make sense
>>>> to have negative roles.
>>>> That way I can define which are the two nodes that I'd prefer the
>>>> overseer to run on, and a few other nodes on which it should
>>>> definitely never run for various reasons. And in case these
>>>> "!overseer" are the only nodes left in the cluster, let the cluster
>>>> fail the same way it would if there were no data nodes available.
>>>>
>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <[email protected]>
>>>> wrote:
>>>> >>>
>>>> >>> With the Strict/Loose option and sensible defaults, users cannot
>>>> trip themselves up by default, but the option is there for people to tinker
>>>> and have an iron grip over their cluster.
>>>> >>
>>>> >>
>>>> >> +1 to sensible defaults so users don't trip themselves. The option
>>>> to tinker for tighter grip can be tackled later, either on a per role basis
>>>> or as a generic concept later.
>>>> >
>>>> >
>>>> > +1 - Can definitely be added later if we so desire, not needed for
>>>> this SIP
>>>> >
>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <
>>>> [email protected]> wrote:
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]> wrote:
>>>> >>>
>>>> >>> I think the key  is to let the roles have full control of the
>>>> implications of having/not having that role. No need for even a
>>>> strict/loose designation. The question of do you have the role is yes/no
>>>> with no logic to guess if the role is implied or not, The question of will
>>>> it come up with the role is "have_explicit ? use_explicit : use_defaults".
>>>> >>>
>>>> >>> Once you figure out who has a role (or not) what that means is up
>>>> to the role code.
>>>> >>>
>>>> >>> Corollary: we don't have to change the way overseer works in this
>>>> SIP. We can rework it or not as we see fit separately.
>>>> >>
>>>> >>
>>>> >> +1
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>> Only thing we need to do is find a wording that makes the above
>>>> clear on first read through the SIP :)
>>>> >>>
>>>> >>> -Gus
>>>> >>>
>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <
>>>> [email protected]> wrote:
>>>> >>>>>
>>>> >>>>> This doesn't really address my concern around what happens if all
>>>> of our existing OVERSEER candidates are down. When at least one of them is
>>>> up, the overseer will go there, and that is good and expected. But what
>>>> happens if all of the overseer eligible nodes are down. Your comment, and
>>>> the old system, would imply that the overseer election goes to some other
>>>> unrelated, untagged node. I disagree with this implementation choice. This
>>>> sounds like something role specific to determine, but I would like to see
>>>> us be more strict about it. I don't want cores leaking out of my data
>>>> roles, I don't want query processing to leak out of my "query" nodes or
>>>> whatever. Overseer shouldn't be special in this regard.
>>>> >>>>
>>>> >>>>
>>>> >>>> I'm very strongly in favor of not letting users design a system in
>>>> which the cluster can be "live" without an overseer. I understand that the
>>>> overseer can be taxing to the cluster, but honestly what is the point of
>>>> having an untaxed cluster that doesn't have an overseer? I can see
>>>> arguments for the other roles to be stricter about this, but there are also
>>>> a lot of users who wouldn't want those to be strict either (like "query"
>>>> nodes).
>>>> >>>>
>>>> >>>> Maybe we just put in stronger guarantees that if a non-overseer
>>>> role node HAS to be selected to become overseer, it will try to migrate the
>>>> overseer job to a node with the overseer role whenever one becomes live.
>>>> >>>>
>>>> >>>> So maybe we don't have special rules per role, but instead roles
>>>> can either be defined as "Strict" or "Loose" (better names likely exist),
>>>> and the roles come with a default (Overseer -> Loose, Data -> Strict, Query
>>>> -> Loose, etc.). And it is up to each role to define how to behave when
>>>> running in LOOSE mode and a non-role node is used then a role node comes
>>>> online (like the overseer example given above).
>>>> >>>>
>>>> >>>> With the Strict/Loose option and sensible defaults, users cannot
>>>> trip themselves up by default, but the option is there for people to tinker
>>>> and have an iron grip over their cluster.
>>>> >>>>
>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]> wrote:
>>>> >>>>>
>>>> >>>>> Noble wrote:
>>>> >>>>> > We are not modifying the way the "overseer role" works today.
>>>> We are just changing the definition and standardizing the configuration &
>>>> discoverability
>>>> >>>>> Ishan wrote:
>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER role
>>>> (which currently stands for preferred overseer). We can take a stab at
>>>> refactoring it later.
>>>> >>>>>
>>>> >>>>> Grouping these two comments together, since I think they are
>>>> saying the same thing. I think this is part of my confusion. We have an old
>>>> system that doesn't work the way we want the new system to work. There may
>>>> be people already using the old system. What path do we offer for folks
>>>> using the old system to migrate to the new system? What happens if somebody
>>>> accidentally tries to use both systems at the same time?
>>>> >>>>>
>>>> >>>>> Ishan wrote:
>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER role]
>>>> are live, Solr guarantees that one of those nodes becomes the overseer.", I
>>>> meant to somewhat capture the current behaviour as the OVERSEER role
>>>> performs today. Do you see any inconsistency with this statement vs. what
>>>> it does today?
>>>> >>>>>
>>>> >>>>> This doesn't really address my concern around what happens if all
>>>> of our existing OVERSEER candidates are down. When at least one of them is
>>>> up, the overseer will go there, and that is good and expected. But what
>>>> happens if all of the overseer eligible nodes are down. Your comment, and
>>>> the old system, would imply that the overseer election goes to some other
>>>> unrelated, untagged node. I disagree with this implementation choice. This
>>>> sounds like something role specific to determine, but I would like to see
>>>> us be more strict about it. I don't want cores leaking out of my data
>>>> roles, I don't want query processing to leak out of my "query" nodes or
>>>> whatever. Overseer shouldn't be special in this regard.
>>>> >>>>>
>>>> >>>>> Noble wrote:
>>>> >>>>> > If we do that how do we know if xyz is a role or a node in the
>>>> following request?
>>>> >>>>>
>>>> >>>>> You're absolutely correct, thanks for pointing this out. Let's
>>>> leave it as is.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
>>>> [email protected]> wrote:
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]>
>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Replying to the top post in this thread because there has been
>>>> a lot of discussion and I don't want to look like I'm continuing any of
>>>> those particular threads.
>>>> >>>>>>>
>>>> >>>>>>> I finally had time to sit down and think about this with the
>>>> attention it deserves and am generally happy with how the conversation has
>>>> shaped the current proposal.
>>>> >>>>>>>
>>>> >>>>>>> GOOD: I think using system properties to define node roles is
>>>> fine and I like that data is the default role when not defined. I think it
>>>> is important to hold on to the guarantee that an active overseer will land
>>>> on an overseer node role.
>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for folks
>>>> using the current OVERSEER role. I am not sure that something can be done
>>>> automatically since they need to now specify new properties at startup.
>>>> Maybe we need to include loud warnings or support both approaches for a
>>>> time?
>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer nodes
>>>> fail, then it is implied the overseer will go to one of the data nodes. The
>>>> specific wording in the SIP - "When one or more such nodes are live, Solr
>>>> guarantees that one of those nodes become the overseer." implies to me that
>>>> failover could go from overseer1 to overseer2 to overseerN to random node.
>>>> I feel like we need to have some recording that there were dedicated
>>>> overseer nodes and stop the cascading failure instead of churning through
>>>> our data nodes.
>>>> >>>>>>>
>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope of
>>>> "coordinator" roles from a split query/indexing standpoint. I understand
>>>> that these are used as examples, but would like stronger language that new
>>>> roles should also go through their own SIP discussions.
>>>> >>>>>>>
>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node liveness
>>>> in two different places now. We have the live nodes and we have the node
>>>> roles stored in two different places in zookeeper and it feels like this
>>>> would lead to race conditions or split brain or other hard to diagnose bugs
>>>> when those two lists don't agree with each other. This also feels like it
>>>> contradicts the "single source of truth" idea later stated in the proposal.
>>>> I see Gus's arguments for decoupling these and am not strongly opposed, I
>>>> just get a lurking feeling about it. Even if we don't do this, I would like
>>>> this called out explicitly in the alternative approaches section as
>>>> something that we considered and rejected, with details why.
>>>> >>>>>>>
>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an additional
>>>> call out here that all operations are GET because nodes cannot be changed
>>>> at runtime.
>>>> >>>>>>> CLARIFICATION: How does this interact with the previous
>>>> OVERSEER preference role?
>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of available
>>>> roles for a cluster. I _think_ this could be based on the version that the
>>>> cluster is running? Would be useful to be able to interrogate a cluster in
>>>> the future... we're seeing OOM issues on queries, can we add some query
>>>> nodes? When were they introduced? I don't know what path this API should
>>>> exist at.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP
>>>> document. Not sure if there's a better path that we could go for.
>>>> >>>>>>
>>>> >>>>>>>
>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which parts
>>>> are string literals and which parts are meant to be substituted by the
>>>> operator? GET /api/cluster/roles/data would become GET
>>>> /api/cluster/roles/${rolename} in our SIP/documentation.
>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1
>>>> should be GET /api/cluster/roles/${nodename} dropping the intermediate
>>>> "nodes"
>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that
>>>> intermediate "nodes" node.
>>>> >>>>>>>
>>>> >>>>>>> CLARIFICATION: Should listing roles require some permissions?
>>>> Maybe this requirement is too fundamental to the operation of a cluster and
>>>> everybody would have to be able to do it.
>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to
>>>> treat roles? Implementation detail that the servers will figure out? Or
>>>> strict guidance where the client needs to check where specific roles are
>>>> before sending any further communication to the server?
>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that it
>>>> can't fulfil? An overseer node gets a query or an update. A data node gets
>>>> a collection creation request. Do they forward it on to an appropriate
>>>> node, or do they reject it? Should this be configurable? If not, then it
>>>> seems like lazy or poorly configured clients will defeat this isolation
>>>> system quite easily.
>>>> >>>>>>>
>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when
>>>> roles are added mean? I thought we established that they are not dynamic.
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>> Thanks,
>>>> >>>>>>> Mike
>>>> >>>>>>>
>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
>>>> [email protected]> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi,
>>>> >>>>>>>>
>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>> >>>>>>>>
>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>> >>>>>>>>
>>>> >>>>>>>> We also wish to add first class support for Query nodes that
>>>> are used to process user queries by forwarding to data nodes,
>>>> merging/aggregating them and presenting to users. This concept exists as
>>>> first class citizens in most other search engines. This is a chance for
>>>> Solr to catch up.
>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>> >>>>>>>>
>>>> >>>>>>>> Regards,
>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> --
>>>> >>> http://www.needhamsoftware.com (work)
>>>> >>> http://www.the111shift.com (play)
>>>>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
