Re: First class support for node roles

Noble Paul Sat, 04 Dec 2021 21:47:16 -0800

Typo fix: the default for the *data* role is *on*, not *allowed* (corrected inline below).


On Sun, Dec 5, 2021 at 2:37 PM Noble Paul <[email protected]> wrote:

> I recommend the following format for the role spec
>
> roles=<role-name>:<role-value>
>
> each role will have an enum of allowed values and a default value
>
>
>    - role name: *data*
>       - values: [*on*, *off*]
>       - default: *allowed*
>
>

   - default: *on*


>    - role name: *overseer*
>       - values: [*allowed*, *disallowed*, *preferred*]
>       - default: *allowed*
>    - role name: *coordinator*
>       - values: [*on*, *off*]
>       - default: *off*
>
>
> examples
> roles=data:on,overseer:allowed (This is redundant because it uses all the
> default values. If a node is started without any roles value this is the
> default behavior)
> roles=data:off,overseer:preferred (do not allow data, join overseer
> election at head)
> roles=coordinator:on,data:on (act as coordinator, but allow data; it's the
> same as roles=coordinator:on)
> roles=coordinator:on,data:off (act as coordinator, disallow data)
>
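
To make the format above concrete, here is a minimal sketch of how such a spec could be parsed and validated. It is illustrative only: `ROLE_SPECS` and `parse_roles` are hypothetical names, not anything in Solr, and the table just restates the enums and defaults listed above (with the corrected *data* default of *on*).

```python
# Hypothetical sketch; ROLE_SPECS and parse_roles are illustrative names,
# not Solr APIs. Values and defaults are copied from the proposal above.
ROLE_SPECS = {
    "data": {"values": ("on", "off"), "default": "on"},
    "overseer": {"values": ("allowed", "disallowed", "preferred"),
                 "default": "allowed"},
    "coordinator": {"values": ("on", "off"), "default": "off"},
}


def parse_roles(spec=""):
    """Parse 'roles=<name>:<value>,...' into a dict, filling in defaults."""
    if spec.startswith("roles="):
        spec = spec[len("roles="):]
    # Start from the defaults so an empty spec yields the default behavior.
    roles = {name: s["default"] for name, s in ROLE_SPECS.items()}
    for item in filter(None, (part.strip() for part in spec.split(","))):
        name, sep, value = item.partition(":")
        name, value = name.strip(), value.strip()
        if not sep:
            raise ValueError(f"expected <role-name>:<role-value>, got {item!r}")
        if name not in ROLE_SPECS:
            raise ValueError(f"unknown role: {name!r}")
        if value not in ROLE_SPECS[name]["values"]:
            raise ValueError(f"invalid value {value!r} for role {name!r}")
        roles[name] = value
    return roles
```

With this sketch, `parse_roles("roles=data:on,overseer:allowed")` returns the same mapping as `parse_roles("")`, which matches the "redundant" example above, and `parse_roles("roles=coordinator:on")` equals `parse_roles("roles=coordinator:on,data:on")`.
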
>
> On Sun, Dec 5, 2021 at 11:01 AM Ilan Ginzburg <[email protected]> wrote:
>
>> If we go with no negative node roles and overseer node role is not strict
>> (i.e. it’s a "preferred overseer"), then one would need to define a second
>> node role "no_overseer" to explicitly exclude a node from ever becoming
>> overseer (which I think is a useful feature until we switch the cluster
>> default to not using the overseer), plus the implementation of these two
>> node roles will obviously be coupled (and what if a node has both defined?).
>>
>> I prefer strict node roles.
>> Maybe we could have node roles with [optional] parameters to let the node
>> role implementation decide ?
>> The overseer node role for example could have one of 3 values defined for
>> each node: “preferred” (default, equivalent to the existing overseer role),
>> "accepted" (equivalent to currently not defining the overseer role) and
>> "no_way" (does not exist today).
>>
>> This could be useful in other contexts. A node role “data” could be
>> “fast” or “slow” depending on type of local persistent storage for example…
>>
>> Ilan
>>
>> On Fri 3 Dec 2021 at 16:10, Gus Heck <[email protected]> wrote:
>>
>>> I really don't think we should have types of roles. Not
>>> negative/positive and not strict/non-strict. You have a role or you don't.
>>> What that means is up to the code implementing the role.
>>>
>>> Roles should be free to configure a preference order (binary, or n-ary
>>> or whatever, strict or loose), prohibit behavior, or enable behavior. In
>>> this SIP I feel we should focus on how to identify what node has what role,
>>> how to designate what roles a node has via config/params, and the APIs for
>>> interacting with roles.
>>>
>>> We should for example be able to support roles such as
>>>
>>> PREFERRED_OVERSEER
>>> DATA
>>> NO_ROUTED_ALIAS  (just an example, not something I mean to suggest)
>>>
>>> Details about role implementation should probably be discussed in a
>>> thread about that role.  Obviously we should think about the name carefully
>>> to leave options open should we want to enhance things later so maybe
>>>
>>> OVERSEER_PREF  or just  OVERSEER
>>>
>>> would be better since it merely reads that the node implements some
>>> sort of preference or config regarding overseer... but all this can be
>>> decided on a per role basis
>>>
>>> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <[email protected]> wrote:
>>>
>>>> Negative roles have a place
>>>>
>>>> Example is overseer
>>>>
>>>> There are 3 possible choices for that role
>>>>
>>>> a) preferred: always be in front of the election queue
>>>> b) on: not preferred, but can be an overseer if no preferred overseer
>>>> nodes are available
>>>> c) off: never become an overseer
>>>>
>>>> Today we only have options 'a' and 'b'. In a future ticket, we may
>>>> implement 'c'
>>>>
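
A rough sketch of how the three choices above could order election candidates. This is a simplification under stated assumptions: `overseer_candidates` is a hypothetical helper, the real election runs through a ZooKeeper queue rather than a sorted list, and the tie-break by node name is only there to make the result deterministic.

```python
# Hypothetical sketch; not how Solr's overseer election is implemented.
# Role values follow the three choices above: "preferred", "on", "off".
def overseer_candidates(live_nodes):
    """Return eligible node names in election order.

    live_nodes maps node name -> overseer role value.
    """
    rank = {"preferred": 0, "on": 1}  # "off" has no rank: never eligible
    eligible = [(rank[value], name)
                for name, value in live_nodes.items()
                if value in rank]
    # Preferred nodes sort to the front; names break ties deterministically.
    return [name for _, name in sorted(eligible)]
```

For example, `overseer_candidates({"n1": "on", "n2": "preferred", "n3": "off"})` puts `n2` ahead of `n1` and leaves `n3` out entirely. If only "off" nodes are live, the list is empty and no overseer can be chosen.
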
>>>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <[email protected]> wrote:
>>>>
>>>>> Negative roles add a lot of complexity, I would really want to stay
>>>>> away from them. That’s why I want strict roles up front. It’s maybe ok to
>>>>> push this decision out, but it also seems like the sort of thing we should
>>>>> consider at the start.
>>>>>
>>>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Yes. Negative roles are not a bad idea. If I start a node for
>>>>>> machine learning purposes, I wouldn't want that node to ever
>>>>>> participate in overseer election
>>>>>>
>>>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> If we have non strict roles (like overseer), then it does make sense
>>>>>>> to have negative roles.
>>>>>>> That way I can define which are the two nodes that I'd prefer the
>>>>>>> overseer to run on, and a few other nodes on which it should
>>>>>>> definitely never run for various reasons. And in case these
>>>>>>> "!overseer" are the only nodes left in the cluster, let the cluster
>>>>>>> fail the same way it would if there were no data nodes available.
>>>>>>>
>>>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <
>>>>>>> [email protected]> wrote:
>>>>>>> >>>
>>>>>>> >>> With the Strict/Loose option and sensible defaults, users cannot
>>>>>>> trip themselves up by default, but the option is there for people to
>>>>>>> tinker and have an iron grip over their cluster.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> +1 to sensible defaults so users don't trip themselves. The
>>>>>>> option to tinker for tighter grip can be tackled later, either on a per
>>>>>>> role basis or as a generic concept later.
>>>>>>> >
>>>>>>> >
>>>>>>> > +1 - Can definitely be added later if we so desire, not needed for
>>>>>>> this SIP
>>>>>>> >
>>>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <
>>>>>>> [email protected]> wrote:
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <[email protected]>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> I think the key is to let the roles have full control of the
>>>>>>> implications of having/not having that role. No need for even a
>>>>>>> strict/loose designation. The question of "do you have the role" is
>>>>>>> yes/no, with no logic to guess if the role is implied or not. The
>>>>>>> question of "will it come up with the role" is
>>>>>>> "have_explicit ? use_explicit : use_defaults".
>>>>>>> >>>
>>>>>>> >>> Once you figure out who has a role (or not) what that means is
>>>>>>> up to the role code.
>>>>>>> >>>
>>>>>>> >>> Corollary: we don't have to change the way overseer works in
>>>>>>> this SIP. We can rework it or not as we see fit separately.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> +1
>>>>>>> >>
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> Only thing we need to do is find a wording that makes the above
>>>>>>> clear on first read through the SIP :)
>>>>>>> >>>
>>>>>>> >>> -Gus
>>>>>>> >>>
>>>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <
>>>>>>> [email protected]> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> This doesn't really address my concern around what happens if
>>>>>>> all of our existing OVERSEER candidates are down. When at least one of
>>>>>>> them is up, the overseer will go there, and that is good and expected.
>>>>>>> But what happens if all of the overseer eligible nodes are down? Your
>>>>>>> comment, and the old system, would imply that the overseer election
>>>>>>> goes to some other unrelated, untagged node. I disagree with this
>>>>>>> implementation choice. This sounds like something role specific to
>>>>>>> determine, but I would like to see us be more strict about it. I don't
>>>>>>> want cores leaking out of my data roles, I don't want query processing
>>>>>>> to leak out of my "query" nodes or whatever. Overseer shouldn't be
>>>>>>> special in this regard.
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> I'm very strongly in favor of not letting users design a system
>>>>>>> in which the cluster can be "live" without an overseer. I understand
>>>>>>> that the overseer can be taxing to the cluster, but honestly what is
>>>>>>> the point of having an untaxed cluster that doesn't have an overseer?
>>>>>>> I can see arguments for the other roles to be stricter about this, but
>>>>>>> there are also a lot of users who wouldn't want those to be strict
>>>>>>> either (like "query" nodes).
>>>>>>> >>>>
>>>>>>> >>>> Maybe we just put in stronger guarantees that if a non-overseer
>>>>>>> role node HAS to be selected to become overseer, it will try to migrate
>>>>>>> the overseer job to a node with the overseer role whenever one becomes
>>>>>>> live.
>>>>>>> >>>>
>>>>>>> >>>> So maybe we don't have special rules per role, but instead
>>>>>>> roles can either be defined as "Strict" or "Loose" (better names likely
>>>>>>> exist), and the roles come with a default (Overseer -> Loose, Data ->
>>>>>>> Strict, Query -> Loose, etc.). And it is up to each role to define how
>>>>>>> to behave when running in LOOSE mode and a non-role node is used, then
>>>>>>> a role node comes online (like the overseer example given above).
>>>>>>> >>>>
>>>>>>> >>>> With the Strict/Loose option and sensible defaults, users
>>>>>>> cannot trip themselves up by default, but the option is there for
>>>>>>> people to tinker and have an iron grip over their cluster.
>>>>>>> >>>>
>>>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <[email protected]>
>>>>>>> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> Noble wrote:
>>>>>>> >>>>> > We are not modifying the way the "overseer role" works
>>>>>>> today. We are just changing the definition and standardizing the
>>>>>>> configuration & discoverability
>>>>>>> >>>>> Ishan wrote:
>>>>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER
>>>>>>> role (which currently stands for preferred overseer). We can take a
>>>>>>> stab at refactoring it later.
>>>>>>> >>>>>
>>>>>>> >>>>> Grouping these two comments together, since I think they are
>>>>>>> saying the same thing. I think this is part of my confusion. We have an
>>>>>>> old system that doesn't work the way we want the new system to work.
>>>>>>> There may be people already using the old system. What path do we offer
>>>>>>> for folks using the old system to migrate to the new system? What
>>>>>>> happens if somebody accidentally tries to use both systems at the same
>>>>>>> time?
>>>>>>> >>>>>
>>>>>>> >>>>> Ishan wrote:
>>>>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER
>>>>>>> role] are live, Solr guarantees that one of those nodes becomes the
>>>>>>> overseer.", I meant to somewhat capture the current behaviour as the
>>>>>>> OVERSEER role performs today. Do you see any inconsistency with this
>>>>>>> statement vs. what it does today?
>>>>>>> >>>>>
>>>>>>> >>>>> This doesn't really address my concern around what happens if
>>>>>>> all of our existing OVERSEER candidates are down. When at least one of
>>>>>>> them is up, the overseer will go there, and that is good and expected.
>>>>>>> But what happens if all of the overseer eligible nodes are down? Your
>>>>>>> comment, and the old system, would imply that the overseer election
>>>>>>> goes to some other unrelated, untagged node. I disagree with this
>>>>>>> implementation choice. This sounds like something role specific to
>>>>>>> determine, but I would like to see us be more strict about it. I don't
>>>>>>> want cores leaking out of my data roles, I don't want query processing
>>>>>>> to leak out of my "query" nodes or whatever. Overseer shouldn't be
>>>>>>> special in this regard.
>>>>>>> >>>>>
>>>>>>> >>>>> Noble wrote:
>>>>>>> >>>>> > If we do that how do we know if xyz is a role or a node in
>>>>>>> the following request?
>>>>>>> >>>>>
>>>>>>> >>>>> You're absolutely correct, thanks for pointing this out. Let's
>>>>>>> leave it as is.
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
>>>>>>> [email protected]> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <[email protected]>
>>>>>>> wrote:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Replying to the top post in this thread because there has
>>>>>>> been a lot of discussion and I don't want to look like I'm continuing
>>>>>>> any of those particular threads.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> I finally had time to sit down and think about this with the
>>>>>>> attention it deserves and am generally happy with how the conversation
>>>>>>> has shaped the current proposal.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> GOOD: I think using system properties to define node roles
>>>>>>> is fine and I like that data is the default role when not defined. I
>>>>>>> think it is important to hold on to the guarantee that an active
>>>>>>> overseer will land on an overseer node role.
>>>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for
>>>>>>> folks using the current OVERSEER role. I am not sure that something can
>>>>>>> be done automatically since they need to now specify new properties at
>>>>>>> startup. Maybe we need to include loud warnings or support both
>>>>>>> approaches for a time?
>>>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer
>>>>>>> nodes fail, then it is implied the overseer will go to one of the data
>>>>>>> nodes. The specific wording in the SIP - "When one or more such nodes
>>>>>>> are live, Solr guarantees that one of those nodes become the overseer."
>>>>>>> implies to me that failover could go from overseer1 to overseer2 to
>>>>>>> overseerN to random node. I feel like we need to have some recording
>>>>>>> that there were dedicated overseer nodes and stop the cascading failure
>>>>>>> instead of churning through our data nodes.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope
>>>>>>> of "coordinator" roles from a split query/indexing standpoint. I
>>>>>>> understand that these are used as examples, but would like stronger
>>>>>>> language that new roles should also go through their own SIP
>>>>>>> discussions.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node
>>>>>>> liveness in two different places now. We have the live nodes and we have
>>>>>>> the node roles stored in two different places in ZooKeeper and it feels
>>>>>>> like this would lead to race conditions or split brain or other hard to
>>>>>>> diagnose bugs when those two lists don't agree with each other. This
>>>>>>> also feels like it contradicts the "single source of truth" idea later
>>>>>>> stated in the proposal. I see Gus's arguments for decoupling these and
>>>>>>> am not strongly opposed, I just get a lurking feeling about it. Even if
>>>>>>> we don't do this, I would like this called out explicitly in the
>>>>>>> alternative approaches section as something that we considered and
>>>>>>> rejected, with details why.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an additional
>>>>>>> call out here that all operations are GET because nodes cannot be
>>>>>>> changed at runtime.
>>>>>>> >>>>>>> CLARIFICATION: How does this interact with the previous
>>>>>>> OVERSEER preference role?
>>>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of
>>>>>>> available roles for a cluster. I _think_ this could be based on the
>>>>>>> version that the cluster is running? Would be useful to be able to
>>>>>>> interrogate a cluster in the future... we're seeing OOM issues on
>>>>>>> queries, can we add some query nodes? When were they introduced? I
>>>>>>> don't know what path this API should exist at.
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP
>>>>>>> document. Not sure if there's a better path that we could go for.
>>>>>>> >>>>>>
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which
>>>>>>> parts are string literals and which parts are meant to be substituted by
>>>>>>> the operator? GET /api/cluster/roles/data would become GET
>>>>>>> /api/cluster/roles/${rolename} in our SIP/documentation.
>>>>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1
>>>>>>> should be GET /api/cluster/roles/${nodename} dropping the intermediate
>>>>>>> "nodes"
>>>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that
>>>>>>> intermediate "nodes" node.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> CLARIFICATION: Should listing roles require some
>>>>>>> permissions? Maybe this requirement is too fundamental to the operation
>>>>>>> of a cluster and everybody would have to be able to do it.
>>>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to
>>>>>>> treat roles? Implementation detail that the servers will figure out? Or
>>>>>>> strict guidance where the client needs to check where specific roles are
>>>>>>> before sending any further communication to the server?
>>>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that
>>>>>>> it can't fulfil? An overseer node gets a query or an update. A data node
>>>>>>> gets a collection creation request. Do they forward it on to an
>>>>>>> appropriate node, or do they reject it? Should this be configurable? If
>>>>>>> not, then it seems like lazy or poorly configured clients will defeat
>>>>>>> this isolation system quite easily.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when
>>>>>>> roles are added mean? I thought we established that they are not
>>>>>>> dynamic.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Thanks,
>>>>>>> >>>>>>> Mike
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
>>>>>>> [email protected]> wrote:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Hi,
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>>> >>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> We also wish to add first class support for Query nodes
>>>>>>> that are used to process user queries by forwarding to data nodes,
>>>>>>> merging/aggregating them and presenting to users. This concept exists as
>>>>>>> first class citizens in most other search engines. This is a chance for
>>>>>>> Solr to catch up.
>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Regards,
>>>>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> --
>>>>>>> >>> http://www.needhamsoftware.com (work)
>>>>>>> >>> http://www.the111shift.com (play)
>>>>>>>
>>>>>>>
>>>>>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>
>
> --
> -----------------------------------------------------
> Noble Paul
>


-- 
-----------------------------------------------------
Noble Paul
