Re: First class support for node roles

Gus Heck Sat, 04 Dec 2021 21:47:46 -0800

I like this in that it's an example of how the overseer might be extended
without creating a new role :)


Not entirely sure if I'm for or against an enum implementation here, but it
makes me a bit nervous. Enums with complex logic can quickly become
difficult to unit test (especially if one wanted to write a mock-object-based
test, something I think we should use a bit more than we do).

I would tend to think of a class to represent and collect role-related
functionality, one that perhaps has methods that receive the request or
other key objects, and thus could be tested without standing up an entire
server. (I'm not against also having them exercised in a few integration
tests.) But the more we can avoid interleaving logic directly within
DispatchFilter, HttpSolrCall, etc., the better.

So I guess I'm somewhat biased against any enum with more than a couple of
properties, and I definitely don't want to wind up hanging lots of methods
off of one. Better to use them to consume a configuration value and then
instantiate a class that really holds the logic and data. I like them for
constraining values and easy string conversion, but the more they look
like classes, the more I'd rather have a class.
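A minimal Java sketch of the split described above, assuming the overseer values from Noble's proposal below (allowed, disallowed, preferred). OverseerMode, OverseerRolePolicy and their methods are hypothetical names, not existing Solr classes: the enum only validates and converts the configuration string, while the behavior lives in a plain class that a unit test can construct (or mock) without starting a server.

```java
import java.util.Locale;

class OverseerRoleDemo {
  public static void main(String[] args) {
    OverseerRolePolicy policy = OverseerRolePolicy.fromConfig("preferred");
    System.out.println(policy.mayJoinElection() + " " + policy.joinAtHeadOfQueue());
  }
}

// Hypothetical: the enum only constrains the allowed values and converts them from strings.
enum OverseerMode {
  PREFERRED, ALLOWED, DISALLOWED;

  static OverseerMode fromConfig(String value) {
    // valueOf throws IllegalArgumentException for values outside the enum
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}

// Hypothetical class holding the role's behavior; plain Java, so it can be
// unit tested (or mocked) without standing up a server.
class OverseerRolePolicy {
  private final OverseerMode mode;

  OverseerRolePolicy(OverseerMode mode) {
    this.mode = mode;
  }

  static OverseerRolePolicy fromConfig(String value) {
    return new OverseerRolePolicy(OverseerMode.fromConfig(value));
  }

  boolean mayJoinElection() {
    return mode != OverseerMode.DISALLOWED;
  }

  boolean joinAtHeadOfQueue() {
    return mode == OverseerMode.PREFERRED;
  }
}
```

Adding another behavior means adding a method to the policy class rather than to the enum, which keeps the enum limited to constraining values and string conversion.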

-Gus

On Sat, Dec 4, 2021 at 10:37 PM Noble Paul <noble.p...@gmail.com> wrote:

> I recommend the following format for the role spec
>
> roles=<role-name>:<role-value>
>
> Each role will have an enum of allowed values and a default value.
>
>
>    - role name: *data*
>       - values: [*on*, *off*]
>       - default: *on*
>    - role name: *overseer*
>       - values: [*allowed*, *disallowed*, *preferred*]
>       - default: *allowed*
>    - role name: *coordinator*
>       - values: [*on*, *off*]
>       - default: *off*
>
>
> Examples:
> roles=data:on,overseer:allowed (This is redundant because it uses all the
> default values. If a node is started without any roles value, this is the
> default behavior.)
> roles=data:off,overseer:preferred (do not allow data, join overseer
> election at the head)
> roles=coordinator:on,data:on (role as coordinator, but allow data; it's
> the same as roles=coordinator:on)
> roles=coordinator:on,data:off (role as coordinator, disallow data)
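A sketch of how the proposed roles=<role-name>:<role-value> format could be parsed and validated, assuming the allowed values and defaults listed above, with data defaulting to on as the examples imply. RoleSpec and parse are hypothetical names, and rejecting unknown roles, invalid values and duplicate role names is an assumption about the desired behavior, not something the proposal specifies.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RoleSpecDemo {
  public static void main(String[] args) {
    System.out.println(RoleSpec.parse("data:off,overseer:preferred"));
  }
}

// Hypothetical parser for the proposed "<role-name>:<role-value>,..." format.
class RoleSpec {
  private static final Map<String, List<String>> ALLOWED = new LinkedHashMap<>();
  private static final Map<String, String> DEFAULTS = new LinkedHashMap<>();

  static {
    // Allowed values and defaults as listed in the proposal.
    define("data", "on", "on", "off");
    define("overseer", "allowed", "allowed", "disallowed", "preferred");
    define("coordinator", "off", "on", "off");
  }

  private static void define(String role, String defaultValue, String... values) {
    ALLOWED.put(role, List.of(values));
    DEFAULTS.put(role, defaultValue);
  }

  /** Parses the spec and fills in defaults for every role that is not mentioned. */
  static Map<String, String> parse(String spec) {
    Map<String, String> result = new LinkedHashMap<>(DEFAULTS);
    if (spec == null || spec.isBlank()) {
      return result; // no roles value: every role takes its default
    }
    Set<String> seen = new HashSet<>();
    for (String entry : spec.split(",")) {
      String[] kv = entry.trim().split(":", 2);
      if (kv.length != 2) {
        throw new IllegalArgumentException("Expected <role-name>:<role-value>, got: " + entry);
      }
      String role = kv[0].trim();
      String value = kv[1].trim();
      List<String> allowed = ALLOWED.get(role);
      if (allowed == null) {
        throw new IllegalArgumentException("Unknown role: " + role);
      }
      if (!allowed.contains(value)) {
        throw new IllegalArgumentException(
            "Invalid value '" + value + "' for role " + role + "; allowed: " + allowed);
      }
      if (!seen.add(role)) {
        throw new IllegalArgumentException("Role specified more than once: " + role);
      }
      result.put(role, value);
    }
    return result;
  }
}
```

With this sketch, RoleSpec.parse("data:off,overseer:preferred") yields {data=off, overseer=preferred, coordinator=off}, and RoleSpec.parse("coordinator:on") equals RoleSpec.parse("coordinator:on,data:on"), matching the third example.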
>
>
> On Sun, Dec 5, 2021 at 11:01 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
>> If we go with no negative node roles and the overseer node role is not strict
>> (i.e. it’s a "preferred overseer"), then one would need to define a second
>> node role "no_overseer" to explicitly exclude a node from ever becoming
>> overseer (which I think is a useful feature until we switch the cluster
>> default to not using the overseer), plus the implementation of these two
>> node roles will obviously be coupled (and what if a node has both defined?).
>>
>> I prefer strict node roles.
>> Maybe we could have node roles with [optional] parameters to let the node
>> role implementation decide?
>> The overseer node role for example could have one of 3 values defined for
>> each node: “preferred” (default, equivalent to the existing overseer role),
>> "accepted" (equivalent to currently not defining the overseer role) and
>> "no_way" (does not exist today).
>>
>> This could be useful in other contexts. A node role “data” could be
>> “fast” or “slow” depending on the type of local persistent storage, for example.
>>
>> Ilan
>>
>> On Fri 3 Dec 2021 at 16:10, Gus Heck <gus.h...@gmail.com> wrote:
>>
>>> I really don't think we should have types of roles. Not
>>> negative/positive and not strict/non-strict. You have a role or you don't.
>>> What that means is up to the code implementing the role.
>>>
>>> Roles should be free to configure a preference order (binary, n-ary,
>>> or whatever; strict or loose), prohibit behavior, or enable behavior. In
>>> this SIP I feel we should focus on how to identify which node has which role,
>>> how to designate what roles a node has via config/params, and the APIs for
>>> interacting with roles.
>>>
>>> We should for example be able to support roles such as
>>>
>>> PREFERRED_OVERSEER
>>> DATA
>>> NO_ROUTED_ALIAS  (just an example, not something I mean to suggest)
>>>
>>> Details about role implementation should probably be discussed in a
>>> thread about that role. Obviously we should think about the name carefully
>>> to leave options open should we want to enhance things later, so maybe
>>>
>>> OVERSEER_PREF  or just  OVERSEER
>>>
>>> would be better, since it merely indicates that the node implements some
>>> sort of preference or config regarding overseer... but all this can be
>>> decided on a per-role basis.
>>>
>>> On Thu, Dec 2, 2021 at 11:44 PM Noble Paul <noble.p...@gmail.com> wrote:
>>>
>>>> Negative roles have a place.
>>>>
>>>> An example is overseer.
>>>>
>>>> There are 3 possible choices for that role:
>>>>
>>>> a) preferred: always be in front of the election queue
>>>> b) on: not preferred, but can be an overseer if no preferred overseer
>>>> nodes are available
>>>> c) off: never become an overseer
>>>>
>>>> Today we only have options 'a' and 'b'. In a future ticket, we may
>>>> implement 'c'.
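The three overseer choices above can be illustrated as an ordering over election candidates. This is a simplified, hypothetical sketch (OverseerChoice and ElectionOrder are made-up names), not a description of how Solr's ZooKeeper-based election queue is implemented: preferred nodes go first, on nodes follow, each group keeps the order in which nodes joined, and off nodes are never candidates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ElectionOrderDemo {
  public static void main(String[] args) {
    Map<String, OverseerChoice> joinOrder = new LinkedHashMap<>();
    joinOrder.put("node1", OverseerChoice.ON);
    joinOrder.put("node2", OverseerChoice.PREFERRED);
    joinOrder.put("node3", OverseerChoice.OFF);
    System.out.println(ElectionOrder.candidates(joinOrder)); // [node2, node1]
  }
}

// Hypothetical names for the a/b/c choices described above.
enum OverseerChoice { PREFERRED, ON, OFF }

class ElectionOrder {
  /**
   * Returns candidates in election order: PREFERRED nodes first, then ON nodes,
   * each group in join order. OFF nodes are never added.
   */
  static List<String> candidates(Map<String, OverseerChoice> nodesInJoinOrder) {
    List<String> result = new ArrayList<>();
    for (OverseerChoice wanted : List.of(OverseerChoice.PREFERRED, OverseerChoice.ON)) {
      nodesInJoinOrder.forEach((node, choice) -> {
        if (choice == wanted) {
          result.add(node);
        }
      });
    }
    return result;
  }
}
```

Option 'c' shows up here only as the absence of a node from the list; if every live node is OFF, the result is empty and no overseer is elected.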
>>>>
>>>> On Fri, Dec 3, 2021, 11:59 AM Mike Drob <md...@mdrob.com> wrote:
>>>>
>>>>> Negative roles add a lot of complexity, I would really want to stay
>>>>> away from them. That’s why I want strict roles up front. It’s maybe ok to
>>>>> push this decision out, but it also seems like the sort of thing we should
>>>>> consider at the start.
>>>>>
>>>>> On Thu, Dec 2, 2021 at 5:52 PM Noble Paul <noble.p...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Yes. Negative roles are not a bad idea. If I start a node for
>>>>>> machine learning purposes, I wouldn't want that node to ever participate
>>>>>> in overseer election.
>>>>>>
>>>>>> On Fri, Dec 3, 2021, 6:50 AM Ilan Ginzburg <ilans...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> If we have non strict roles (like overseer), then it does make sense
>>>>>>> to have negative roles.
>>>>>>> That way I can define which are the two nodes that I'd prefer the
>>>>>>> overseer to run on, and a few other nodes on which it should
>>>>>>> definitely never run for various reasons. And in case these
>>>>>>> "!overseer" are the only nodes left in the cluster, let the cluster
>>>>>>> fail the same way it would if there were no data nodes available.
>>>>>>>
>>>>>>> On Thu, Dec 2, 2021 at 5:11 PM Houston Putman <
>>>>>>> houstonput...@gmail.com> wrote:
>>>>>>> >>>
>>>>>>> >>> With the Strict/Loose option and sensible defaults, users cannot
>>>>>>> trip themselves up by default, but the option is there for people to
>>>>>>> tinker and have an iron grip over their cluster.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> +1 to sensible defaults so users don't trip themselves up. The
>>>>>>> option to tinker for a tighter grip can be tackled later, either on a
>>>>>>> per-role basis or as a generic concept.
>>>>>>> >
>>>>>>> >
>>>>>>> > +1 - Can definitely be added later if we so desire, not needed for
>>>>>>> this SIP
>>>>>>> >
>>>>>>> > On Wed, Dec 1, 2021 at 9:14 PM Ishan Chattopadhyaya <
>>>>>>> ichattopadhy...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Thu, Dec 2, 2021 at 1:31 AM Gus Heck <gus.h...@gmail.com>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> I think the key is to let the roles have full control of the
>>>>>>> implications of having/not having that role. No need for even a
>>>>>>> strict/loose designation. The question of whether you have the role is
>>>>>>> yes/no, with no logic to guess whether the role is implied or not. The
>>>>>>> question of whether it will come up with the role is
>>>>>>> "have_explicit ? use_explicit : use_defaults".
>>>>>>> >>>
>>>>>>> >>> Once you figure out who has a role (or not), what that means is
>>>>>>> up to the role code.
>>>>>>> >>>
>>>>>>> >>> Corollary: we don't have to change the way overseer works in
>>>>>>> this SIP. We can rework it or not as we see fit separately.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> +1
>>>>>>> >>
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> The only thing we need to do is find wording that makes the above
>>>>>>> clear on a first read-through of the SIP :)
>>>>>>> >>>
>>>>>>> >>> -Gus
>>>>>>> >>>
>>>>>>> >>> On Wed, Dec 1, 2021 at 2:50 PM Houston Putman <
>>>>>>> houstonput...@gmail.com> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> This doesn't really address my concern around what happens if
>>>>>>> all of our existing OVERSEER candidates are down. When at least one of
>>>>>>> them is up, the overseer will go there, and that is good and expected.
>>>>>>> But what happens if all of the overseer-eligible nodes are down? Your
>>>>>>> comment, and the old system, would imply that the overseer election goes
>>>>>>> to some other unrelated, untagged node. I disagree with this
>>>>>>> implementation choice. This sounds like something role-specific to
>>>>>>> determine, but I would like to see us be more strict about it. I don't
>>>>>>> want cores leaking out of my data roles, I don't want query processing
>>>>>>> to leak out of my "query" nodes or whatever. Overseer shouldn't be
>>>>>>> special in this regard.
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> I'm very strongly in favor of not letting users design a system
>>>>>>> in which the cluster can be "live" without an overseer. I understand
>>>>>>> that the overseer can be taxing to the cluster, but honestly, what is
>>>>>>> the point of having an untaxed cluster that doesn't have an overseer? I
>>>>>>> can see arguments for the other roles to be stricter about this, but
>>>>>>> there are also a lot of users who wouldn't want those to be strict
>>>>>>> either (like "query" nodes).
>>>>>>> >>>>
>>>>>>> >>>> Maybe we just put in stronger guarantees that if a non-overseer-role
>>>>>>> node HAS to be selected to become overseer, it will try to migrate the
>>>>>>> overseer job to a node with the overseer role whenever one becomes live.
>>>>>>> >>>>
>>>>>>> >>>> So maybe we don't have special rules per role, but instead
>>>>>>> roles can be defined as either "Strict" or "Loose" (better names likely
>>>>>>> exist), and the roles come with a default (Overseer -> Loose, Data ->
>>>>>>> Strict, Query -> Loose, etc.). It is then up to each role to define how
>>>>>>> to behave when running in LOOSE mode, when a non-role node has been used
>>>>>>> and then a role node comes online (like the overseer example given above).
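Houston's Strict/Loose idea can be sketched as a selection function plus a hand-back check for the LOOSE case. RoleMode, RoleAssignment and the assumption that live nodes arrive in preference order are all hypothetical. In STRICT mode the job is left unassigned instead of leaking to an untagged node, which is the behavior Mike asks for; in LOOSE mode a fallback node is used until a role node comes back.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

class RoleAssignmentDemo {
  public static void main(String[] args) {
    List<String> live = List.of("data1", "data2"); // both dedicated overseer nodes are down
    Set<String> overseerNodes = Set.of("ov1", "ov2");
    System.out.println(RoleAssignment.pick(live, overseerNodes, RoleMode.STRICT)); // Optional.empty
    System.out.println(RoleAssignment.pick(live, overseerNodes, RoleMode.LOOSE));  // Optional[data1]
  }
}

// Hypothetical names; "Strict"/"Loose" are the placeholder names from the thread.
enum RoleMode { STRICT, LOOSE }

class RoleAssignment {
  /** Picks a node for a role-bound job; liveNodes is assumed to be in preference order. */
  static Optional<String> pick(List<String> liveNodes, Set<String> roleNodes, RoleMode mode) {
    for (String node : liveNodes) {
      if (roleNodes.contains(node)) {
        return Optional.of(node);
      }
    }
    if (mode == RoleMode.LOOSE && !liveNodes.isEmpty()) {
      return Optional.of(liveNodes.get(0)); // fall back to an untagged node
    }
    return Optional.empty(); // STRICT: leave the job unassigned rather than leak it
  }

  /** LOOSE mode: should a non-role holder hand the job back now that a role node is live? */
  static boolean shouldHandBack(String currentHolder, List<String> liveNodes, Set<String> roleNodes) {
    return !roleNodes.contains(currentHolder) && liveNodes.stream().anyMatch(roleNodes::contains);
  }
}
```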
>>>>>>> >>>>
>>>>>>> >>>> With the Strict/Loose option and sensible defaults, users
>>>>>>> cannot trip themselves up by default, but the option is there for
>>>>>>> people to tinker and have an iron grip over their cluster.
>>>>>>> >>>>
>>>>>>> >>>> On Wed, Dec 1, 2021 at 2:24 PM Mike Drob <md...@mdrob.com>
>>>>>>> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> Noble wrote:
>>>>>>> >>>>> > We are not modifying the way the "overseer role" works
>>>>>>> today. We are just changing the definition and standardizing the
>>>>>>> configuration & discoverability
>>>>>>> >>>>> Ishan wrote:
>>>>>>> >>>>> > As of this SIP, we're not planning to modify the OVERSEER
>>>>>>> role (which currently stands for preferred overseer). We can take a
>>>>>>> stab at refactoring it later.
>>>>>>> >>>>>
>>>>>>> >>>>> Grouping these two comments together, since I think they are
>>>>>>> saying the same thing. I think this is part of my confusion. We have an
>>>>>>> old system that doesn't work the way we want the new system to work.
>>>>>>> There may be people already using the old system. What path do we offer
>>>>>>> for folks using the old system to migrate to the new system? What
>>>>>>> happens if somebody accidentally tries to use both systems at the same
>>>>>>> time?
>>>>>>> >>>>>
>>>>>>> >>>>> Ishan wrote:
>>>>>>> >>>>> > When I wrote "When one or more such nodes [with OVERSEER
>>>>>>> role] are live, Solr guarantees that one of those nodes becomes the
>>>>>>> overseer.", I meant to somewhat capture the current behaviour as the
>>>>>>> OVERSEER role performs today. Do you see any inconsistency with this
>>>>>>> statement vs. what it does today?
>>>>>>> >>>>>
>>>>>>> >>>>> This doesn't really address my concern around what happens if
>>>>>>> all of our existing OVERSEER candidates are down. When at least one of
>>>>>>> them is up, the overseer will go there, and that is good and expected.
>>>>>>> But what happens if all of the overseer-eligible nodes are down? Your
>>>>>>> comment, and the old system, would imply that the overseer election goes
>>>>>>> to some other unrelated, untagged node. I disagree with this
>>>>>>> implementation choice. This sounds like something role-specific to
>>>>>>> determine, but I would like to see us be more strict about it. I don't
>>>>>>> want cores leaking out of my data roles, I don't want query processing
>>>>>>> to leak out of my "query" nodes or whatever. Overseer shouldn't be
>>>>>>> special in this regard.
>>>>>>> >>>>>
>>>>>>> >>>>> Noble wrote:
>>>>>>> >>>>> > If we do that, how do we know whether xyz is a role or a node in
>>>>>>> the following request?
>>>>>>> >>>>>
>>>>>>> >>>>> You're absolutely correct, thanks for pointing this out. Let's
>>>>>>> leave it as is.
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>>
>>>>>>> >>>>> On Tue, Nov 30, 2021 at 2:21 PM Ishan Chattopadhyaya <
>>>>>>> ichattopadhy...@gmail.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Tue, Nov 30, 2021 at 12:53 AM Mike Drob <md...@mdrob.com>
>>>>>>> wrote:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Replying to the top post in this thread because there has
>>>>>>> been a lot of discussion and I don't want to look like I'm continuing
>>>>>>> any of those particular threads.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> I finally had time to sit down and think about this with the
>>>>>>> attention it deserves and am generally happy with how the conversation
>>>>>>> has shaped the current proposal.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> GOOD: I think using system properties to define node roles
>>>>>>> is fine, and I like that data is the default role when not defined. I
>>>>>>> think it is important to hold on to the guarantee that an active
>>>>>>> overseer will land on a node with the overseer role.
>>>>>>> >>>>>>> CHANGE REQUEST: I would like to see a migration path for
>>>>>>> folks using the current OVERSEER role. I am not sure that something can
>>>>>>> be done automatically, since they now need to specify new properties at
>>>>>>> startup. Maybe we need to include loud warnings or support both
>>>>>>> approaches for a time?
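One hypothetical way to support both approaches for a time is to resolve the legacy OVERSEER role and the new roles spec into a single effective value, and to surface loud warnings when the legacy role is used or the two disagree. The class name, the precedence rule (the new spec wins) and the warning texts below are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class OverseerMigrationDemo {
  public static void main(String[] args) {
    OverseerMigration m = new OverseerMigration(true, "disallowed");
    System.out.println(m.effectiveMode + " " + m.warnings);
  }
}

/**
 * Hypothetical resolution of the overseer setting during a transition period in which
 * both the legacy OVERSEER role and the new roles spec are honored.
 */
class OverseerMigration {
  final String effectiveMode;
  final List<String> warnings = new ArrayList<>();

  /**
   * @param legacyPreferred whether the node carries the existing OVERSEER (preferred overseer) role
   * @param newMode the overseer value from the new roles spec, or null if it was not specified
   */
  OverseerMigration(boolean legacyPreferred, String newMode) {
    if (newMode == null) {
      // Only the legacy mechanism (or nothing) is in use: map it onto the new values.
      effectiveMode = legacyPreferred ? "preferred" : "allowed";
      if (legacyPreferred) {
        warnings.add("Legacy OVERSEER role is deprecated; use overseer:preferred in the roles spec");
      }
    } else {
      // Assumed precedence: the new spec wins, but a conflict is reported loudly.
      effectiveMode = newMode;
      if (legacyPreferred && !newMode.equals("preferred")) {
        warnings.add("Conflicting overseer settings: legacy OVERSEER role ignored in favor of overseer:" + newMode);
      }
    }
  }
}
```

Keeping the resolution in one small class also gives the "what happens if somebody uses both systems" question a single, testable answer.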
>>>>>>> >>>>>>> CHANGE REQUEST: I do not like that if all of the overseer
>>>>>>> nodes fail, then it is implied the overseer will go to one of the data
>>>>>>> nodes. The specific wording in the SIP - "When one or more such nodes
>>>>>>> are live, Solr guarantees that one of those nodes become the overseer."
>>>>>>> - implies to me that failover could go from overseer1 to overseer2 to
>>>>>>> overseerN to a random node. I feel like we need to keep some record that
>>>>>>> there were dedicated overseer nodes and stop the cascading failure
>>>>>>> instead of churning through our data nodes.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> CLARIFICATION: I am slightly confused by the proposed scope
>>>>>>> of "coordinator" roles from a split query/indexing standpoint. I
>>>>>>> understand that these are used as examples, but would like stronger
>>>>>>> language that new roles should also go through their own SIP
>>>>>>> discussions.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> CLARIFICATION: I do not like that we are storing node
>>>>>>> liveness in two different places now. We have the live nodes and the
>>>>>>> node roles stored in two different places in ZooKeeper, and it feels
>>>>>>> like this would lead to race conditions, split brain, or other
>>>>>>> hard-to-diagnose bugs when those two lists don't agree with each other.
>>>>>>> This also feels like it contradicts the "single source of truth" idea
>>>>>>> stated later in the proposal. I see Gus's arguments for decoupling these
>>>>>>> and am not strongly opposed; I just get a lurking feeling about it. Even
>>>>>>> if we don't do this, I would like this called out explicitly in the
>>>>>>> alternative approaches section as something that we considered and
>>>>>>> rejected, with details on why.
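If both lists are kept, one way to limit split-brain effects is to treat the live nodes list as the only source of truth for liveness and to use role registrations only to describe nodes. A minimal sketch, with hypothetical names and plain in-memory collections standing in for the ZooKeeper data: the live holders of a role are the intersection of the two lists, and registrations with no matching live node are reported as stale rather than trusted.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LiveRolesDemo {
  public static void main(String[] args) {
    Map<String, Set<String>> registry = Map.of("overseer", Set.of("node1", "node2"));
    Set<String> live = Set.of("node2", "node3");
    System.out.println(LiveRoles.liveNodesWithRole("overseer", registry, live)); // [node2]
    System.out.println(LiveRoles.staleEntries("overseer", registry, live));      // [node1]
  }
}

// Hypothetical helper: roleRegistry maps a role name to the nodes registered under it.
class LiveRoles {
  /** Registered role holders that are also live; liveness comes only from liveNodes. */
  static Set<String> liveNodesWithRole(String role, Map<String, Set<String>> roleRegistry,
                                       Set<String> liveNodes) {
    Set<String> result = new TreeSet<>(roleRegistry.getOrDefault(role, Set.of()));
    result.retainAll(liveNodes);
    return result;
  }

  /** Registrations with no matching live node, e.g. if a registration outlives its node. */
  static Set<String> staleEntries(String role, Map<String, Set<String>> roleRegistry,
                                  Set<String> liveNodes) {
    Set<String> result = new TreeSet<>(roleRegistry.getOrDefault(role, Set.of()));
    result.removeAll(liveNodes);
    return result;
  }
}
```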
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> GOOD: The API looks pretty clear. I would like an additional
>>>>>>> callout here that all operations are GET because node roles cannot be
>>>>>>> changed at runtime.
>>>>>>> >>>>>>> CLARIFICATION: How does this interact with the previous
>>>>>>> OVERSEER preference role?
>>>>>>> >>>>>>> CHANGE REQUEST: An additional API to get the list of
>>>>>>> available roles for a cluster. I _think_ this could be based on the
>>>>>>> version that the cluster is running? It would be useful to be able to
>>>>>>> interrogate a cluster in the future: we're seeing OOM issues on queries;
>>>>>>> can we add some query nodes? When were they introduced? I don't know
>>>>>>> what path this API should exist at.
>>>>>>> >>>>>>
>>>>>>> >>>>>>
>>>>>>> >>>>>> Added a GET /api/cluster/roles/supported API, updated the SIP
>>>>>>> document. Not sure if there's a better path that we could go for.
>>>>>>> >>>>>>
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> CLARIFICATION: Can we list the APIs to clearly show which
>>>>>>> parts are string literals and which parts are meant to be substituted by
>>>>>>> the operator? GET /api/cluster/roles/data would become GET
>>>>>>> /api/cluster/roles/${rolename} in our SIP/documentation.
>>>>>>> >>>>>>> CHANGE REQUEST: I think GET /api/cluster/roles/nodes/node1
>>>>>>> should be GET /api/cluster/roles/${nodename}, dropping the intermediate
>>>>>>> "nodes".
>>>>>>> >>>>>>> CHANGE REQUEST: The ZK structure also might not need that
>>>>>>> intermediate "nodes" node.
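The ambiguity Noble raised (is xyz a role or a node?) is easy to see in a small path classifier for the GET endpoints discussed in this thread: /api/cluster/roles/supported, /api/cluster/roles/${rolename} and /api/cluster/roles/nodes/${nodename}. RolesPath and classify are hypothetical names; this is not Solr's actual request routing.

```java
class RolesPathDemo {
  public static void main(String[] args) {
    System.out.println(RolesPath.classify("/api/cluster/roles/supported"));   // supported
    System.out.println(RolesPath.classify("/api/cluster/roles/data"));        // role:data
    System.out.println(RolesPath.classify("/api/cluster/roles/nodes/node1")); // node:node1
  }
}

// Hypothetical classifier for the GET paths proposed in the SIP discussion.
class RolesPath {
  static final String PREFIX = "/api/cluster/roles/";

  /** Classifies a GET path as "supported", "role:<rolename>" or "node:<nodename>". */
  static String classify(String path) {
    if (!path.startsWith(PREFIX)) {
      throw new IllegalArgumentException("Not a roles path: " + path);
    }
    String[] parts = path.substring(PREFIX.length()).split("/");
    if (parts.length == 1 && parts[0].equals("supported")) {
      return "supported";
    }
    if (parts.length == 2 && parts[0].equals("nodes") && !parts[1].isEmpty()) {
      return "node:" + parts[1];
    }
    if (parts.length == 1 && !parts[0].isEmpty()) {
      // Without the "nodes" segment, a node name here would be indistinguishable from a role name.
      return "role:" + parts[0];
    }
    throw new IllegalArgumentException("Unrecognized roles path: " + path);
  }
}
```

Dropping the intermediate nodes segment would collapse the last two cases into one, and in either layout supported effectively becomes a reserved name that no role can use.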
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> CLARIFICATION: Should listing roles require some
>>>>>>> permissions? Maybe this requirement is too fundamental to the operation
>>>>>>> of a cluster and everybody would have to be able to do it.
>>>>>>> >>>>>>> CLARIFICATION: How do we expect SolrJ (and other clients) to
>>>>>>> treat roles? Implementation detail that the servers will figure out? Or
>>>>>>> strict guidance where the client needs to check where specific roles are
>>>>>>> before sending any further communication to the server?
>>>>>>> >>>>>>> CLARIFICATION: What happens when a node gets a request that
>>>>>>> it can't fulfil? An overseer node gets a query or an update. A data node
>>>>>>> gets a collection creation request. Do they forward it on to an
>>>>>>> appropriate node, or do they reject it? Should this be configurable? If
>>>>>>> not, then it seems like lazy or poorly configured clients will defeat
>>>>>>> this isolation system quite easily.
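A sketch of the forward-or-reject choice as a configurable policy. MisroutedPolicy, RequestRouting and picking the first live node with the role are hypothetical; a real implementation would choose among candidates more carefully and return an error response to the client rather than throw.

```java
import java.util.List;
import java.util.Set;

class RequestRoutingDemo {
  public static void main(String[] args) {
    // An overseer-only node receives a query that needs a data node.
    String target = RequestRouting.route("ov1", Set.of("overseer"), "data",
        List.of("data1", "data2"), MisroutedPolicy.FORWARD);
    System.out.println(target); // data1
  }
}

// Hypothetical per-cluster setting for requests a node cannot fulfil itself.
enum MisroutedPolicy { FORWARD, REJECT }

class RequestRouting {
  /**
   * Returns the node that should handle a request requiring requiredRole:
   * this node if it has the role, otherwise a live node with the role (FORWARD),
   * or an exception (REJECT), which a server would turn into an error response.
   */
  static String route(String self, Set<String> selfRoles, String requiredRole,
                      List<String> liveNodesWithRole, MisroutedPolicy policy) {
    if (selfRoles.contains(requiredRole)) {
      return self;
    }
    if (policy == MisroutedPolicy.REJECT) {
      throw new IllegalStateException(self + " does not have role " + requiredRole);
    }
    if (liveNodesWithRole.isEmpty()) {
      throw new IllegalStateException("No live node has role " + requiredRole);
    }
    return liveNodesWithRole.get(0);
  }
}
```

With FORWARD as the default, lazy or poorly configured clients still reach the right node, at the cost of an extra hop; REJECT makes misconfigured clients fail fast instead.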
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> GOOD: Testing the API is very important, yes.
>>>>>>> >>>>>>> CLARIFICATION: What does testing for how nodes behave when
>>>>>>> roles are added mean? I thought we established that they are not
>>>>>>> dynamic.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Thanks,
>>>>>>> >>>>>>> Mike
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:17 AM Ishan Chattopadhyaya <
>>>>>>> ichattopadhy...@gmail.com> wrote:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Hi,
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Here's an SIP for introducing the concept of node roles:
>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694
>>>>>>> >>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> We also wish to add first-class support for Query nodes
>>>>>>> that are used to process user queries by forwarding them to data nodes,
>>>>>>> merging/aggregating the results, and presenting them to users. This
>>>>>>> concept exists as a first-class citizen in most other search engines.
>>>>>>> This is a chance for Solr to catch up.
>>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715
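What a query node does with results from data nodes can be sketched as a scatter-gather merge of top-k hits using a bounded min-heap. Hit, QueryNodeMerge and the single-float scoring model are hypothetical simplifications; real distributed search also has to handle sorting by fields, paging and facets, which this ignores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class QueryNodeMergeDemo {
  public static void main(String[] args) {
    List<List<Hit>> perShard = List.of(
        List.of(new Hit("a", 3.0f), new Hit("b", 1.0f)),
        List.of(new Hit("c", 2.0f), new Hit("d", 0.5f)));
    System.out.println(QueryNodeMerge.mergeTopK(perShard, 2)); // [a:3.0, c:2.0]
  }
}

// Hypothetical hit returned by a data node.
class Hit {
  final String id;
  final float score;

  Hit(String id, float score) {
    this.id = id;
    this.score = score;
  }

  @Override
  public String toString() {
    return id + ":" + score;
  }
}

class QueryNodeMerge {
  private static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble(h -> h.score);

  /** Merges hits returned by data nodes into a global top-k, highest score first. */
  static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
    PriorityQueue<Hit> heap = new PriorityQueue<>(BY_SCORE); // min-heap on score
    for (List<Hit> shardHits : perShard) {
      for (Hit hit : shardHits) {
        heap.offer(hit);
        if (heap.size() > k) {
          heap.poll(); // drop the current lowest-scoring hit
        }
      }
    }
    List<Hit> result = new ArrayList<>(heap);
    result.sort(BY_SCORE.reversed());
    return result;
  }
}
```

The heap never holds more than k + 1 hits, so the query node's memory for merging stays proportional to k rather than to the total number of hits returned by all data nodes.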
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Regards,
>>>>>>> >>>>>>>> Ishan / Noble / Hitesh
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> --
>>>>>>> >>> http://www.needhamsoftware.com (work)
>>>>>>> >>> http://www.the111shift.com (play)
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
>>>>>>> For additional commands, e-mail: dev-h...@solr.apache.org
>>>>>>>
>>>>>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>
>
> --
> -----------------------------------------------------
> Noble Paul
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
