Re: First class support for node roles

Ishan Chattopadhyaya Tue, 02 Nov 2021 16:13:05 -0700

Hi Tim,
Here are my responses inline.

On Wed, Nov 3, 2021 at 3:22 AM Timothy Potter <[email protected]> wrote:


> I'm just not convinced this feature is even needed and the SIP is not
> convincing that "There is no proper alternative today."
>

There are no proper alternatives today, just hacks. On 8x, we have two
different deprecated frameworks to stop replicas from being placed on a node
(1. rule based replica placement, 2. autoscaling framework). On 9x, we have
a new autoscaling framework, which I don't even think is fully implemented.
And, there's definitely no way to have a node act as a query coordinator
without having data on it.


>
> 1) Just b/c Elastic and Vespa have a concept of node roles, doesn't
> mean Solr needs this.


Solr needs this. That Elastic has such concepts is a coincidence, but it also
means we have an opportunity to catch up with them; they have these concepts
for a reason.


> Also, some of Elastic's roles overlap with
> concepts Solr already has in a different form, i.e data_hot sounds
> like NRT and data_warm sounds a lot like our Pull Replica Type
>

I think that is beyond the scope of this SIP.


>
> 2) You can achieve the "coordinator" role with auto-scaling rules
> pre-9.x and with the AffinityPlacementPlugin (heck, it even has a node
> type built in:
> .requestNodeSystemProperty(AffinityPlacementConfig.NODE_TYPE_SYSPROP).
> Simply build your replica placement rules such that no replicas land
> on "coordinator" nodes. And you can route queries using node.sysprop
> already using shards.preference.
>

I think you missed the whole point of the query coordinator. Please refer
to this https://issues.apache.org/jira/browse/SOLR-15715.
Let me summarize the main difference between what (I think) you refer to
and what is proposed in SOLR-15715.

With your suggestion, we'll have a node that doesn't host any replicas. And
you suggest queries landing on such nodes be routed using
shards.preference? Well, in such a case, these queries will be
forwarded/proxied to a random node hosting a replica of the collection and
that node then acts as the coordinator. This situation is no better than
sending the query directly to that particular node.

What is proposed in SOLR-15715 is query aggregation functionality. There
will be pseudo replicas (aware of the configset) on this coordinator node
that handle the request themselves: they send shard requests to the
data-hosting replicas, collect and merge the responses, and send the result
back to the user. This merge step is usually extremely memory intensive, and
it would be good to serve these off stateless nodes (that host no data).
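
To make the data flow concrete, here is a minimal sketch of the
scatter-gather step described above. It is only an illustration, not the
SOLR-15715 implementation: `coordinate`, `ShardClient` and the
(score, doc id) hit tuples are hypothetical names, and real shard requests
would go over HTTP rather than through in-process callables.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration: a hit is (score, doc_id), and a shard
# client takes (query, rows) and returns that shard's local top hits.
Hit = Tuple[float, str]
ShardClient = Callable[[str, int], List[Hit]]


def coordinate(query: str, shards: Dict[str, ShardClient], rows: int) -> List[Hit]:
    """Send the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = [pool.submit(client, query, rows) for client in shards.values()]
        per_shard = [f.result() for f in futures]

    # The merge holds up to rows * len(shards) hits in memory at once, which
    # is why it is attractive to run it on a node that hosts no data.
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(rows, all_hits)


if __name__ == "__main__":
    demo_shards = {
        "shard1": lambda q, n: [(0.9, "a"), (0.4, "b")][:n],
        "shard2": lambda q, n: [(0.7, "c"), (0.1, "d")][:n],
    }
    print(coordinate("*:*", demo_shards, 3))
    # prints [(0.9, 'a'), (0.7, 'c'), (0.4, 'b')]
```

The point of the sketch is only that the node running `coordinate` needs
the merge logic and nothing else; it never has to hold an index itself.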


>
> 3) Dedicated overseer role? I thought we were removing the overseer?!?
> Also, we already have the ability to run the overseer on specific
> nodes w/o a new framework, so this doesn't really convince me we need
> a new framework.
>

There's absolutely no change proposed to the "overseer" role. What users
need on production clusters are nodes dedicated to overseer operations,
and for that the current "overseer" role suffices, together with some
functionality to not place replicas on such nodes.
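
The dedicated-overseer setup above boils down to two rules, sketched here
under stated assumptions: a node started without any roles is treated as a
data node (the backcompat default argued for in this thread), and replicas
only go to nodes that have the data role. `effective_roles` and
`replica_candidates` are hypothetical helpers, not existing Solr APIs; the
comma-separated value mirrors the -Dnode.roles=... syntax discussed in the
thread.

```python
from typing import Dict, List, Optional, Set


def effective_roles(node_roles_param: Optional[str]) -> Set[str]:
    """Resolve a node's roles from its (possibly missing) node.roles value.

    Assumption for this sketch: no value means the node is a plain data
    node; the overseer role is never implied.
    """
    if not node_roles_param:
        return {"data"}
    return {role.strip() for role in node_roles_param.split(",") if role.strip()}


def replica_candidates(cluster: Dict[str, Optional[str]]) -> List[str]:
    """Return the nodes that replicas may be placed on: those with the data role."""
    return sorted(
        node for node, param in cluster.items() if "data" in effective_roles(param)
    )


if __name__ == "__main__":
    cluster = {
        "solr1": None,          # started without node.roles
        "solr2": "data",
        "solr101": "overseer",  # intended as a dedicated overseer
    }
    print(replica_candidates(cluster))
    # prints ['solr1', 'solr2']
```

With these rules the existing data nodes need no restart: only the new node
is started with an explicit role, and it is skipped for replica placement.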


>
> 4) We will indeed need to decide which nodes host embedded ZooKeepers
> but I'd argue that solution hasn't been designed entirely and we
> probably don't need a formal node role framework to determine which
> nodes host embedded ZKs. Moreover, embedded ZK seems more like a small
> cluster thing and anyone running a large cluster will probably have a
> dedicated ZK ensemble as they do today. The node role thing seems like
> it's intended for large clusters and my gut says few will use embedded
> ZK for large clusters.
>

This SIP is not the right place for this discussion. There's a separate SIP
for this.


>
> 5) You can also achieve a lot of "node role" functionality in query
> routing using the shards.preference parameter.
>
>
That doesn't serve the purpose behind
https://issues.apache.org/jira/browse/SOLR-15715.


> At the very least, the SIP needs to list specific use cases that
> require this feature that are not achievable with the current features
> before getting bogged down in the impl. details.
>

The coordinator role is the biggest motivation for introducing the concept
of roles. However, in addition to what is proposed in SOLR-15715, a
coordinator node can later also serve as a node on which users run
streaming expressions or do bulk indexing (impl details for this will come
later; I don't want the distraction here).


>
> Tim
>
> On Tue, Nov 2, 2021 at 3:20 PM Gus Heck <[email protected]> wrote:
> >
> > I think there are things not yet accounted for. Time I spent yesterday
> is biting me today. Pls give a couple days.
> >
> > On Tue, Nov 2, 2021 at 11:28 AM Jason Gerlowski <[email protected]>
> wrote:
> >>
> >> Hey Ishan,
> >>
> >> I appreciate you writing up the SIP!  Here's some notes/questions I
> >> had as I was reading through your writeup and this mail thread.
> >> ("----" separators between thoughts, hopefully that helps.)
> >>
> >> ----
> >>
> >> I'll add my vote to what Jan, Gus, Ilan, and Houston already
> >> suggested: roles should default to "all-on".  I see the downsides
> >> you're worried about with that approach (esp. around 'overseer'), but
> >> they may be mitigatable, at least in part.
> >>
> >> > [mail thread] User wants this node Solr101 to be a dedicated
> overseer, but for that to happen, he/she would need to restart all the data
> nodes with -Dnode.roles=data
> >>
> >> Sure, if roles can only be specified at startup.  But that may be a
> >> self-imposed constraint.
> >>
> >> An API to change a node's roles would remove the need for a restart
> >> and make it easy for users to affect the semantics they want.  You
> >> decided you want a dedicated overseer N nodes into your cluster
> >> deployment?  Deploy node 'N' with the 'overseer', and toggle the
> >> overseer role off on the remainder.
> >>
> >> Now, I understand that you don't want roles to change at runtime, but
> >> I haven't seen you get much into "why", beyond saying "it is very
> >> risky to have nodes change roles while they are up and running."  Can
> >> you expand a bit on the risks you're worried about?  If you're
> >> explicit about them here maybe someone can think of a clever way to
> >> address them?
> >>
> >> > Hence, if those nodes are "assumed to have all roles", then just by
> virtue of upgrading to this new version, new capabilities will be turned on
> for the entire cluster, whether or not the user opted for such a
> capability. This is totally undesirable.
> >>
> >> Obviously "roles" refer to much bigger chunks of functionality than
> >> usual, so in a sense defaulting roles on is scarier.  But in a sense
> >> you're describing something that's an inherent part of software
> >> releases.  Releases expose new features that are typically on by
> >> default.  A new default-on role in 9.1 might hurt a user, but there's
> >> no fundamental difference between that and a change to backups or
> >> replication or whatever in the same release.
> >>
> >> I don't mean to belittle the difference in scope - I get your concern.
> >> But IMO this is something to address with good release notes and
> >> documentation.  Designing for admins who don't do even cursory
> >> research before an upgrade ties both our hands behind our back as a
> >> project.
> >>
> >> ----
> >>
> >> > [SIP] Internal representation in ZK ... Implementation details like
> these can be fleshed out in the PR
> >>
> >> IMO this is important enough to flesh out as part of the SIP, at least
> >> in broad strokes.  It affects backcompat, SolrJ client design, etc.
> >>
> >> ----
> >>
> >> > [SIP] GET /api/cluster/roles?node=node1
> >>
> >> Woohoo - way to include a v2 API definition!
> >>
> >> AFAIR, the v2 API has a /nodes path defined - I wonder whether "GET
> >> /nodes/someNode/roles" wouldn't be a more intuitive endpoint for the
> >> "get the roles this node has" functionality.  Though I leave that for
> >> your consideration.
> >>
> >> ----
> >>
> >> Looking forward to your responses and seeing the SIP progress!  It's a
> >> really cool, promising idea IMO.
> >>
> >> Best,
> >>
> >> Jason
> >>
> >> On Tue, Nov 2, 2021 at 11:21 AM Ishan Chattopadhyaya
> >> <[email protected]> wrote:
> >> >
> >> > Are there any unaddressed outstanding concerns that we should hold up
> the SIP for?
> >> >
> >> > On Mon, 1 Nov, 2021, 10:31 pm Ishan Chattopadhyaya, <
> [email protected]> wrote:
> >> >>>
> >> >>> >> Agree. However, I disagree with ideas where "query analysis" has
> a role of its own. Where would that lead us to? Separate roles for
> >> >>>
> >> >>> >> nodes that do "faceting" or "spell correction" etc.? But anyway,
> that is for discussion when we add future roles. This is beyond this SIP.
> >> >>
> >> >>
> >> >> > I am not asking you to implement every possible role of course :).
> As a note I know a company that is running an entire separate
> >> >> > cluster to offload and better serve highlighting on a subset of
> large docs, so YES I think there are people who may want such fine grained
> control.
> >> >>
> >> >> Cool, I think we can discuss adding any additional roles (for
> highlighting?) on a case by case basis at a later point.
> >> >>
> >> >>
> >> >> On Mon, Nov 1, 2021 at 10:25 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >> >>>
> >> >>> > Boiling it down the idea I'm proposing is that roles required for
> back compatibility get explicitly added on startup, if not by the user then
> by the code. This is more flexible than assuming that no role means every
> role, because then every new feature that has a role will end up on legacy
> clusters which are also not back compatible.
> >> >>>
> >> >>> +1, I totally agree. I even said so, when I said: "This is why I
> was advocating that 1) we assume the "data" as a default, 2) not assume
> overseer to be implicitly defined (because of the way overseer role is
> written today), 3) not assume any future roles to be true by default."
> >> >>>
> >> >>> So, basically, I'm proposing that the "roles required for back
> compatibility" (that should be explicitly added on startup) be just the
> ["data"] role, and not the "overseer" role (due to the way overseer role is
> currently defined, i.e. it is "preferred overseer").
> >> >>>
> >> >>> On Mon, Nov 1, 2021 at 10:19 PM Gus Heck <[email protected]>
> wrote:
> >> >>>>
> >> >>>> Very sorry, I don't mean to sound offended. Frustrated yes, offended
> no :)... the most difficult thing about communication is the illusion it
> has occurred :)
> >> >>>>
> >> >>>> If you read back just a few emails you'll see where I talk about
> roles being applied on startup. Boiling it down the idea I'm proposing is
> that roles required for back compatibility get explicitly added on startup,
> if not by the user then by the code. This is more flexible than assuming
> that no role means every role, because then every new feature that has a
> role will end up on legacy clusters which are also not back compatible.
> >> >>>>
> >> >>>> There are points where I said all roles rather than back
> compatibility roles because I was thinking about back compatibility
> specifically, but you can't know that if I don't say that can you :).
> >> >>>>
> >> >>>> On Mon, Nov 1, 2021 at 12:39 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >> >>>>>
> >> >>>>> > If you read more closely, my way can provide full back
> compatibility. To say or imply it doesn't isn't helping. Perhaps you need
> to re-read?
> >> >>>>>
> >> >>>>> I understand e-mails are frustrating, and I'm trying my best.
> Please don't be offended, and kindly point me to the exact part you want me
> to re-read.
> >> >>>>>
> >> >>>>> On Mon, Nov 1, 2021 at 10:05 PM Gus Heck <[email protected]>
> wrote:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Mon, Nov 1, 2021 at 12:22 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >> >>>>>>>
> >> >>>>>>> >    Positive - They denote the existence of a capability
> >> >>>>>>>
> >> >>>>>>> Agree, the SIP already reflects this.
> >> >>>>>>>
> >> >>>>>>> >   Absolute - Absence/Presence binary identification of a
> capability; no implications, no assumptions
> >> >>>>>>>
> >> >>>>>>> Disagree, we need backcompat handling on nodes running without
> any roles. There has to be an implicit assumption as to what roles those
> nodes are assumed to have. My proposal is that only the "data" role be
> assumed, but not the "overseer" role. For any future roles ("coordinator",
> "zookeeper" etc.), the decision as to what the absence of any role implies
> should be left to the implementation of that future role. Documentation
> should clearly reflect these implicit assumptions.
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> If you read more closely, my way can provide full back
> compatibility. To say or imply it doesn't isn't helping. Perhaps you need
> to re-read?
> >> >>>>>>
> >> >>>>>>>
> >> >>>>>>> >    Focused - Do one thing per role
> >> >>>>>>>
> >> >>>>>>> Agree. However, I disagree with ideas where "query analysis"
> has a role of its own. Where would that lead us to? Separate roles for
> nodes that do "faceting" or "spell correction" etc.? But anyway, that is
> for discussion when we add future roles. This is beyond this SIP.
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>> I am not asking you to implement every possible role of course
> :). As a note I know a company that is running an entire separate cluster
> to offload and better serve highlighting on a subset of large docs, so YES
> I think there are people who may want such fine grained control.
> >> >>>>>>
> >> >>>>>>>
> >> >>>>>>> >    Accessible - It should be dead simple to determine the
> members of a role, avoid parsing blobs of json, avoid calculating
> implications, avoid consulting other resources after listing nodes with the
> role
> >> >>>>>>>
> >> >>>>>>> Agree. I'm open to any implementation details that make it
> easy. There should be a reasonable API to return these node roles, with
> ability to filter by role or filter by node.
> >> >>>>>>>
> >> >>>>>>> >    Independent - One role should not require other roles to
> be present
> >> >>>>>>>
> >> >>>>>>> Do we need to have this hard and fast requirement upfront?
> There might be situations where this is desirable. I feel we can discuss on
> a case by case basis whenever a future role is added.
> >> >>>>>>>
> >> >>>>>>> >    Persistent - roles should not be lost across reboot
> >> >>>>>>>
> >> >>>>>>> Agree.
> >> >>>>>>>
> >> >>>>>>> >    Immutable - roles should not change while the node is
> running
> >> >>>>>>>
> >> >>>>>>> Agree
> >> >>>>>>>
> >> >>>>>>> >    Lively - A node with a capability may not be presently
> providing that capability.
> >> >>>>>>>
> >> >>>>>>> I don't understand, can you please elaborate?
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Specifically imagine the case where there are 100 nodes:
> >> >>>>>> 1-100 ==> DATA
> >> >>>>>> 101-103 ==> OVERSEER
> >> >>>>>> 104-106 ==> ZOOKEEPER
> >> >>>>>>
> >> >>>>>> But you won't have 3 overseers... you'll want only one of those
> to be providing overseer functionality and the other two to be capable, but
> not providing (so that if the current overseer goes down a new one can be
> assigned).
> >> >>>>>>
> >> >>>>>> Then you decide you'd like 5 Zookeepers. You start nodes 107-108
> with that role, but you probably want to ensure that zookeepers require
> some sort of command for them to actually join the zookeeper cluster (i.e.
> /admin?action=ZKADD&nodes=node107,node18) ... to do that the nodes need to
> be up. But oh look I typoed 108... we want that to fail... how? because 18
> does not have the capability to become a zookeeper.
> >> >>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Mon, Nov 1, 2021 at 9:30 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >> >>>>>>>>
> >> >>>>>>>> > Ilan: A node not having node.roles defined should be assumed
> to have all roles. Not only data. I don't see a reason to special case this
> one or any role.
> >> >>>>>>>> > Gus: There should be no "assumptions" Nothing to figure out.
> A node has a role or not. For back compatibility reasons, all roles would
> be assumed on startup if none specified.
> >> >>>>>>>> > Jan: No role == all roles. Explicit list of roles = exactly
> those roles.
> >> >>>>>>>>
> >> >>>>>>>> Problem with this approach is mainly to do with backcompat.
> >> >>>>>>>>
> >> >>>>>>>> 1. Overseer backcompat:
> >> >>>>>>>> If we don't make any modifications to how overseer works and
> adopt this approach (as quoted), then imagine this situation:
> >> >>>>>>>>
> >> >>>>>>>> Solr1-100: No roles param (assumed to be "data,overseer").
> >> >>>>>>>> Solr101: -Dnode.roles=overseer (intention: dedicated overseer)
> >> >>>>>>>>
> >> >>>>>>>> User wants this node Solr101 to be a dedicated overseer, but
> for that to happen, he/she would need to restart all the data nodes with
> -Dnode.roles=data. This will cause unnecessary disruption to running
> clusters where a dedicated overseer is needed. Keep in mind, if a user
> needs a dedicated overseer, he's likely in an emergency situation and
> restarting the whole cluster might not be viable for him/her.
> >> >>>>>>>>
> >> >>>>>>>> 2. Future roles might not be compatible with this "assumed to
> have all roles" idea:
> >> >>>>>>>> Take the proposed "zookeeper" role for example. Today, regular
> nodes are not supposed to have embedded ZK running on them. By introducing
> this artificial limitation ("assumed to have all roles"), we constrain
> adoption of all future roles to necessarily require a full cluster restart.
> >> >>>>>>>>
> >> >>>>>>>> Keep in mind newer Solr versions can introduce new
> capabilities and roles. Imagine we have a role that is defined in a new
> Solr version (and there's functionality to go with that role), and user
> upgrades to that version. However, his/her nodes all were started with no
> node.roles param. Hence, if those nodes are "assumed to have all roles",
> then just by virtue of upgrading to this new version, new capabilities will
> be turned on for the entire cluster, whether or not the user opted for such
> a capability. This is totally undesirable.
> >> >>>>>>>>
> >> >>>>>>>> > Gus: I actually don't want a coordinator to do more work, I
> would prefer small focused roles with names that accurately describe their
> function. In that light, COORDINATOR might be too nebulous. How about
> AGGREGATOR role? (what I was thinking of would better be called a
> QUERY_ANALYSIS role)
> >> >>>>>>>>
> >> >>>>>>>> If you want to do specific things like query analysis or query
> aggregation or bulk indexing etc, all of those can be done on COORDINATOR
> nodes (as is the case in Elasticsearch). Having tens of "small focused
> roles" defined as first class concepts would be confusing to the user. As a
> remedy to your situation where you want the coordinator role to also do
> query-analysis for shards, one possible solution is to send such a query to
> a coordinator node with a parameter like "coordinator.query_analysis=true",
> and then the coordinator, instead of blindly hitting remote shards, also
> does some extra work on behalf of the shards.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On Mon, Nov 1, 2021 at 9:01 PM Ishan Chattopadhyaya <
> [email protected]> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> > If we make collections role-aware for example (replicas of
> that collection can only be
> >> >>>>>>>>> > placed on nodes with a specific role, in addition to the
> other role based constraints),
> >> >>>>>>>>> > the set of roles should be user extensible and not fixed.
> >> >>>>>>>>> > If collections are not role aware, the constraints
> introduced by roles apply to all collections
> >> >>>>>>>>> > equally which might be insufficient if a user needs for
> example a heavily used collection to
> >> >>>>>>>>> > only be placed on more powerful nodes.
> >> >>>>>>>>>
> >> >>>>>>>>> I feel node roles and role-aware collections are orthogonal
> topics. What you describe above can be achieved by the autoscaling+replica
> placement framework where the placement plugins take the node roles as one
> of the inputs.
> >> >>>>>>>>>
> >> >>>>>>>>> > It does impact the design from early on: the set of roles
> need to be expandable by a user
> >> >>>>>>>>> > by creating a collection with new roles for example
> (consumed by placement plugins) and be
> >> >>>>>>>>> > able to start nodes with new (arbitrary) roles. Should such
> roles follow some naming syntax to
> >> >>>>>>>>> > differentiate them from built in roles? To be able to fail
> on typos on roles - that otherwise can be
> >> >>>>>>>>> > crippling and hard to debug. This implies in any case that
> the current design can't assume all
> >> >>>>>>>>> > roles are known at compile time or define them in a Java
> enum.
> >> >>>>>>>>>
> >> >>>>>>>>> I think this should be achieved by something different from
> roles. Something like node labels (user defined) which can then be used in
> a replica placement plugin to assign replicas. I see roles as more closely
> associated with kinds of functionality a node is designated for. Therefore,
> I feel that replica placements and user defined node labels are out of scope
> for this SIP. It can be added later in a separate SIP, without being at
> odds with this proposal.
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> On Mon, Nov 1, 2021 at 8:42 PM Jan Høydahl <
> [email protected]> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> > 1. nov. 2021 kl. 14:46 skrev Ilan Ginzburg <
> [email protected]>:
> >> >>>>>>>>>> > A node not having node.roles defined should be assumed to
> have all roles. Not only data. I don't see a reason to special case this
> one or any role.
> >> >>>>>>>>>>
> >> >>>>>>>>>> +1, make it simple and transparent. No role == all roles.
> Explicit list of roles = exactly those roles.
> >> >>>>>>>>>>
> >> >>>>>>>>>> > (Gus) See my comment above, but maybe preference is
> something handled as a feature of the role rather than via role designation?
> >> >>>>>>>>>>
> >> >>>>>>>>>> Yea, we always need an overseer, so that feature can decide
> to use its list of nodes as a preference if it so chooses.
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> Aside: I think it makes it easier if we always prefix Solr
> env.vars and sys.props with "SOLR_" or "solr.", i.e. -Dsolr.node.roles=foo.
> That way we can get away from having to have explicit code in bin/solr,
> bin/solr.cmd and SolrCLI to manage every single property. Instead we can
> parse all ENVs and Props with the solr prefix in our bootstrap code. And we
> can by convention allow e.g. docker run -e SOLR_NODE_ROLES=foo solr:9 and
> it would be the same thing...
> >> >>>>>>>>>>
> >> >>>>>>>>>> Jan
> >> >>>>>>>>>>
> ---------------------------------------------------------------------
> >> >>>>>>>>>> To unsubscribe, e-mail: [email protected]
> >> >>>>>>>>>> For additional commands, e-mail: [email protected]
> >> >>>>>>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> http://www.needhamsoftware.com (work)
> >> >>>>>> http://www.the111shift.com (play)
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> http://www.needhamsoftware.com (work)
> >> >>>> http://www.the111shift.com (play)
> >>
> >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > http://www.the111shift.com (play)
