edit: 6. (+Gus) Providing be evidenced by the node *adding itself to a list* of ephemeral nodes (similar to live_nodes) for each role
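To make the "capable" vs. "currently providing" distinction in items 4-6 concrete, here is a minimal in-memory sketch. It is not Solr or ZooKeeper code: the class name, method names and role names are all hypothetical, a plain map stands in for the durable json listing of capabilities, and per-role sets stand in for the ephemeral per-role lists whose entries vanish when a node goes away.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Solr.
class RoleRegistrySketch {
    enum Role { DATA, OVERSEER, COORDINATOR, ZOOKEEPER }

    // "Capable": declared at startup, kept even while the node is down
    // (stands in for the json file in zk).
    private final Map<String, EnumSet<Role>> capable = new HashMap<>();

    // "Providing": one membership list per role, similar to live_nodes
    // (stands in for ephemeral nodes under a per-role parent).
    private final Map<Role, Set<String>> providing = new EnumMap<>(Role.class);

    void declareCapable(String node, EnumSet<Role> roles) {
        capable.put(node, EnumSet.copyOf(roles));
    }

    // A node may only join the list for a role it was declared capable of.
    boolean startProviding(String node, Role role) {
        if (!capabilities(node).contains(role)) {
            return false;
        }
        providing.computeIfAbsent(role, r -> new TreeSet<>()).add(node);
        return true;
    }

    // Simulates the ephemeral entries disappearing when the node's session ends.
    void nodeDown(String node) {
        for (Set<String> members : providing.values()) {
            members.remove(node);
        }
    }

    // Finding a provider needs no cross-check against live_nodes.
    Set<String> providers(Role role) {
        return Collections.unmodifiableSet(
            providing.getOrDefault(role, Collections.<String>emptySet()));
    }

    // Capabilities of a node remain known even when it is down.
    Set<Role> capabilities(String node) {
        return Collections.unmodifiableSet(
            capable.getOrDefault(node, EnumSet.noneOf(Role.class)));
    }

    public static void main(String[] args) {
        RoleRegistrySketch reg = new RoleRegistrySketch();
        reg.declareCapable("node1", EnumSet.of(Role.DATA, Role.ZOOKEEPER));
        reg.startProviding("node1", Role.DATA);
        System.out.println("DATA providers: " + reg.providers(Role.DATA));
        reg.nodeDown("node1");
        System.out.println("node1 capabilities while down: " + reg.capabilities("node1"));
    }
}
```

Note that in this sketch node1 is capable of ZOOKEEPER but never joins that list, which mirrors the idea of a node waiting for a partner before it actually starts providing zk.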
On Fri, Oct 29, 2021 at 9:40 AM Gus Heck <[email protected]> wrote: > I've heard a number of folks agree that we should not have negative (role > removal) values for roles (!data in the sip). > > I also don't like the idea of the "coordinator" creating assumptions about > other roles. I think the point of avoiding "!data" is to make it > programmatically and logically easy to tell what role a node has, if we > have to have a method called figureOutImpliedRoles() with a lot of logic in > it that's bad. It should just be getRoles().contains(role), trivially > returning the roles that are already declared in config/zk/whatever. > > We don't have to support every possible role all at once. We can have > "basic functionality" that all nodes provide regardless of roles (right now > that's everything), and then lop off chunks of basic functionality and > assign them to roles. That should be easy and backward compatible if we > then give the new role to every node by default on upgrade. > > However we should carefully think about what should and shouldn't be part > of any role, because moving functionality out of a role back to basic > functionality or between roles will create backwards compatibility issues. > This is why I think we should have a concept of what roles we will have in > the future, so we don't inadvertently move functionality into a role that > later needs to go in some other role (mistakes/bugs may happen of course, > but best effort). > > So boiling it down I've seen suggestion for the following additions/edits > to the SIP: > > 1. (+Gus, +Houston,+Ilan) Positive roles, the existence of which > implies functionality such that if a node can provide functionality. i.e. > it always has the role if it can and if it doesn't have the role it can't > provide the functionality. > 2. (+Houston,+Ishan,+Gus - below) Rename query role > 3. (+Gus) We should include a plan for the overall set of roles to > work towards and then build them out as time allows us to. 
> 4. (+Gus) We have a distinction between "capable" and "currently > providing" > 5. (+Gus) Capable be evidenced by a config/startup designation that > adds a list of roles to a json file in zk where the nodes are all listed > 6. (+Gus) Providing be evidenced by the node adding an list of > ephemeral nodes (similar to live_nodes) for each role > 7. (+Ilan, +Gus) Making collections role aware > > Ilan suggested that we make collections role-aware which would make some > sense since the collection might want to have a minimum of 2 > query-aggregator nodes available, might want to avoid zk nodes, etc. I > think that this is a good next feature and the intention should be added to > the SIP, but need not be in the initial implementation since by default > everything can have all roles (roles implemented to date) and initially > removing roles from nodes will be an advanced/manual feature mostly > applicable to static clusters that don't add collections regularly, then > support for role aware collections can be added to make the feature useful > for a wider audience (should be its own ticket anyway, and it interacts > with replica placement). > > I've heard several agree with #1, and it seems 3-6 were either not yet > clear or folks are still deliberating as I haven't noticed positive or > negative opinions there, just some discussion of the definition of > candidate roles. I'm fond of 3-5 because it allows for things like knowing > what the capabilities of a down node are, and finding a provider without > having to cross-coordinate with live_nodes. (keeps code simple, avoids > racing between the check for liveness and the check for the capability) > Also, a node joining as live and able to serve queries can be decoupled > from when it's ready to provide a service (thinking at least zk here, > waiting for a 2nd node capable of zk before expanding the zk cluster to > avoid even numbered clusters). 
> > Ishan, specifics on how your coordinator node would work would be > interesting to know if it really is distinct from my concept of a "query" > node. I agree that that term is probably confusing, I used it to mean > "query parsing" you meant it as "query aggregator". > > As a side note, with positive only roles and all roles added unless > specified otherwise, Ishan's use case might be as simple as just removing > the DATA role from a few nodes and restricting the aggregation queries > concerned to those nodes. To get solr to enforce the restriction for you, > then a "query/compute/coordinator" role must be removed from the remainder > of the nodes. > > -Gus > > On Fri, Oct 29, 2021 at 5:49 AM Ishan Chattopadhyaya < > [email protected]> wrote: > >> > I'll introduce a change to the SIP document, unless there are >> objections/questions/concerns. WDYT? >> I've made the change to the document. Feedback on this welcome. >> >> On Fri, Oct 29, 2021 at 2:52 PM Ishan Chattopadhyaya < >> [email protected]> wrote: >> >>> It seems to me, after going through this thread, that the role "query" >>> is misleading for what we're planning to introduce in SOLR-15715. We're >>> essentially introducing a functionality for performing, what we initially >>> called, "query aggregations". The actual queries run on data nodes anyway, >>> just that the first point of entry for such distributed queries will be a >>> separate node with this extra functionality. Towards that, I feel we should >>> call a node with such capability as a "coordinator" node (instead of "query >>> node"). It fits nicely with the mental model of ElasticSearch as well: >>> https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html#coordinating-node >>> . >>> >>> Proposing that if a node has a role "coordinator", then that node is >>> already assumed to have no data replicas on it. 
If a node starts with roles >>> "coordinator,data" both, then the startup should fail with a message saying >>> a coordinator node should not host data on it. A coordinator node, though, >>> can have other roles like "zookeeper" or "overseer" etc. >>> >>> I'll introduce a change to the SIP document, unless there are >>> objections/questions/concerns. WDYT? >>> >>> >>> >>> On Fri, Oct 29, 2021 at 1:54 PM Ilan Ginzburg <[email protected]> >>> wrote: >>> >>>> If we make collections role-aware for example (replicas of that >>>> collection can only be placed on nodes with a specific role, in addition to >>>> the other role based constraints), the set of roles should be user >>>> extensible and not fixed. >>>> >>>> If collections are not role aware, the constraints introduced by roles >>>> apply to all collections equally which might be insufficient if a user >>>> needs for example a heavily used collection to only be placed on more >>>> powerful nodes. >>>> >>>> Ilan >>>> >>>> On Thu 28 Oct 2021 at 07:59, Gus Heck <[email protected]> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Oct 27, 2021 at 3:34 PM Houston Putman < >>>>> [email protected]> wrote: >>>>> >>>>>> I don't think it's unreasonable to want to have nodes that don't >>>>>>> accept queries. This is just ishan's use case. >>>>>> >>>>>> >>>>>> Maybe I am misunderstanding, and it does deal with your last point >>>>>> about inter-node communication, but Peer-sync uses queries when doing >>>>>> replication between replicas. If a node doesn't have queries enabled, >>>>>> then >>>>>> it's possible to break peer sync. There are other options to make sure >>>>>> certain replicas aren't queried (shards.preference). >>>>>> For the separation of update/query traffic, I understand having >>>>>> compute nodes that deal with pre-replica commands, such as managing >>>>>> distributed queries or pre-processing documents in the URP chain. 
But for >>>>>> actual non-distrib queries and final update requests, the only way to >>>>>> actually separate this traffic is using PULL/TLOG replicas, because >>>>>> otherwise (with NRT) all update requests are still going to the query >>>>>> nodes, just the same as non-query nodes that are hosting those replicas. >>>>>> (and shard leadership could go to any "data" node, since I imagine we >>>>>> wouldn't filter out the "query" nodes...) The shards.preference option >>>>>> makes it easy to send queries to only PULL replicas in this scenario. >>>>>> That's why I saw this more as a "compute" role rather than "query". >>>>>> >>>>> >>>>> Yeah for internal stuff we still need the ability to query so we might >>>>> need to accommodate that that, but you may not have noticed the bit where >>>>> I >>>>> mentioned Query nodes doing the parsing/analysis of the query and then >>>>> sending a fully parsed (possibly serialized lucene objects) query to the >>>>> data node. So data nodes would only speak a single lucene level query >>>>> language and not parse queries or analyze text. In theory, with that, one >>>>> could even have something like elastic reduce a request to lucene objects >>>>> and then get an answer from a solr data node (for simple cases without >>>>> need >>>>> to find shards via zookeeper etc) certainly many details and corner cases >>>>> to think about there. >>>>> >>>>> >>>>>> >>>>>> Definitely not what I would like. If I'm going to try to segregate >>>>>>> data nodes out to certain nodes, I don't want them appearing elsewhere >>>>>>> just >>>>>>> cause something went down or filled up. Nor would I want updates to >>>>>>> suddenly start bogging down my query nodes.... >>>>>>> >>>>>> >>>>>> I guess it depends on what you are optimizing for. Maybe there can be >>>>>> an option for this. like -DnonLenientRoles=data,update or something like >>>>>> that. 
>>>>>> >>>>> >>>>> A possibility >>>>> >>>>> >>>>>> >>>>>> On Wed, Oct 27, 2021 at 3:03 PM Gus Heck <[email protected]> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Oct 27, 2021 at 2:44 PM Houston Putman < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> As for the "query" role, let's name it something better like >>>>>>>> "compute", since data nodes are always going to be "querying". >>>>>>>> >>>>>>> >>>>>>> I don't think it's unreasonable to want to have nodes that don't >>>>>>> accept queries. This is just Ishan's use case. >>>>>>> >>>>>>> >>>>>>>> if no live nodes have roles=overseer (or roles=all), then we >>>>>>>> should just select any node to be overseer. This should be the same for >>>>>>>> compute, data, etc. >>>>>>>> >>>>>>> >>>>>>> Definitely not what I would like. If I'm going to try to segregate >>>>>>> data nodes out to certain nodes, I don't want them appearing elsewhere >>>>>>> just >>>>>>> cause something went down or filled up. Nor would I want updates to >>>>>>> suddenly start bogging down my query nodes.... >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> So, for the proposal, let's say "data" is a special role which is >>>>>>>>> assumed by default, and is enabled on all nodes unless there's a >>>>>>>>> !data. >>>>>>>>> >>>>>>>> >>>>>>>> Instead of this, maybe we have role groups. Such as >>>>>>>> admin~=overseer,zk or worker~=compute,data,updateProcessing >>>>>>>> >>>>>>> >>>>>>> Role groups sound like a next level feature to be built on top once >>>>>>> we figure out how roles work independently. >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> As for the suggested Roles, I'm not sure ADMIN or UI really fit, >>>>>>>> since there is another option to disable the UI for a solr node, and >>>>>>>> various ADMIN commands have to be accepted across other node roles. >>>>>>>> (Data >>>>>>>> nodes require the Collections API, same with the overseer.)
>>>>>>>> >>>>>>> >>>>>>> I admit I'm angling towards a world in which enabling and disabling >>>>>>> broad features is done in one way in one place... As for admin there >>>>>>> might >>>>>>> be a distinction between commands issued between nodes and from the >>>>>>> outside >>>>>>> world... I have this other idea about inter-node communication that's >>>>>>> even >>>>>>> less baked that I wont distract with here ;) >>>>>>> >>>>>>> >>>>>>>> - Houston >>>>>>>> >>>>>>>> On Wed, Oct 27, 2021 at 1:34 PM Ishan Chattopadhyaya < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> bq. In other words, roles are all "positive", but their >>>>>>>>> consequences are only negative (rejecting when the matching positive >>>>>>>>> role >>>>>>>>> is not present). >>>>>>>>> >>>>>>>>> Essentially, yes. A node that doesn't specify any role should be >>>>>>>>> able to do everything. >>>>>>>>> >>>>>>>>> Let me just take a brief detour and mention our thoughts on the >>>>>>>>> "query" role. While all data nodes can also be used for querying, our >>>>>>>>> idea >>>>>>>>> was to create a layer of nodes that have some special mechanism to be >>>>>>>>> able >>>>>>>>> to proxy/forward queries to data nodes (lets call it "pseudo cores" or >>>>>>>>> "synthetic cores" or "proxy cores". Our thought was that any node >>>>>>>>> that has >>>>>>>>> "query,!data" role would enable this special mode on startup (whereby >>>>>>>>> requests are served by these special pseudo cores). We'll discuss >>>>>>>>> about >>>>>>>>> this in detail in that issue. >>>>>>>>> >>>>>>>>> Back to the main subject here. >>>>>>>>> >>>>>>>>> Lets take a practical scenario: >>>>>>>>> * Layer1: Organization has about 100 nodes, each node has many >>>>>>>>> data replicas >>>>>>>>> * Layer2: To manage such a large cluster reliably, they keep aside >>>>>>>>> 4-5 dedicated overseer nodes. >>>>>>>>> * Layer3: Since query aggregations/coordination can potentially be >>>>>>>>> expensive, they keep aside 5-10 query nodes. 
>>>>>>>>> >>>>>>>>> My preference would be as follows: >>>>>>>>> * I'd like to refer to Layer1 nodes as the "data nodes" and hence >>>>>>>>> get either no role defined for them or -Dnode.roles=data. >>>>>>>>> * I'd like to refer to Layer2 nodes as "overseer nodes" (even >>>>>>>>> though I understand, only one of them can be an overseer at a time). >>>>>>>>> I'd >>>>>>>>> like to have -Dnode.roles=!data,overseer >>>>>>>>> * I'd like to refer to Layer3 nodes as "query nodes", with >>>>>>>>> -Dnode.roles=!data,query >>>>>>>>> >>>>>>>>> ^ This seems very practical from operational standpoint. >>>>>>>>> >>>>>>>>> So, for the proposal, lets say "data" is a special role which is >>>>>>>>> assumed by default, and is enabled on all nodes unless there's a >>>>>>>>> !data. It >>>>>>>>> is presumed that data nodes can also serve queries directly, so >>>>>>>>> adding a >>>>>>>>> "query" to those nodes is meaningless (also because there's no >>>>>>>>> practical >>>>>>>>> benefit to stopping a data node from receiving a query for "!query" >>>>>>>>> role to >>>>>>>>> be useful). >>>>>>>>> >>>>>>>>> "query" role on nodes that don't host data really refers to a >>>>>>>>> special capability for lightweight, stateless nodes. I don't want to >>>>>>>>> add a >>>>>>>>> "!query" on dedicated overseer nodes, and hence I don't want to >>>>>>>>> assume that >>>>>>>>> "query" is implicitly avaiable on any node even if the role isn't >>>>>>>>> specified. >>>>>>>>> >>>>>>>>> "overseer" role is complicated, since it is already defined and we >>>>>>>>> don't have the opportunity to define it the right way. I'd hate >>>>>>>>> having to >>>>>>>>> put a "!overseer" on every data node on startup in order to have a few >>>>>>>>> dedicated overseers. >>>>>>>>> >>>>>>>>> In short, in this SIP, I just wish to implement the concept of >>>>>>>>> nodes and its handling. How individual roles are leveraged can be up >>>>>>>>> to >>>>>>>>> every new role's implementation. 
>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Oct 27, 2021 at 9:54 PM Gus Heck <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> In other words, roles are all "positive", but their consequences >>>>>>>>>>> are only negative (rejecting when the matching positive role is not >>>>>>>>>>> present). >>>>>>>>>>> >>>>>>>>>>> Yeah right. to do something the machine needs the role >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> We can also consider no role defined = all roles allowed. Will >>>>>>>>>>> make things simpler. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> in terms of startup command yes. Internally we should have all >>>>>>>>>> explicitly assigned when no roles are specified at startup so that >>>>>>>>>> the code >>>>>>>>>> doesn't have a million if checks for the empty case >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Oct 27, 2021 at 6:14 PM Ilan Ginzburg < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> How do we expect the roles to be used? >>>>>>>>>>>> One way I see is a node refusing to do anything related to a >>>>>>>>>>>> role it doesn't have. >>>>>>>>>>>> For example if a node does not have role "data", any attempt to >>>>>>>>>>>> create a core on it would fail. >>>>>>>>>>>> A node not having the role "query", will refuse to have >>>>>>>>>>>> anything to do with handling a query etc. >>>>>>>>>>>> Then it would be up to other code to make sure only the >>>>>>>>>>>> appropriate nodes are requested to do any type of action. >>>>>>>>>>>> So for example any replica placement code plugin would have to >>>>>>>>>>>> restrict the set of candidate nodes for a new replica placement to >>>>>>>>>>>> those >>>>>>>>>>>> having "data". Otherwise the call would fail, and there should be >>>>>>>>>>>> nothing >>>>>>>>>>>> the replica placement code can do about it. >>>>>>>>>>>> >>>>>>>>>>>> Similarly, the "overseer" role would limit the nodes that >>>>>>>>>>>> participate in the Overseer election. 
The Overseer election code >>>>>>>>>>>> would have >>>>>>>>>>>> to remove (or not add) all non qualifying nodes from the election, >>>>>>>>>>>> and we >>>>>>>>>>>> should expect a node without role "overseer" to refuse to start the >>>>>>>>>>>> Overseer machinery if asked to... >>>>>>>>>>>> >>>>>>>>>>>> Trying to make the use case clear regarding how roles are used. >>>>>>>>>>>> Ilan >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Oct 27, 2021 at 5:47 PM Gus Heck <[email protected]> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Oct 27, 2021 at 9:55 AM Ishan Chattopadhyaya < >>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Gus, >>>>>>>>>>>>>> >>>>>>>>>>>>>> > I think that we should expand/edit your list of roles to be >>>>>>>>>>>>>> >>>>>>>>>>>>>> The list can be expanded as and when more isolation and >>>>>>>>>>>>>> features are needed. I only listed those roles that we already >>>>>>>>>>>>>> have a >>>>>>>>>>>>>> functionality for or is under development. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Well all of those roles (except zookeeper) are things nodes do >>>>>>>>>>>>> today. As it stands they are all doing all of them. What we add >>>>>>>>>>>>> support for >>>>>>>>>>>>> as we move forward is starting without a role, and add the >>>>>>>>>>>>> zookeeper role >>>>>>>>>>>>> when that feature is ready. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> > I would like to recommend that the roles be all positive >>>>>>>>>>>>>> ("Can do this") and nodes with no role at all are ineligible for >>>>>>>>>>>>>> all >>>>>>>>>>>>>> activities. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It comes down to the defaults and backcompat. If we want all >>>>>>>>>>>>>> Solr nodes to be able to host data replicas by default (without >>>>>>>>>>>>>> user >>>>>>>>>>>>>> explicitly specifying role=data), then we need a way to unset >>>>>>>>>>>>>> this role. >>>>>>>>>>>>>> The most reasonable way sounded like a "!data". 
We can do away >>>>>>>>>>>>>> with !data >>>>>>>>>>>>>> if we mandate each and every data node have the role "data" >>>>>>>>>>>>>> explicitly >>>>>>>>>>>>>> defined for it, which breaks backcompat and also is cumbersome >>>>>>>>>>>>>> to use for >>>>>>>>>>>>>> those who don't want to use these special roles. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> Not sure I understand, which of the roles I mentioned (other >>>>>>>>>>>>> than zookeeper, which I expect is intended as different from our >>>>>>>>>>>>> current >>>>>>>>>>>>> embedded zk) is NOT currently supported by a single cloud node >>>>>>>>>>>>> brought up >>>>>>>>>>>>> as shown in our tutorials/docs? I'm certainly not proposing that >>>>>>>>>>>>> the >>>>>>>>>>>>> default change to nothing. The default is all roles, unless you >>>>>>>>>>>>> specify >>>>>>>>>>>>> roles at startup. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> > I also suggest that these roles each have a node in >>>>>>>>>>>>>> zookeeper listing the current member nodes (as child nodes) so >>>>>>>>>>>>>> that code >>>>>>>>>>>>>> that wants to find a node with an appropriate role does not need >>>>>>>>>>>>>> to scan >>>>>>>>>>>>>> the list of all nodes parsing something to discover which nodes >>>>>>>>>>>>>> apply and >>>>>>>>>>>>>> also does not have to parse json to do it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> /roles.json exists today, it has role as key and list of >>>>>>>>>>>>>> nodes as value. In the next major version, we can change the >>>>>>>>>>>>>> format of that >>>>>>>>>>>>>> file and use key as node, value as list of roles. Or, maybe we >>>>>>>>>>>>>> can go for >>>>>>>>>>>>>> adding the roles to the data for each item in the list of >>>>>>>>>>>>>> live_nodes. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> I'm not finding anything in our documentation about roles.json >>>>>>>>>>>>> so I think it's an internal implementation detail, which reduces >>>>>>>>>>>>> back >>>>>>>>>>>>> compat concerns. 
ADDROLE/REMOVEROLE don't accept json or anything >>>>>>>>>>>>> like that >>>>>>>>>>>>> and could be made to work with zk nodes too. >>>>>>>>>>>>> >>>>>>>>>>>>> The fact that some precursor work was done without a SIP (or >>>>>>>>>>>>> before SIPs existed) should not hamstring our design once a SIP >>>>>>>>>>>>> that >>>>>>>>>>>>> clearly covers the same topic is under consideration. By their >>>>>>>>>>>>> nature SIP's >>>>>>>>>>>>> are non-trivial and often will include compatibility breaks. Good >>>>>>>>>>>>> news is I >>>>>>>>>>>>> don't think I see one here, just a code change to transition to a >>>>>>>>>>>>> different >>>>>>>>>>>>> zk backend. I think that it's probably a mistake to consider our >>>>>>>>>>>>> zookeeper >>>>>>>>>>>>> data a public API and we should be moving away from that or at >>>>>>>>>>>>> the very >>>>>>>>>>>>> least segregating clearly what in zk is long term reliable. >>>>>>>>>>>>> Ideally our >>>>>>>>>>>>> v1/v2 api's should be the public api through which information >>>>>>>>>>>>> about the >>>>>>>>>>>>> cluster is obtained. Programming directly against zk is kind of >>>>>>>>>>>>> like a >>>>>>>>>>>>> custom build of solr. Sometimes useful and appropriate, but >>>>>>>>>>>>> maintenance is >>>>>>>>>>>>> your concern. For code plugging into solr, it should in theory be >>>>>>>>>>>>> against >>>>>>>>>>>>> an internal information java api, and zookeeper should not be >>>>>>>>>>>>> touched >>>>>>>>>>>>> directly. (I know this is not in a good state or at least wasn't >>>>>>>>>>>>> last time >>>>>>>>>>>>> I looked closely, but it should be where we are heading). >>>>>>>>>>>>> >>>>>>>>>>>>> > any code seeking to transition a node >>>>>>>>>>>>>> >>>>>>>>>>>>>> We considered this situation and realized that it is very >>>>>>>>>>>>>> risky to have nodes change roles while they are up and running. >>>>>>>>>>>>>> Better to >>>>>>>>>>>>>> assign fixed roles upon startup. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I agree that concurrency is hard. 
I definitely think startup >>>>>>>>>>>>> time assignments should be involved here. I'm not thinking that >>>>>>>>>>>>> every >>>>>>>>>>>>> transition must be supported. As a starting point it would be >>>>>>>>>>>>> fine if none >>>>>>>>>>>>> were. Having something suddenly become zookeeper is probably >>>>>>>>>>>>> tricky to >>>>>>>>>>>>> support (see discussion in that thread regarding nodes not >>>>>>>>>>>>> actually >>>>>>>>>>>>> participating until they have a partner to join with them to >>>>>>>>>>>>> avoid even >>>>>>>>>>>>> numbered clusters), but I think the design should not preclude the >>>>>>>>>>>>> possibility of nodes becoming eligible for some roles or >>>>>>>>>>>>> withdrawing from >>>>>>>>>>>>> some roles, and treatment of roles should be consistent. In some >>>>>>>>>>>>> cases >>>>>>>>>>>>> someone may decide it's worth the work of handling the concurrency >>>>>>>>>>>>> concerns, best if they don't have to break back compat or hack >>>>>>>>>>>>> their code >>>>>>>>>>>>> around the assumption it wouldn't happen to do it. >>>>>>>>>>>>> >>>>>>>>>>>>> Taking the zookeeper case as an example, it very much might be >>>>>>>>>>>>> desirable to have the possibility to heal the zk cluster by >>>>>>>>>>>>> promoting >>>>>>>>>>>>> another node (configured as eligible for zk) to active zk duty if >>>>>>>>>>>>> one of >>>>>>>>>>>>> the current zk nodes has been down long enough (say on prem >>>>>>>>>>>>> hardware, >>>>>>>>>>>>> motherboard pops a capacitor, server gone for a week while new >>>>>>>>>>>>> hardware is >>>>>>>>>>>>> purchased, built and configured). Especially if the down node >>>>>>>>>>>>> didn't hold >>>>>>>>>>>>> data or other nodes had sufficient replicas and the cluster is >>>>>>>>>>>>> still >>>>>>>>>>>>> answering queries just fine. 
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> > I know of a case that would benefit from having separate >>>>>>>>>>>>>> Query/Update nodes that handle a heavy analysis process which >>>>>>>>>>>>>> would be >>>>>>>>>>>>>> deployed to a number of CPU heavy boxes (which might add more in >>>>>>>>>>>>>> prep for >>>>>>>>>>>>>> bulk indexing, and remove them when bulk was done), data could >>>>>>>>>>>>>> then be >>>>>>>>>>>>>> hosted on cheaper nodes.... >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is the main motivation behind this work. SOLR-15715 >>>>>>>>>>>>>> needs this, and hence it would be good to get this in as soon as >>>>>>>>>>>>>> possible. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I think we can incrementally work towards configurability for >>>>>>>>>>>>> all of these roles. The current default state is that a node has >>>>>>>>>>>>> all roles >>>>>>>>>>>>> and the incremental progress is to enable removing a role from a >>>>>>>>>>>>> node. This >>>>>>>>>>>>> I think is why it might be good to >>>>>>>>>>>>> >>>>>>>>>>>>> A) Determine the set of roles our current solr nodes are >>>>>>>>>>>>> performing (that might be removed in some scenario) and document >>>>>>>>>>>>> this via >>>>>>>>>>>>> assigning these roles as on by default when this SIP goes live. >>>>>>>>>>>>> B) Figure out what the process of adding something entirely >>>>>>>>>>>>> new that we haven't yet thought of with its own role would look >>>>>>>>>>>>> like. >>>>>>>>>>>>> >>>>>>>>>>>>> I think it would be great if we not only satisfied the current >>>>>>>>>>>>> need but determined how we expect this to change over time.
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>> Ishan >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 6:32 PM Gus Heck <[email protected]> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The SIP looks like a good start, and I was already thinking >>>>>>>>>>>>>>> of something very similar to this as a follow on to my attempts >>>>>>>>>>>>>>> to split >>>>>>>>>>>>>>> the uber filter (SolrDispatchFilter) into servlets such that >>>>>>>>>>>>>>> roles >>>>>>>>>>>>>>> determine what servlets are deployed, but I would like to >>>>>>>>>>>>>>> recommend that >>>>>>>>>>>>>>> the roles be all positive ("Can do this") and nodes with no >>>>>>>>>>>>>>> role at all are >>>>>>>>>>>>>>> ineligible for all activities. (just like standard role >>>>>>>>>>>>>>> permissioning >>>>>>>>>>>>>>> systems). This will make it much more familiar and easy to >>>>>>>>>>>>>>> think about. >>>>>>>>>>>>>>> Therefore there would be no need for a role such as !data which >>>>>>>>>>>>>>> I presume >>>>>>>>>>>>>>> was meant to mean "no data on this node"... rather just don't >>>>>>>>>>>>>>> give the >>>>>>>>>>>>>>> "data" role to the node. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Additional node roles I think should exist: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think that we should expand/edit your list of roles to be >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - QUERY - accepts and analyzes queries up to the point >>>>>>>>>>>>>>> of actually consulting the lucene index (useful if you have >>>>>>>>>>>>>>> a very heavy >>>>>>>>>>>>>>> analysis phase) >>>>>>>>>>>>>>> - UPDATE - accepts update requests, and performs update >>>>>>>>>>>>>>> functionality prior to and including >>>>>>>>>>>>>>> DistributedUpdateProcessorFactory >>>>>>>>>>>>>>> (useful if you have a very heavy analysis phase) >>>>>>>>>>>>>>> - ADMIN - accepts admin/management commands >>>>>>>>>>>>>>> - UI - hosts an admin ui >>>>>>>>>>>>>>> - ZOOKEEPER - hosts embedded zookeeper >>>>>>>>>>>>>>> - OVERSEER - performs overseer related functionality >>>>>>>>>>>>>>> (though IIRC there's a proposal to eliminate overseer that >>>>>>>>>>>>>>> might eliminate >>>>>>>>>>>>>>> this) >>>>>>>>>>>>>>> - DATA - nodes where there is a lucene index and >>>>>>>>>>>>>>> matching against the analyzed results of a query may be >>>>>>>>>>>>>>> conducted to >>>>>>>>>>>>>>> generate a response, also performs update steps that come >>>>>>>>>>>>>>> after >>>>>>>>>>>>>>> DistributedUpdateProcessorFactory >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I also suggest that these roles each have a node in >>>>>>>>>>>>>>> zookeeper listing the current member nodes (as child nodes) so >>>>>>>>>>>>>>> that code >>>>>>>>>>>>>>> that wants to find a node with an appropriate role does not >>>>>>>>>>>>>>> need to scan >>>>>>>>>>>>>>> the list of all nodes parsing something to discover which nodes >>>>>>>>>>>>>>> apply and >>>>>>>>>>>>>>> also does not have to parse json to do it. I think this will be >>>>>>>>>>>>>>> particularly key for zookeeper nodes which might be 3 out of >>>>>>>>>>>>>>> 100 or more >>>>>>>>>>>>>>> nodes. Similar to how we track live nodes.
I think we should >>>>>>>>>>>>>>> have a >>>>>>>>>>>>>>> nodes.json too that tracks what roles a node is ALLOWED to take >>>>>>>>>>>>>>> (as opposed >>>>>>>>>>>>>>> to which roles it currently servicing) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> So running code consults the zookeeper role list of nodes, >>>>>>>>>>>>>>> and any code seeking to transition a node (an admin operation >>>>>>>>>>>>>>> with much >>>>>>>>>>>>>>> lower performance requirements) consults the json data in the >>>>>>>>>>>>>>> nodes.json >>>>>>>>>>>>>>> node, parses it, finds the node in question and checks what >>>>>>>>>>>>>>> it's eligible >>>>>>>>>>>>>>> for (this will correspond to which servlets/apps have been >>>>>>>>>>>>>>> loaded). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I know of a case that would benefit from having separate >>>>>>>>>>>>>>> Query/Update nodes that handle a heavy analysis process which >>>>>>>>>>>>>>> would be >>>>>>>>>>>>>>> deployed to a number of CPU heavy boxes (which might add more >>>>>>>>>>>>>>> in prep for >>>>>>>>>>>>>>> bulk indexing, and remove them when bulk was done), data could >>>>>>>>>>>>>>> then be >>>>>>>>>>>>>>> hosted on cheaper nodes.... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Also maybe think about how this relates to NRT/TLOG/PULL >>>>>>>>>>>>>>> which are also maybe role like >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> WDYT? 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -Gus >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Oct 27, 2021 at 3:17 AM Ishan Chattopadhyaya < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here's an SIP for introducing the concept of node roles: >>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-15694 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/SOLR/SIP-15+Node+roles >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We also wish to add first class support for Query nodes >>>>>>>>>>>>>>>> that are used to process user queries by forwarding to data >>>>>>>>>>>>>>>> nodes, >>>>>>>>>>>>>>>> merging/aggregating them and presenting to users. This concept >>>>>>>>>>>>>>>> exists as >>>>>>>>>>>>>>>> first class citizens in most other search engines. This is a >>>>>>>>>>>>>>>> chance for >>>>>>>>>>>>>>>> Solr to catch up. >>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-15715 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>> Ishan / Noble / Hitesh >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> http://www.needhamsoftware.com (work) >>>>>>>>>>>>>>> http://www.the111shift.com (play) >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> http://www.needhamsoftware.com (work) >>>>>>>>>>>>> http://www.the111shift.com (play) >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> http://www.needhamsoftware.com (work) >>>>>>>>>> http://www.the111shift.com (play) >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> http://www.needhamsoftware.com (work) >>>>>>> http://www.the111shift.com (play) >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> http://www.needhamsoftware.com (work) >>>>> http://www.the111shift.com (play) >>>>> >>>> > > -- > http://www.needhamsoftware.com (work) > http://www.the111shift.com (play) > -- http://www.needhamsoftware.com (work) http://www.the111shift.com (play)
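For reference, the positive-only rule discussed throughout the thread (no "!data", no roles given at startup means all roles, and a check that is just getRoles().contains(role)) could look roughly like the sketch below. This is an assumption-laden illustration, not the SIP's implementation: the class, the parse method and the three role names are hypothetical, and the comma-separated input merely imitates the -Dnode.roles=... examples quoted above.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the SIP-15 implementation.
class NodeRolesSketch {
    enum Role { DATA, OVERSEER, COORDINATOR }

    private final Set<Role> roles;

    private NodeRolesSketch(Set<Role> roles) {
        this.roles = roles;
    }

    // Parses a comma-separated role list such as "data,overseer".
    static NodeRolesSketch parse(String spec) {
        // No roles given: assign every role explicitly, so later code
        // never needs a special case for "empty means everything".
        if (spec == null || spec.trim().isEmpty()) {
            return new NodeRolesSketch(EnumSet.allOf(Role.class));
        }
        EnumSet<Role> parsed = EnumSet.noneOf(Role.class);
        for (String token : spec.split(",")) {
            String name = token.trim();
            if (name.startsWith("!")) {
                throw new IllegalArgumentException(
                    "negative roles are not supported: " + name);
            }
            // Role.valueOf throws IllegalArgumentException for unknown names.
            parsed.add(Role.valueOf(name.toUpperCase(Locale.ROOT)));
        }
        return new NodeRolesSketch(parsed);
    }

    // No implied roles: returns exactly what was declared.
    Set<Role> getRoles() {
        return Collections.unmodifiableSet(roles);
    }

    // A node refuses work for a role it does not have.
    void createCore(String coreName) {
        if (!getRoles().contains(Role.DATA)) {
            throw new IllegalStateException(
                "node lacks the DATA role, refusing to create core " + coreName);
        }
        // Core creation itself is out of scope for this sketch.
    }
}
```

With this shape, a dedicated overseer is started with only "overseer" in its list rather than "!data,overseer", and replica placement code simply filters candidate nodes on getRoles().contains(Role.DATA).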
