Hi Radu,

Thanks for the reply - I'm starting to lean that way myself: create a
different collection for each set of data, so that I can more easily
control the scaling on each collection, e.g. increase the replication
factor on those that will be queried more. I was looking at Category
Routed Aliases, but they seem to have quite a few gotchas:
* Can't restrict the collections queried - even specifying the exact
collection to query, e.g. "collections=items__CRA__2020" (which exists),
returns no results. Even querying the underlying collection directly and
specifying its name returns no results. I only get results with
"collections=items__CRA" - it's as if the underlying collection thinks
its name really is "items__CRA" rather than "items__CRA__2020".
* Some problems with indexing to a new category - I get errors the first
time a category is encountered.

Looks like it will be manually set up and managed collections and
aliases for now.

Cheers

Tom

On Mon, Jun 8, 2020 at 12:43 PM Radu Gheorghe <radu.gheor...@sematext.com>
wrote:
>
> Hi Tom,
>
> To your last two questions, I'd like to venture an alternative design:
> have dedicated "hot" and "warm" nodes. That is, 2020 + lists would go
> to the hot tier, and 2019, 2018, 2017 + lists would go to the warm tier.
>
> Then you can scale the hot tier based on your query load. For the warm
> tier, I assume there will be less need for scaling, and if there is, I
> guess it's less important for the shards of each index to be perfectly
> balanced (so a simple "make sure cores are evenly distributed" should
> be enough).
>
> Granted, this design isn't as flexible as the one you suggested, but
> it's simpler. So simple that I've seen it done without autoscaling
> (just a few scripts run when you add nodes to each tier).
>
> Best regards,
> Radu
>
> https://sematext.com
>
> Fri, Jun 5, 2020, 21:59 Tom Evans <tevans...@googlemail.com.invalid>
> wrote:
>
> > Hi
> >
> > I'm trying to get a handle on the newer autoscaling features in Solr.
> > We're in the process of upgrading an older SolrCloud cluster from 5.5
> > to 8.5, and re-architecting it slightly to improve performance and
> > automate operations.
> >
> > If I boil it down slightly, currently we have two collections,
> > "items" and "lists". Both collections have just one shard.
> > We publish new data to "items" once each day, and our users search
> > and do analysis on it, whilst "lists" contains NRT user-specified
> > collections of ids from items, which we join to from "items" in order
> > to allow them to restrict their searches/analysis to just the docs in
> > their curated lists.
> >
> > Most of our searches have specific date ranges in them, usually only
> > from the last 3 years or so, but sometimes we need to do searches
> > across all the data. With the new setup, we want to:
> >
> > * shard by date (year) to make the hottest data available in smaller
> > shards
> > * have more nodes with these shards than we do for the older data
> > * be able to add/remove nodes predictably based upon our clients'
> > (predictable) query load
> > * use TLOG for "items" and NRT for "lists", to avoid unnecessary
> > indexing load for "items" and have NRT for "lists"
> > * spread cores across two AZs
> >
> > With that in mind, I came up with a bunch of simplified rules for
> > testing, with just 4 shards for "items":
> >
> > * "lists" collection has one NRT replica on each node
> > * "items" collection shard 2020 has one TLOG replica on each node
> > * "items" collection shard 2019 has one TLOG replica on 75% of nodes
> > * "items" collection shards 2018 and 2017 each have one TLOG replica
> > on 50% of nodes
> > * all shards have at least 2 replicas if the number of nodes > 1
> > * no node should have 2 replicas of the same shard
> > * the number of cores should be balanced across nodes
> >
> > E.g., with 1 node, I want to see this topology:
> > A: items: 2020, 2019, 2018, 2017 + lists
> >
> > with 2 nodes:
> > A: items: 2020, 2019, 2018, 2017 + lists
> > B: items: 2020, 2019, 2018, 2017 + lists
> >
> > and if I add two more nodes:
> > A: items: 2020, 2019, 2018 + lists
> > B: items: 2020, 2019, 2017 + lists
> > C: items: 2020, 2019, 2017 + lists
> > D: items: 2020, 2018 + lists
> >
> > To the questions:
> >
> > * The type of replica created when nodeAdded is triggered can't be
> > set per collection. Either everything gets NRT or everything gets
> > TLOG. Even if I specify nrtReplicas=0 when creating a collection,
> > nodeAdded will add NRT replicas if configured that way.
> > * I'm having difficulty expressing these rules in terms of a policy -
> > I can't seem to figure out a way to specify the number of replicas
> > for a shard based upon the total number of nodes.
> > * Is this beyond the current scope of autoscaling triggers/policies?
> > Should I instead use the trigger with a custom plugin action (or one
> > that triggers a web hook) to be a bit more intelligent?
> > * Am I wasting my time trying to ensure there are more replicas of
> > the hotter shards than of the colder shards? It seems to add a lot of
> > complexity - should I instead just assume that they aren't getting
> > queried much, so they won't be using up cache space that the hot
> > shards will be using? Disk space is pretty cheap, after all (total
> > size for "items" + "lists" is under 60GB).
> >
> > Cheers
> >
> > Tom
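As a sanity check on the quoted rules, the desired replica count per
"items" shard for a given cluster size can be written down as a small
calculation. This is a hypothetical illustration only - the function name
and the `SHARD_FRACTIONS` table are mine, and this is not Solr autoscaling
policy syntax (expressing exactly this is the open question above) - but it
reproduces the 1-, 2- and 4-node topologies from the original mail:

```python
import math

# Fraction of nodes that should carry one TLOG replica of each "items"
# shard, per the simplified rules quoted above: 2020 on every node,
# 2019 on 75% of nodes, 2018 and 2017 each on 50% of nodes.
SHARD_FRACTIONS = {"2020": 1.0, "2019": 0.75, "2018": 0.5, "2017": 0.5}

def desired_replicas(num_nodes: int) -> dict:
    """Desired replica count per "items" shard for num_nodes nodes.

    Encodes three of the quoted rules: round the fraction up so small
    clusters still get coverage, floor at 2 replicas whenever there is
    more than one node, and cap at one replica per node (no node holds
    two replicas of the same shard).
    """
    counts = {}
    for shard, fraction in SHARD_FRACTIONS.items():
        n = math.ceil(num_nodes * fraction)
        if num_nodes > 1:
            n = max(n, 2)               # at least 2 replicas if nodes > 1
        counts[shard] = min(n, num_nodes)  # at most 1 replica per node
    return counts

# "lists" is simpler: one NRT replica per node, i.e. always num_nodes.
for nodes in (1, 2, 4):
    print(nodes, desired_replicas(nodes))
```

With 4 nodes this yields 2020: 4, 2019: 3, 2018: 2, 2017: 2, matching the
A-D example topology; actually placing those replicas while balancing core
counts across nodes is the part the policy language would still need to
cover.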