Hi Radu

Thanks for the reply - I'm starting to lean that way myself: creating
a separate collection for each set of data, so that I can control the
scaling of each collection more easily, e.g. increasing the replication
factor on those that will be queried more. I was looking at Category
Routed Aliases, but they seem to have quite a few gotchas:

* Can't restrict the collections queried - even specifying the exact
collection to query, e.g. "collections=items__CRA__2020" (which
exists), returns no results. Querying the underlying collection
directly by name also returns no results. I only get results with
collections=items__CRA - it's as if the underlying collection thinks
its name really is "items__CRA" rather than "items__CRA__2020".
* Some problems when indexing to a new category - I get errors the
first time a category value is encountered.
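For reference, this is roughly what I was trying (a sketch, assuming a
local test cluster on the default port; the collection names are from
my test setup):

```shell
# Querying via the alias works and returns results:
curl 'http://localhost:8983/solr/items__CRA/select?q=*:*'

# But restricting the alias query to one underlying collection returns
# no results, even though items__CRA__2020 exists:
curl 'http://localhost:8983/solr/items__CRA/select?q=*:*&collections=items__CRA__2020'

# Querying the underlying collection directly also returns nothing:
curl 'http://localhost:8983/solr/items__CRA__2020/select?q=*:*'
```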

Looks like it might be manually set up and managed collections and
aliases for now.
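Roughly what I have in mind for the manual approach (a sketch against
the Collections API, assuming one collection per year on a default-port
cluster; the names and replica counts are illustrative):

```shell
# Create one collection per year, with the TLOG replica count tuned
# per year (hotter years get more replicas):
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=items_2020&numShards=1&tlogReplicas=4'
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=items_2019&numShards=1&tlogReplicas=3'

# A standard alias spanning all the years, for the occasional
# "search everything" queries:
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=items&collections=items_2020,items_2019,items_2018,items_2017'
```

Adding a new year would then be a CREATE plus re-issuing the
CREATEALIAS with the new collection included.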

Cheers

Tom

On Mon, Jun 8, 2020 at 12:43 PM Radu Gheorghe
<radu.gheor...@sematext.com> wrote:
>
> Hi Tom,
>
> To your last two questions, I'd like to venture an alternative design:
> have dedicated "hot" and "warm" nodes. That is, 2020 + lists will go to
> the hot tier, and 2019, 2018, 2017 + lists go to the warm tier.
>
> Then you can scale the hot tier based on your query load. For the warm
> tier, I assume there will be less need for scaling, and if there is, I
> guess it's less important for the shards of each index to be perfectly
> balanced (so a simple "make sure cores are evenly distributed" rule
> should be enough).
>
> Granted, this design isn't as flexible as the one you suggested, but it's
> simpler. So simple that I've seen it done without autoscaling (just a few
> scripts to run when you add nodes to each tier).
>
> Best regards,
> Radu
>
> https://sematext.com
>
> vin., 5 iun. 2020, 21:59 Tom Evans <tevans...@googlemail.com.invalid> a
> scris:
>
> > Hi
> >
> > I'm trying to get a handle on the newer auto-scaling features in Solr.
> > We're in the process of upgrading an older SolrCloud cluster from 5.5
> > to 8.5, and re-architecting it slightly to improve performance and
> > automate operations.
> >
> > If I boil it down slightly, currently we have two collections, "items"
> > and "lists". Both collections have just one shard. We publish new data
> > to "items" once each day, and our users search and do analysis on it,
> > whilst "lists" contains NRT user-curated sets of ids from "items",
> > which we join to from "items" in order to let users restrict their
> > searches/analysis to just the docs in their curated lists.
> >
> > Most of our searches have specific date ranges in them, usually only
> > from the last 3 years or so, but sometimes we need to do searches
> > across all the data. With the new setup, we want to:
> >
> > * shard by date (year) to make the hottest data available in smaller shards
> > * have more nodes with these shards than we do of the older data.
> > * be able to add/remove nodes predictably based upon our clients
> > (predictable) query load
> > * use TLOG for "items" and NRT for "lists", to avoid unnecessary
> > indexing load for "items" and have NRT for "lists".
> > * spread cores across two AZ
> >
> > With that in mind, I came up with a bunch of simplified rules for
> > testing, with just 4 shards for "items":
> >
> > * "lists" collection has one NRT replica on each node
> > * "items" collection shard 2020 has one TLOG replica on each node
> > * "items" collection shard 2019 has one TLOG replica on 75% of nodes
> > * "items" collection shards 2018 and 2017 each have one TLOG replica
> > on 50% of nodes
> > * all shards have at least 2 replicas if number of nodes > 1
> > * no node should have 2 replicas of the same shard
> > * number of cores should be balanced across nodes
> >
> > Eg, with 1 node, I want to see this topology:
> > A: items: 2020, 2019, 2018, 2017 + lists
> >
> > with 2 nodes:
> > A: items: 2020, 2019, 2018, 2017 + lists
> > B: items: 2020, 2019, 2018, 2017 + lists
> >
> > and if I add two more nodes:
> > A: items: 2020, 2019, 2018 + lists
> > B: items: 2020, 2019, 2017 + lists
> > C: items: 2020, 2019, 2017 + lists
> > D: items: 2020, 2018 + lists
> >
> > To the questions:
> >
> > * The type of replica created when nodeAdded is triggered can't be set
> > per collection. Either everything gets NRT or everything gets TLOG.
> > Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
> > will add NRT replicas if configured that way.
> > * I'm having difficulty expressing these rules in terms of a policy -
> > I can't seem to figure out a way to specify the number of replicas for
> > a shard based upon the total number of nodes.
> > * Is this beyond the current scope of autoscaling triggers/policies?
> > Should I instead use the trigger with a custom plugin action (or to
> > trigger a web hook) to be a bit more intelligent?
> > * Am I wasting my time trying to ensure there are more replicas of the
> > hotter shards than of the colder shards? It seems to add a lot of
> > complexity - should I instead just accept that the cold shards aren't
> > queried much, so they won't use up the cache space that the hot shards
> > need? Disk space is pretty cheap after all (total size for
> > "items" + "lists" is under 60GB).
> >
> > Cheers
> >
> > Tom
> >
