Dealing with changing/overlapping leaders was the main challenge in the
implementation.
Logic such as:

if (iAmLeader()) {
   doThings();
}

can have multiple participants running doThings() at the same time,
because iAmLeader() can change right after it was checked. The only way
out with such an approach is to use barriers (the old leader explicitly
giving up leadership before the new one takes over… sounds familiar? 😉).
This is complicated.
What if the old leader is considered gone (so it can't explicitly give up
leadership) but is not actually gone? Not being seen by the quorum does not
automatically imply not being able to write to S3.
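
To make the race concrete, here is a minimal standalone sketch (plain Java,
not Solr code) where leadership moves between the check and the action, so
both nodes end up "doing things" concurrently:

import java.util.concurrent.atomic.AtomicReference;

public class LeaderRace {
    // Stand-in for the cluster's view of who the leader currently is.
    static final AtomicReference<String> leader = new AtomicReference<>("node1");

    static void maybeDoThings(String myId) throws InterruptedException {
        if (myId.equals(leader.get())) {                 // check...
            Thread.sleep(100);                           // ...leadership changes while we're here...
            System.out.println(myId + " writes to S3");  // ...both nodes can reach this line
        }
    }

    public static void main(String[] args) throws Exception {
        Thread node1 = new Thread(() -> {
            try { maybeDoThings("node1"); } catch (InterruptedException ignored) {}
        });
        node1.start();
        Thread.sleep(20);
        leader.set("node2");      // leadership moves after node1 passed its check
        maybeDoThings("node2");
        node1.join();             // prints two "writes to S3" lines, one per node
    }
}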

We solved this by having writes to S3 (or indeed any storage; we added an
abstraction layer) use random file names (Solr file name + random suffix),
so that two concurrent nodes cannot overwrite each other even when they are
writing similarly named segments/files.
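
As a rough illustration of that naming scheme (a hypothetical helper, not
the actual code; the real convention may differ):

import java.util.UUID;

public class RandomSuffixNames {
    // e.g. "_0.cfs" -> "_0.cfs.3f1c9a2e-..." : each writer gets its own key
    // on the remote store, so concurrent pushes of the same Solr file name
    // cannot collide.
    static String remoteFileName(String solrFileName) {
        return solrFileName + "." + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(remoteFileName("_0.cfs"));
        System.out.println(remoteFileName("_0.cfs")); // a second writer gets a different name
    }
}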

Then we used a conditional update in ZooKeeper (one per write of one or more
segments, not one per file) so that only one of the two nodes "wins" the
write to S3. The data written by the losing node is ignored and never
becomes part of the S3 image of the shard.
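
A sketch of that conditional-update idea using the plain ZooKeeper client
API directly (not the actual Solr code; the metadata path and class name
are hypothetical):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ShardMetadataUpdater {
    private final ZooKeeper zk;
    private final String metadataPath; // hypothetical, e.g. "/collections/coll1/shard1/metadata"

    ShardMetadataUpdater(ZooKeeper zk, String metadataPath) {
        this.zk = zk;
        this.metadataPath = metadataPath;
    }

    // Returns true if this node won the push; false if another node updated
    // the shard metadata first, in which case the files this node uploaded
    // are never referenced and so never become part of the shard's image.
    boolean tryPublish(byte[] newShardMetadata) throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData(metadataPath, false, stat);  // read current metadata and its version
        try {
            // Conditional update: succeeds only if nobody changed the znode since the read.
            zk.setData(metadataPath, newShardMetadata, stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;  // lost the race to another writer
        }
    }
}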

Indeed, running Solr from the local disk is essential (that's the cache
aspect): local disk offers roughly two orders of magnitude more space than
memory.

And indeed, we run smaller shard sizes!

Thanks everybody for the feedback so far.

Ilan

On Sat 29 Apr 2023 at 07:08, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/28/23 11:33, Ilan Ginzburg wrote:
> > Salesforce has been working for a while on separating compute from
> > storage in SolrCloud, see presentation at Activate 2019 SolrCloud in
> > Public Cloud: Scaling Compute Independently from Storage
> > <https://youtu.be/6fE5KvOfb6A>.
> > In a nutshell, the idea is that all SolrCloud nodes are stateless, have
> > a local disk cache of the cores they're hosting but no persistent
> > volumes (no persistent indexes nor transaction logs), and shard level
> > persistence is done on S3.
>
> This is a very intriguing idea!  I think it would be particularly useful
> for containerized setups that can add or remove nodes to meet changing
> demands.
>
> My primary concern when I first looked at this was that with
> network-based storage there would be little opportunity for caching, and
> caching is SUPER critical for Solr performance.  Then when I began
> writing this reply, I saw above that you're talking about a local disk
> cache... so maybe that is not something to worry about.
>
> Bandwidth and latency limitations to/from the shared storage are another
> concern, especially with big indexes that have segments up to 5GB.
> Increasing the merge tier sizes and reducing the max segment size is
> probably a very good idea.
>
> Another challenge:  Ensuring that switching leaders happens reasonably
> quickly while making sure that there cannot be multiple replicas
> thinking they are leader at the same time.  Making the leader fencing
> bulletproof is a critical piece of this.  I suspect that the existing
> leader fencing could use some work, affecting SolrCloud in general.
>
> I don't want to get too deep in technical weeds, mostly because I do not
> understand all the details ... but I am curious about something that
> might affect this:  Are ephemeral znodes created by one Solr node
> visible to other Solr nodes?  If they are, then I think ZK would provide
> all the fencing needed, and could also keep track of the segments that
> exist in the remote storage so follower replicas can quickly keep up
> with the leader.
>
> There could also be implementations for more mundane shared storage like
> SMB or NFS.
>
> Thanks,
> Shawn
