Hi,

This is a long message, apologies. If responses are positive, there will
likely be plenty of other opportunities to discuss the topics mentioned
here.

I'm trying to see if the community would be interested in a contribution
allowing SolrCloud nodes to be totally stateless, with persistent storage
handled by S3 (or a similar blob store).

Salesforce has been working for a while on separating compute from storage
in SolrCloud; see the Activate 2019 presentation "SolrCloud in Public
Cloud: Scaling Compute Independently from Storage"
<https://youtu.be/6fE5KvOfb6A>.
In a nutshell, the idea is that all SolrCloud nodes are stateless: they
have a local disk cache of the cores they're hosting but no persistent
volumes (no persistent indexes or transaction logs), and shard-level
persistence is done on S3.
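
For concreteness, the per-shard layout on S3 could look something like
this (purely illustrative; names and structure are assumptions, not our
actual layout):

<bucket>/<collection>/<shard>/
    shard.metadata         <- points at the current segment set (the
                              single per-shard copy, no copy per replica)
    index/<segment files>  <- immutable Lucene segments, shared by replicas
    tlog/<seq>             <- shard-level transaction log entries (the
                              improvement proposed further down)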

This is different from classic per-node remote storage:
- No "per node" transaction log (therefore no need to wait for down nodes
to come back up to recover their transaction log data)
- A single per-shard copy on S3 (not a copy per replica)
- Local non-persistent node volumes (SSDs for example) are a cache and are
used to serve queries and do indexing (so the local cache is limited by
disk size, not by memory as it is in HdfsDirectory for example)
- When adding a node and replicas, the index content for the replicas is
fetched from S3, not from other (possibly already overloaded) nodes, as in
the sketch just after this list
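
To make that last point concrete, here is a minimal sketch of hydrating a
node-local cache from the per-shard copy on S3 (AWS SDK v2; the bucket
name, key layout and class are illustrative assumptions, not our actual
code):

import java.nio.file.Files;
import java.nio.file.Path;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Object;

public class ShardCacheHydrator {
  private final S3Client s3 = S3Client.create();
  private final String bucket = "solr-shard-storage"; // assumed bucket

  // Download the shard's index files missing from the local cache.
  // Called when a replica is added or a node starts with a cold cache.
  public void hydrate(String collection, String shard, Path cacheDir)
      throws Exception {
    Files.createDirectories(cacheDir);
    String prefix = collection + "/" + shard + "/index/";
    for (S3Object obj :
        s3.listObjectsV2Paginator(b -> b.bucket(bucket).prefix(prefix))
          .contents()) {
      Path local = cacheDir.resolve(obj.key().substring(prefix.length()));
      if (!Files.exists(local)) { // segments are immutable: present == valid
        s3.getObject(
            GetObjectRequest.builder().bucket(bucket).key(obj.key()).build(),
            local);
      }
    }
  }
}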

The current state of our code is:
- It's a fork (it was shared as a branch for a while but eventually removed)
- We introduced a new replica type that is explicitly backed by remote
blob storage (and otherwise similar to TLOG) and that guards against
concurrent writes to S3: when shard leadership changes in SolrCloud, there
can be an overlap period with two nodes acting as leader, which can
corrupt the shard copy with a naive implementation of remote storage at
shard level (see the fencing sketch after this list)
- Currently we only push constructed segments to S3, which forces a commit
on every batch before acknowledging it to the client (we can't rely on a
node-local transaction log, given it would be lost upon node restart, the
node having no persistent storage).
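
The general shape of the guard against leadership overlap is optimistic
concurrency: segment files are immutable and uploaded first, then the
shard metadata pointing at them is published with a compare-and-set, so a
deposed leader fails cleanly instead of corrupting the shard copy. A
sketch of that idea using the version of a per-shard ZooKeeper znode
(illustrative only, not necessarily our actual mechanism):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class ShardWriteFence {
  private final ZooKeeper zk;

  public ShardWriteFence(ZooKeeper zk) {
    this.zk = zk;
  }

  // Publish new shard metadata only if nobody wrote since we read it.
  // setData() with an expected version throws BadVersionException if a
  // newer leader already published, fencing off the stale writer.
  public void publish(String shardZnode, byte[] newMetadata,
                      int versionWeRead, Runnable uploadSegmentsToS3)
      throws KeeperException, InterruptedException {
    uploadSegmentsToS3.run(); // immutable files: safe to upload first
    zk.setData(shardZnode, newMetadata, versionWeRead); // compare-and-set
  }
}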

We're considering improving this approach by making the transaction log a
shard-level abstraction (rather than a replica/node abstraction) and
storing it in S3 as well, with one transaction log per shard rather than
per replica. This would allow indexing to skip the commit on every batch,
speed up /update requests, push the constructed segments to S3
asynchronously, and still guarantee data durability while keeping nodes
stateless (so any number of nodes can be shut down at any time without
data loss and without having to restart them to recover data only they
can access).
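
A minimal sketch of such a shard-level transaction log on S3 (the key
layout, sequence numbering and serialization are simplifying assumptions;
a real implementation would also need the leader fencing above plus log
truncation once segments are safely on S3):

import java.util.concurrent.atomic.AtomicLong;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ShardTransactionLog {
  private final S3Client s3 = S3Client.create();
  private final String bucket = "solr-shard-storage"; // assumed bucket
  private final String prefix;                        // <coll>/<shard>/tlog/
  private final AtomicLong seq;                       // leader-assigned

  public ShardTransactionLog(String collection, String shard, long lastSeq) {
    this.prefix = collection + "/" + shard + "/tlog/";
    this.seq = new AtomicLong(lastSeq);
  }

  // Durably write one serialized update batch as its own immutable object.
  // The /update request is acknowledged only once this returns, so no
  // commit per batch is needed and segments can be pushed asynchronously.
  public long append(byte[] serializedBatch) {
    long n = seq.incrementAndGet();
    String key = prefix + String.format("%020d", n); // zero-padded ordering
    s3.putObject(
        PutObjectRequest.builder().bucket(bucket).key(key).build(),
        RequestBody.fromBytes(serializedBatch));
    return n; // entries <= n can be truncated once covered by segments on S3
  }
}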

If there is interest in such a contribution to SolrCloud then we might
carry the next phase of the work upstream (initially in a branch, with the
intention to eventually merge).

If there is no interest, it's easier for us to keep working on our fork,
which allows taking shortcuts and ignoring SolrCloud features we do not
use (for example, replacing the existing transaction log with a
shard-level transaction log rather than maintaining both).

Feedback welcome! I can present the idea in more detail in a future
committer/contributor meeting if there's interest.

Thanks,
Ilan
