[
https://issues.apache.org/jira/browse/SOLR-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley resolved SOLR-15051.
---------------------------------
Resolution: Won't Fix
Maybe Bruno could put it on his fork, as he crafted all or nearly all of what
exists.
This is debatable, but I'm closing this now as won't-fix. The title of the
issue contains "de-duping", and the proposed way to achieve this specifically
(a) is rather complicated, (b) hasn't been started, and moreover (c) is an
approach I no longer want to put any energy into because I think I'd rather go
about things differently. I instead favor an approach where only the leader
writes to remote storage, in one place for the shard rather than per replica,
so there is no duplication concern. That would be some other issue(s),
perhaps SOLR-6237.
I really like what we've developed here so far; it doesn't have or _need_
de-duping to be useful. Its backend, BackupRepository, is an abstraction Solr
already has with a growing number of implementations. We could deliver just
this by itself, and do so under a new JIRA issue. I'm not sure of known
bugs/gaps; ideally we'd have time to push it past a finish line and compare its
performance to HdfsDirectory, which I'm very curious about.
> Shared storage -- BlobDirectory (de-duping)
> -------------------------------------------
>
> Key: SOLR-15051
> URL: https://issues.apache.org/jira/browse/SOLR-15051
> Project: Solr
> Issue Type: Improvement
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> This proposal is a way to accomplish shared storage in SolrCloud with a few
> key characteristics: (A) uses a Directory implementation, (B) delegates to a
> backing local file Directory as a kind of read/write cache, (C) replicas have
> their own "space", (D) de-duplication across replicas via reference
> counting, (E) uses ZK but separately from SolrCloud stuff.
> The Directory abstraction is a good one, and helps isolate shared storage
> from the rest of SolrCloud that doesn't care. Using a backing normal file
> Directory is faster for reads and is simpler than the BlockCache of Solr's
> HdfsDirectory. Replicas having their own space solves the problem of multiple
> writers (e.g. of the same shard) trying to own and write to the same space,
> and it implies that any of Solr's replica types can be used along with what
> goes along with them like peer-to-peer replication (sometimes faster/cheaper
> than pulling from shared storage). A de-duplication feature solves needless
> duplication of files across replicas and from parent shards (i.e. from shard
> splitting). The de-duplication feature requires a place to cache directory
> listings so that they can be shared across replicas and atomically updated;
> this is handled via ZooKeeper. Finally, some sort of Solr daemon /
> auto-scaling code should be added to implement "autoAddReplicas", especially
> to provide for a scenario where the leader is gone and can't be replicated
> from directly but we can access shared storage.
> For more about shared storage concepts, consider looking at the description
> in SOLR-13101 and the linked Google Doc.
> *[PROPOSAL
> DOC|https://docs.google.com/document/d/1kjQPK80sLiZJyRjek_Edhokfc5q9S3ISvFRM2_YeL8M/edit?usp=sharing]*
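
The reference counting mentioned in point (D) of the description could be pictured roughly as follows. This is only a minimal in-memory sketch, not code from the issue or from Solr: the class name SharedFileRefCounts, its methods, and the "name:checksum" key format are all hypothetical, and the description says the real listings would be kept in ZooKeeper and updated atomically, which this sketch does not attempt.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of Solr or of the SOLR-15051 work. */
final class SharedFileRefCounts {
  // Key: identifier for one physical file in shared storage
  // (assumed here to be "fileName:checksum").
  private final Map<String, Integer> counts = new HashMap<>();

  /**
   * Records that one more replica references the file.
   * Returns true if this is the first reference, i.e. the caller would need to upload it.
   */
  synchronized boolean acquire(String fileKey) {
    int updated = counts.merge(fileKey, 1, Integer::sum);
    return updated == 1;
  }

  /**
   * Drops one reference.
   * Returns true if no replica references the file any more, so it could be deleted.
   */
  synchronized boolean release(String fileKey) {
    Integer current = counts.get(fileKey);
    if (current == null) {
      throw new IllegalStateException("No references recorded for " + fileKey);
    }
    if (current == 1) {
      counts.remove(fileKey);
      return true;
    }
    counts.put(fileKey, current - 1);
    return false;
  }

  /** Number of replicas currently referencing the file (0 if unknown). */
  synchronized int referenceCount(String fileKey) {
    return counts.getOrDefault(fileKey, 0);
  }
}
```

In this picture, a second replica that holds an identical file only increments the count instead of uploading a duplicate, and the stored file becomes removable once the last replica releases it.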