+1 to share code for doing 1) and 3), both of which are tricky! Safely moving / copying bytes around is a notoriously difficult problem ... but Lucene's "end to end checksums" and per-segment-file GUID make this safer.
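E.g., on the receiving side you can re-verify that checksum footer before trusting the copied bytes. A rough sketch (the helper name and the Directory you open are just placeholders):

    import java.io.IOException;
    import org.apache.lucene.codecs.CodecUtil;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    // After copying/downloading a file, verify Lucene's end-to-end checksum
    // footer before trusting the bytes; this throws CorruptIndexException if
    // the stored checksum does not match the file contents.
    static void verifyCopiedFile(Directory dir, String fileName) throws IOException {
      try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
        CodecUtil.checksumEntireFile(in);
      }
    }
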
I think Lucene's replicator module is a good place for this?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 3, 2019 at 4:15 PM Michael Froh <msf...@gmail.com> wrote:

> Hi there,
>
> I was talking with Varun at Berlin Buzzwords a couple of weeks ago about
> storing and retrieving Lucene indexes in S3, and realized that "uploading a
> Lucene directory to the cloud and downloading it on other machines" is a
> pretty common problem and one that's surprisingly easy to do poorly. In my
> current job, I'm on my third team that needed to do this.
>
> In my experience, there are three main pieces that need to be implemented:
>
> 1. Uploading/downloading individual files (i.e. the blob store), which can
> be eventually consistent if you write once.
> 2. Describing the metadata for a specific commit point (basically what the
> Replicator module does with the "Revision" class). In particular, we want a
> downloader to reliably be able to know if they already have specific files
> (and don't need to download them again).
> 3. Sharing metadata with some degree of consistency, so that multiple
> writers don't clobber each other's metadata, and so readers can discover
> the metadata for the latest commit/revision and trust that they'll
> (eventually) be able to download the relevant files.
>
> I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but
> I'd like to do it with interfaces that lend themselves to other
> implementations for blob and metadata storage.
>
> Is it worth opening a Jira issue for this? Is this something that would
> benefit the Lucene community?
>
> Thanks,
> Michael Froh
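
Re: 2. above, a quick sketch of how the replicator's IndexRevision already captures that per-commit, per-file metadata. This assumes the IndexWriter is configured with a SnapshotDeletionPolicy; the method name below is just for illustration:

    import java.io.IOException;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.replicator.IndexRevision;
    import org.apache.lucene.replicator.Revision;
    import org.apache.lucene.replicator.RevisionFile;

    // IndexRevision requires the writer to use a SnapshotDeletionPolicy;
    // creating the revision snapshots the latest commit so its files can't
    // be deleted while they are being uploaded.
    static void describeCurrentCommit(IndexWriter writer) throws IOException {
      Revision rev = new IndexRevision(writer);
      try {
        // One entry per source (a single index here); each RevisionFile
        // carries the file name and length, which is what a downloader
        // needs to decide whether it already has that file.
        for (Map.Entry<String, List<RevisionFile>> source : rev.getSourceFiles().entrySet()) {
          for (RevisionFile file : source.getValue()) {
            System.out.println(rev.getVersion() + " " + source.getKey()
                + " " + file.fileName + " " + file.size);
          }
        }
      } finally {
        rev.release(); // releases the snapshot once the metadata/upload is done
      }
    }

Since Lucene index files are write-once, a downloader can compare those names and lengths (plus the checksum footers) against what it already has locally and skip files it doesn't need to fetch again.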