Hi there,

I was talking with Varun at Berlin Buzzwords a couple of weeks ago about storing and retrieving Lucene indexes in S3, and realized that "uploading a Lucene directory to the cloud and downloading it on other machines" is a pretty common problem, and one that's surprisingly easy to do poorly. In my current job, I'm on my third team that has needed to do this.
In my experience, there are three main pieces that need to be implemented:

1. Uploading/downloading individual files (i.e. the blob store), which can be eventually consistent if you write once.
2. Describing the metadata for a specific commit point (basically what the Replicator module does with the "Revision" class). In particular, we want a downloader to reliably be able to tell whether it already has specific files (and doesn't need to download them again).
3. Sharing metadata with some degree of consistency, so that multiple writers don't clobber each other's metadata, and so readers can discover the metadata for the latest commit/revision and trust that they'll (eventually) be able to download the relevant files.

I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but I'd like to do it with interfaces that lend themselves to other implementations for blob and metadata storage (there's a rough sketch of what I mean at the end of this mail).

Is it worth opening a Jira issue for this? Is this something that would benefit the Lucene community?

Thanks,
Michael Froh
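P.S. For concreteness, here's a minimal sketch of the kind of interfaces I have in mind, assuming a write-once blob store and a metadata store with some conditional-write primitive. All of the names here (BlobStore, MetadataStore, CommitMetadata, FileMetadata) are placeholders for illustration, not existing Lucene or AWS APIs:

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Optional;

/** Piece 1: write-once blob storage. Could be backed by S3, a filesystem, etc. */
interface BlobStore {
  /** Uploads a blob under the given name; names are written once and never overwritten. */
  void writeBlob(String name, InputStream data, long length) throws IOException;

  /** Opens a previously written blob for reading. */
  InputStream readBlob(String name) throws IOException;
}

/** Piece 3: metadata storage with enough consistency that writers can't clobber each other. */
interface MetadataStore {
  /**
   * Publishes metadata for a new commit point, conditioned on the expected current
   * generation (a compare-and-swap). Returns false if another writer got there first.
   * A DynamoDB implementation could use a conditional put; other backends would use
   * whatever conditional-write primitive they offer.
   */
  boolean publishCommit(long expectedGeneration, CommitMetadata metadata) throws IOException;

  /** Returns the metadata for the latest published commit/revision, if there is one. */
  Optional<CommitMetadata> latestCommit() throws IOException;
}

/**
 * Piece 2: describes one commit point -- the set of file names plus lengths/checksums,
 * so a downloader can reliably tell which files it already has.
 */
record CommitMetadata(long generation, Map<String, FileMetadata> files) {}

record FileMetadata(long length, String checksum) {}

With something along these lines, a writer uploads any new files through BlobStore and then publishes a CommitMetadata through MetadataStore; a reader polls latestCommit() and downloads only the files whose name/length/checksum it doesn't already have locally.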