+1 to share code for doing 1) and 3), both of which are tricky!

Safely moving / copying bytes around is a notoriously difficult problem ...
but Lucene's "end-to-end checksums" and per-segment-file GUID make this
safer.
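
E.g., once a file has been copied you can re-read it against the checksum
Lucene wrote into its footer. A rough sketch (the class/method names here are
just illustrative, not an existing API):

import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

final class CopyVerifier {
  /** Re-reads the whole file; CodecUtil throws CorruptIndexException if the
   *  footer checksum doesn't match the bytes we actually copied. */
  static void verify(Directory dir, String fileName) throws IOException {
    try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
      CodecUtil.checksumEntireFile(in);
    }
  }
}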

I think Lucene's replicator module is a good place for this?
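
To make it concrete, for 1) and 3) I could imagine something roughly like the
strawman below (every name in it is made up); the existing Revision abstraction
already covers 2), and an S3/DynamoDB pair would just be one implementation:

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

/** 1) Write-once blob storage, e.g. backed by S3; eventual consistency is OK
 *  because a given blob name is only ever written once. */
interface BlobStore {
  void writeBlob(String blobName, InputStream content, long length) throws IOException;
  InputStream readBlob(String blobName) throws IOException;
}

/** 3) Small, consistent pointer to the latest published revision, e.g. backed
 *  by DynamoDB with a conditional put so concurrent writers can't clobber
 *  each other. */
interface RevisionRegistry {
  /** Publishes a revision (mapping index file names to blob names) only if
   *  expectedVersion is still the latest; returns false otherwise. */
  boolean publishRevision(String revisionId, Map<String, String> fileToBlobName,
                          long expectedVersion) throws IOException;
  /** Returns the id of the latest published revision, or null if none. */
  String currentRevisionId() throws IOException;
}

A downloader could then read the registry, fetch the revision's file map, and
pull only the blobs it doesn't already have locally.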

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jul 3, 2019 at 4:15 PM Michael Froh <msf...@gmail.com> wrote:

> Hi there,
>
> I was talking with Varun at Berlin Buzzwords a couple of weeks ago about
> storing and retrieving Lucene indexes in S3, and realized that "uploading a
> Lucene directory to the cloud and downloading it on other machines" is a
> pretty common problem and one that's surprisingly easy to do poorly. In my
> current job, I'm on my third team that needed to do this.
>
> In my experience, there are three main pieces that need to be implemented:
>
> 1. Uploading/downloading individual files (i.e. the blob store), which can
> be eventually consistent if you write once.
> 2. Describing the metadata for a specific commit point (basically what the
> Replicator module does with the "Revision" class). In particular, we want a
> downloader to reliably know whether it already has specific files
> (and doesn't need to download them again).
> 3. Sharing metadata with some degree of consistency, so that multiple
> writers don't clobber each other's metadata, and so readers can discover
> the metadata for the latest commit/revision and trust that they'll
> (eventually) be able to download the relevant files.
>
> I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but
> with interfaces that lend themselves to other implementations for blob and
> metadata storage.
>
> Is it worth opening a Jira issue for this? Is this something that would
> benefit the Lucene community?
>
> Thanks,
> Michael Froh
>
