On Sun, Mar 2, 2025 at 7:21 AM Marc Davenport <madavenp...@cargurus.com.invalid> wrote:
>
> Thank you for the great replies everyone!
> I'm going to be mulling this over for a bit.
>
> @Steven - So in your system it sounds like you still transferred bits directly between the primary and the replicas? If you don't mind me asking, how many replicas did you have?
We have a pretty small installation - only 3 replicas. We figured we'd start there and scale up as needed, but honestly we've found that serving our data scales very well even from a small number of machines, since it all fits in resident memory. We have relatively few and relatively high-"value" queries, so YMMV.

> @Michael - That second simpler architecture is very similar to what we are considering, with the exception of a queue for announcing new segments rather than a polling process. It is good to know that it's a reasonable outline. You were very latency sensitive. Is there anything you can share around the most important specs of your containers, pods, or even nodes? While we run on these EC2 instances, there is a push to get us into k8s. Did you have any issues as you migrated from one version of Lucene to another? I'm concerned that our current deployments only allow one version of the software in production at any one time.
>
> @Sarthak - I see the term pre-copy all over LuceneServer & nrtSearch but I haven't been able to distinguish the term from just "copy". Does the "pre" simply refer to the fact that the transfer of bits is happening before the replica starts to serve queries from that segment? I feel like I've been missing something when I read that code.
>
> Thank you,
> Marc
>
> On Wed, Feb 26, 2025 at 8:21 PM Sarthak Nandi <sarth...@gmail.com> wrote:
> >
> > > I'm still debating if there should be some direct knowledge of the replicas in the primary node. Or if the primary node can just churn away creating base indexes and updates and publish to a queue when it produces a new set of segments. The replicas are then free to pick up the latest index as they spin up and subscribe to changes for it.
> >
> > In nrtsearch, the primary uses replica information to:
> > 1. publish new NRT point
> > 2. pre-copy merged segments
> >
> > It should be possible to let replicas know of new NRT points using an external queue. Pre-copying merged segments can also be done with a queue, but the replicas would need some way to let the primary know that they are done copying the merged files. If the primary doesn't have knowledge of replicas, some third service would have to keep track of the replicas and let the primary know once all replicas are done pre-copying the segments. Alternatively, you can just skip pre-copying.
> >
> > On Wed, Feb 26, 2025 at 4:31 PM Michael Froh <msf...@gmail.com> wrote:
> > >
> > > Hi there,
> > >
> > > I'm happy to share some details about how Amazon Product Search does its segment replication. I haven't worked on Product Search in over three years, so anything that I remember is not particularly novel. Also, it's not really secret sauce -- I would have happily talked about it more in the 2021 re:Invent talk that Mike Sokolov and I did, but we were trying to keep within our time limit. :)
> > >
> > > That model doesn't exactly have direct communication between primary and replica (which is generally a good practice in a cloud-based solution -- the fewer node-to-node dependencies, the better). The flow (if I recall correctly) is driven by a couple of side-car components for the writers and searchers and is roughly like this:
> > >
> > > 1. At a specified (pretty coarse) interval, the writer side-car calls a "create checkpoint" API on the writer to ask it to write a checkpoint.
> > > 2. The writer uploads new segment files to S3, along with a metadata object describing the checkpoint contents (which probably includes segments from an earlier checkpoint, since they can be reused).
> > > 3. The writer returns the S3 URL for the metadata object to its side-car.
> > > 4. The writer side-car publishes the metadata URL to "something" -- see below for details.
> > > 5. The searcher side-cars all read the metadata URL from "something" -- see below for details.
> > > 6. The searcher side-cars each call a "use checkpoint" API on their local searchers.
> > > 7. The searchers each download the new segment files from S3 and open new IndexSearchers.
> > >
> > > For the details of steps 4 and 5, I don't actually remember how it worked, but I have two pretty good guesses from what I remember of the overall architecture:
> > >
> > > 1. DynamoDB: This is the more likely mechanism. Each index shard has a unique ID which serves as a partition key in DynamoDB, and there's a sequence number as a sort key. The writer side-car inserts a DynamoDB record with the next sequence number and the metadata URL. The searcher side-car periodically fetches 1 record with the partition key by descending sequence number (i.e. get the latest sequence entry for the partition key). If the sequence number has increased, then call the searcher's use-checkpoint API.
> > > 2. Kinesis: This feels like the less likely mechanism, but I guess it could work. The writer side-car writes the metadata URL to a Kinesis stream. Each searcher side-car reads from the Kinesis stream and passes the metadata URL to the searcher. I'm pretty sure we didn't have one Kinesis stream per index shard, because managing (and paying for) that many Kinesis streams would be a pain. Even with a sharded Kinesis stream, you'd "leak" some checkpoints across index shards, leading to data that the searcher side-cars would throw away. Also, each Kinesis stream shard has a limited number of concurrent consumers, which would mean that the number of search replicas would be limited. I'm *pretty sure* we used the DynamoDB approach.
> > >
> > > Another Lucene-based search system that I worked on many years ago at Amazon had a much simpler architecture:
> > >
> > > 1. Writer periodically writes new segments to S3.
> > > 2. After writing the new segments, the writer writes a metadata object to S3 with a path like "/<writer guid>/<index id>/<shard id>/<sequence number>/metadata.json". Because the writer was guaranteed to be the *only* thing writing with prefixes of <writer guid>, it could manage its own dense sequence numbers.
> > > 3. A searcher is "sticky" to a writer, and periodically issues an S3 GetObject for the next metadata object's full URL (i.e. the URL using the next dense sequence number). Until the next checkpoint is written, it gets a 404 response.
> > > 4. Searcher fetches the files referenced by the metadata file.
> > >
> > > A nice thing about that approach was that it only depended on S3 and only used PutObject and GetObject APIs, which tend to be more consistent. The downside was that we needed a separate mechanism for writer discovery and failover, to let searchers know the correct writer prefix.
> > >
> > > Hope that helps! Let me know if you need any other suggestions.
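
For anyone following along, the searcher side of that "simpler" S3-only flow is easy to sketch. This is just an illustration of the polling idea, not anyone's actual code - the class name, bucket, and key layout are placeholders, and it assumes the AWS SDK for Java v2:

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;

/** Searcher-side poll for the next checkpoint's metadata object (sketch only). */
public final class CheckpointPoller {
    private final S3Client s3 = S3Client.create();
    private final String bucket;        // placeholder bucket name
    private final String writerPrefix;  // "<writer guid>/<index id>/<shard id>"
    private long nextSeq;               // next dense sequence number we expect

    public CheckpointPoller(String bucket, String writerPrefix, long nextSeq) {
        this.bucket = bucket;
        this.writerPrefix = writerPrefix;
        this.nextSeq = nextSeq;
    }

    /** Returns the metadata JSON bytes if the next checkpoint exists, or null if not published yet. */
    public byte[] pollOnce() {
        String key = writerPrefix + "/" + nextSeq + "/metadata.json";
        try {
            ResponseBytes<GetObjectResponse> bytes =
                s3.getObjectAsBytes(GetObjectRequest.builder().bucket(bucket).key(key).build());
            nextSeq++;                  // advance only after we actually saw the object
            return bytes.asByteArray(); // caller parses this and fetches the referenced segment files
        } catch (NoSuchKeyException e) {
            return null;                // 404: the writer hasn't published this checkpoint yet
        }
    }
}

The nice property Froh mentions falls out of this: the searcher only ever issues GetObject, and a missing key simply means the writer hasn't published that checkpoint yet.
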
> > > Thanks,
> > > Froh
> > >
> > > On Wed, Feb 26, 2025 at 3:31 PM Steven Schlansker <stevenschlans...@gmail.com> wrote:
> > > >
> > > > On Feb 26, 2025, at 2:53 PM, Marc Davenport <madavenp...@cargurus.com.INVALID> wrote:
> > > > >
> > > > > Hello,
> > > > > Our current search solution is a pretty big monolith running on pretty beefy EC2 instances. Every node is responsible for indexing and serving queries. We want to start decomposing our service and are starting with separating the indexing and query handling responsibilities.
> > > >
> > > > We run what is probably a comparatively small but otherwise similar installation, using Google Kubernetes instances. We just use a persistent disk instead of an elastic store, but would also consider using something like S3 in the future.
> > > >
> > > > > I'm in the research phase now, trying to collect any prior art I can. The rough sketch is to implement the two NRT replication node classes on their respective services and use S3 as a distribution point for the segment files. I'm still debating if there should be some direct knowledge of the replicas in the primary node.
> > > >
> > > > We tried to avoid the primary keeping any durable state about replicas, as replicas tend to disappear for any or no reason in a cloud environment. A specific example issue we ran into: we disable the 'waitForAllRemotesToClose' step entirely: https://github.com/apache/lucene/pull/11822
> > > >
> > > > > Or if the primary node can just churn away creating base indexes and updates and publish to a queue when it produces a new set of segments. The replicas are then free to pick up the latest index as they spin up and subscribe to changes for it. It seems like having the indexer also be responsible for communicating with the replicas would be double duty for that system. I'd love to hear other experiences if people can share them or point to writings about them they read when designing their systems.
> > > >
> > > > In our case, the primary keeps in memory the latest CopyState that any replica should take. Replicas call an HTTP API in an infinite loop, passing in their current version and asking if any newer version is available. If the replica is behind, the primary gives it the current CopyState NRT point immediately. If the replica is caught up, we hold the request in a long-poll for just short of our HTTP timeout, waiting for a new version to become available, and otherwise return "no update for now, try again".
> > > >
> > > > Once the replica receives an updated CopyState, it feeds it into the ReplicaNode with newNRTPoint, which starts the file copy.
> > > >
> > > > There's a bit of magic we devised internally around retaining the CopyState reference count, as this is stateful and the HTTP replication is mostly stateless. We keep this simple by having only a single primary node at any given time.
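
To make that long-poll shape a bit more concrete, the replica side boils down to something like the sketch below. This is a simplified illustration rather than our exact code - the endpoint path and the "X-NRT-Version" header are made-up placeholders, and how you serialize the CopyState and hand it to ReplicaNode.newNRTPoint depends on your Lucene version and transport:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Replica-side long-poll loop (sketch only; endpoint and header names are placeholders). */
public final class NrtPointPoller implements Runnable {
    private final HttpClient client = HttpClient.newHttpClient();
    private final String primaryUrl;     // e.g. "http://primary:8080/nrt/poll" (placeholder)
    private volatile long currentVersion;

    public NrtPointPoller(String primaryUrl, long startVersion) {
        this.primaryUrl = primaryUrl;
        this.currentVersion = startVersion;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(primaryUrl + "?version=" + currentVersion))
                    .timeout(Duration.ofSeconds(30)) // primary holds the request just under this
                    .GET()
                    .build();
                HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
                if (response.statusCode() == 200) {
                    byte[] copyStateBytes = response.body(); // serialized CopyState for the newer NRT point
                    // Deserialize copyStateBytes and feed it to ReplicaNode.newNRTPoint(...) here;
                    // the exact call and wire format depend on your Lucene version.
                    currentVersion = Long.parseLong(
                        response.headers().firstValue("X-NRT-Version").orElseThrow());
                }
                // Any other status ("no update for now, try again") just falls through and re-polls.
            } catch (Exception e) {
                // transient failure: back off briefly and retry
                try { Thread.sleep(1_000); } catch (InterruptedException ie) { Thread.currentThread().interrupt(); }
            }
        }
    }
}

The primary's side is then just: answer immediately with the newest CopyState if the caller's version is stale, otherwise park the request until a new NRT point appears or the long-poll window runs out.
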
> > > > Your idea of using a queue instead is interesting, but not something we extensively looked at :)
> > > >
> > > > > I've looked at nrtsearch from Yelp and they seem to let the primary node have direct knowledge of the replicas. That makes sense since it is based on McCandless's LuceneServer.
> > > > >
> > > > > I know that Amazon internally uses Lucene and has indexing separated from query nodes, and that they re-index and publish completely new indexes with every release to prod. I've been watching what I can of the great videos by Sokolov, McCandless, Froh, etc. But they don't show much behind the curtain (understandably) about the details of keeping things in sync. If someone does know of a publicly available video or resource that describes this, I'd love to see it.
> > > >
> > > > Unfortunately, our journey with the LuceneServer was basically informed by the code and blog posts you've likely already seen - and a bit of help from this mailing list - otherwise, there's not a ton of information out there or evidence of widespread use. That said, we've been very happy with our setup, despite needing to put in a fair amount of elbow grease and attention to low-level details to get things working for us.