Hello,
Our current search solution is a large monolith running on fairly beefy
EC2 instances.  Every node is responsible for both indexing and serving
queries.  We want to start decomposing the service, beginning with
separating the indexing and query-handling responsibilities.

I'm in the research phase now, trying to collect any prior art I can.  The
rough sketch is to implement the two NRT replication node classes in their
respective services and use S3 as a distribution point for the segment
files.  I'm still debating whether the primary node should have direct
knowledge of the replicas, or whether it can just churn away creating base
indexes and updates, publishing to a queue each time it produces a new set
of segments.  The replicas would then be free to pick up the latest index
as they spin up and subscribe to changes after that.  It seems like making
the indexer also responsible for communicating with the replicas would be
double duty for that system.  I'd love to hear other people's experiences,
or pointers to writings they read when designing their systems.
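
To make the second option concrete, here's roughly what I'm picturing on
the primary side (Java).  SegmentManifest, SegmentStore, ManifestQueue and
the S3 key layout are all names I made up for this sketch, not any existing
API; only the Lucene calls are real.

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Placeholder for "the primary published a new point-in-time index".
    // generation maps to Lucene's segments_N generation; files are the
    // segment files (plus segments_N) uploaded under s3Prefix.
    record SegmentManifest(long generation, String s3Prefix, List<String> files) {}

    // Hypothetical seams for S3 and the queue -- swap in the real SDK calls.
    interface SegmentStore { void upload(String key, Directory dir, String fileName) throws IOException; }
    interface ManifestQueue { void publish(SegmentManifest manifest); }

    final class PrimaryPublisher {
        private final IndexWriter writer;
        private final Directory dir;
        private final SegmentStore store;
        private final ManifestQueue queue;
        private final String bucketPrefix;

        PrimaryPublisher(IndexWriter writer, Directory dir, SegmentStore store,
                         ManifestQueue queue, String bucketPrefix) {
            this.writer = writer;
            this.dir = dir;
            this.store = store;
            this.queue = queue;
            this.bucketPrefix = bucketPrefix;
        }

        // Called on whatever cadence the indexer commits.  The primary never
        // talks to a replica directly; it only uploads files and drops a
        // manifest on the queue.  (A real version would pin the commit with
        // SnapshotDeletionPolicy so merges can't delete files mid-upload.)
        void commitAndPublish() throws IOException {
            writer.commit();
            SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
            String prefix = bucketPrefix + "/gen-" + infos.getGeneration();
            List<String> uploaded = new ArrayList<>();
            for (String file : infos.files(true)) {  // true = include segments_N itself
                store.upload(prefix + "/" + file, dir, file);
                uploaded.add(file);
            }
            queue.publish(new SegmentManifest(infos.getGeneration(), prefix, uploaded));
        }
    }

The replica side would just be the inverse: consume the manifest, copy any
files it doesn't already have from S3 into a local Directory, and refresh
its SearcherManager.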

I've looked at nrtsearch from Yelp, and it seems to give the primary node
direct knowledge of the replicas.  That makes sense, since it is based on
McCandless's LuceneServer.
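
For contrast, the "primary knows its replicas" shape I'm trying to avoid
would look roughly like the sketch below.  Again, these interfaces are
placeholders of mine, not nrtsearch's or LuceneServer's actual API.

    import java.io.IOException;
    import java.util.List;

    // Hypothetical push model: the primary holds a replica registry and
    // notifies each one after every publish, so it now owns membership,
    // retries, and failure handling on top of indexing.
    interface ReplicaClient { void notifyNewGeneration(long generation, String s3Prefix) throws IOException; }

    final class PushingPrimary {
        private final List<ReplicaClient> replicas;  // primary must track these

        PushingPrimary(List<ReplicaClient> replicas) { this.replicas = replicas; }

        void onNewGeneration(long generation, String s3Prefix) {
            for (ReplicaClient replica : replicas) {
                try {
                    replica.notifyNewGeneration(generation, s3Prefix);
                } catch (IOException e) {
                    // The "double duty" I'm worried about: the indexer now has
                    // to decide what to do when a replica is down or slow.
                }
            }
        }
    }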

I know that Amazon internally uses Lucene with indexing separated from
query nodes, and that they re-index and publish completely new indexes with
every release to prod.  I've been watching what I can find of the great
talks by Sokolov, McCandless, Froh, et al., but they (understandably) don't
show much behind the curtain about the details of keeping things in sync.
If someone knows of a publicly available video or resource that describes
this, I'd love to see it.


Thank you and looking forward to whatever resources or thoughts you have,
Marc
