Hello,

Our current search solution is a large monolith running on fairly beefy EC2 instances; every node is responsible for both indexing and serving queries. We want to decompose the service, starting by separating the indexing and query-handling responsibilities.
I'm in the research phase now, trying to collect whatever prior art I can. The rough sketch is to implement the two NRT replication node classes on their respective services and use S3 as the distribution point for the segment files. I'm still debating whether the primary node should have direct knowledge of the replicas, or whether it can just churn away building base indexes and updates and publish to a queue whenever it produces a new set of segments. The replicas would then be free to pick up the latest index as they spin up and subscribe to changes from there (see the rough sketch at the end of this message). Making the indexer also responsible for communicating with the replicas feels like double duty for that system.

I'd love to hear other people's experiences, or pointers to write-ups they read when designing their systems. I've looked at nrtsearch from Yelp, and they seem to give the primary node direct knowledge of the replicas, which makes sense since it is based on McCandless's LuceneServer. I know that Amazon internally uses Lucene with indexing separated from query nodes, and that they re-index and publish completely new indexes with every release to prod. I've been watching what I can find of the great videos from Sokolov, McCandless, Froh, etc., but understandably they don't show much behind the curtain about the details of keeping things in sync. If someone knows of a publicly available video or other resource that describes this, I'd love to see it.

Thank you, and looking forward to whatever resources or thoughts you have,
Marc
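
P.S. To ground the question, here is a rough sketch (in Java) of the decoupled flow I'm imagining, where the primary never talks to replicas directly. SegmentStore, RefreshQueue, and RefreshEvent are hypothetical stand-ins for S3 and an SQS/Kafka-style queue, not real Lucene or AWS APIs, and the replica would still open real Lucene readers over the copied files.

    import java.util.List;

    /** Hypothetical abstraction over the S3 bucket holding the copied segment files. */
    interface SegmentStore {
        void upload(String indexName, long generation, List<String> segmentFiles);
        List<String> download(String indexName, long generation);
    }

    /** Hypothetical abstraction over the queue the primary publishes refresh events to. */
    interface RefreshQueue {
        void publish(RefreshEvent event);
        RefreshEvent poll() throws InterruptedException;  // blocks until the next event, or null on timeout
    }

    /** What the primary announces after each successful refresh: a point-in-time file set. */
    record RefreshEvent(String indexName, long generation, List<String> segmentFiles) {}

    /** Indexer side: writes segments, pushes them to the store, announces them on the queue. */
    class PrimarySide {
        private final SegmentStore store;
        private final RefreshQueue queue;

        PrimarySide(SegmentStore store, RefreshQueue queue) {
            this.store = store;
            this.queue = queue;
        }

        /** Called after a refresh produces a new consistent set of segment files. */
        void publishRefresh(String indexName, long generation, List<String> segmentFiles) {
            store.upload(indexName, generation, segmentFiles);                      // copy bytes first
            queue.publish(new RefreshEvent(indexName, generation, segmentFiles));   // then announce
            // No direct replica communication: replicas discover the new generation on their own.
        }
    }

    /** Query side: bootstraps from the latest published generation, then follows the queue. */
    class ReplicaSide {
        private final SegmentStore store;
        private final RefreshQueue queue;
        private volatile long currentGeneration = -1;

        ReplicaSide(SegmentStore store, RefreshQueue queue) {
            this.store = store;
            this.queue = queue;
        }

        void followChanges() throws InterruptedException {
            while (!Thread.currentThread().isInterrupted()) {
                RefreshEvent event = queue.poll();
                if (event != null && event.generation() > currentGeneration) {
                    List<String> files = store.download(event.indexName(), event.generation());
                    // Here the replica would open a new DirectoryReader/SearcherManager over `files`.
                    currentGeneration = event.generation();
                }
            }
        }
    }

The part I'm least sure about is everything this sketch glosses over: deleting stale generations, replicas that fall behind, and whether losing the primary's direct knowledge of replica state (as nrtsearch keeps it) costs more than the decoupling is worth.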