Hey all,

I'm a longtime Solr user/developer but only recently joined the Solr dev mailing list, so it is a pleasure to interact with you all.

We at FullStory, working with Ishan and Noble, sponsored the Per Replica State implementation and are currently using it. We run large clusters in production with a high number of collections/cores and have historically faced challenges handling updates to state.json during events like node restarts at that scale. The size of the state.json file and the need to coordinate all operations through the overseer did not work well with many collections across many nodes, which led us to develop the Per Replica State model.

There are definite improvements that can be made in the PRS code. We've actually made quite a few improvements on our Solr fork this year that we still need to upstream. We've been loosely following the Distributed State Update concept but haven't spent much time understanding its pros and cons versus PRS.

We'd definitely be interested in working with the community to share more about PRS and understand other efforts, with the goal of pushing towards a more streamlined implementation. I'm not sure how the community has handled this in the past. If there is a small group we want to put together for some synchronous discussions, we could present on PRS and have a representative share about the Distributed State Update concept. If we want something more async, I can work with the team at FullStory to write up more detail on PRS to share with the community and start to build some buy-in.

I agree that distributed state is complicated as is, so working towards cleaner code here is important. I'm interested to hear how we can help move this forward.

Justin

On Wed, Sep 21, 2022 at 11:44 AM Houston Putman <hous...@apache.org> wrote:

> Hey everyone,
>
> We've seen some interesting developments over the last 2 years in the way
> that Solr state and distributed logic is handled. Notably we've seen the
> introduction of PerReplicaStates (PRS) and the Distributed State Updates
> (no overseer).
>
> I think for the health of our code and future maintainability, we should
> really look to decide on what implementations we want to use for State
> management and Distributed operations. Basically do we want to adopt or
> abandon PRS/Distributed State Updates. Note that these are separate
> concepts, so the decision on each will be separate.
>
> I bring this up because I see PRS a lot through the code and it feels like
> the code is too separate from the original way of managing state. There is
> a lot of "if (prsEnabled)" logic throughout the core, and it's very hard to
> understand how PRS actually works with this logic spread all over the
> place. If we want to move forward with PRS, then we hopefully would be able
> to consolidate the logic.
>
> I don't see the Distributed State Update logic nearly as much, but I
> imagine our code can only get cleaner with one implementation versus two.
>
> This is just my opinion, let me know what y'all think about making
> decisions or going forward with the status quo.
>
> - Houston