Things are also counterintuitive: the more you fix and the faster things work, the more things fail. It's like rings of hell.
Mark

On Sat, Nov 2, 2019 at 10:29 PM Mark Miller <markrmil...@gmail.com> wrote:

> And it didn't get any easier. What I did about it is kill myself multiple
> times over two years, for weeks on end, torturing my wife. And I found a
> million problems, a million bugs, a million terrible inefficiencies. And I
> fixed and lost countless of them, friggen twice. And lost tons of the
> work as well. And so it's not easy to get out of this. It's not easy at
> all. And I haven't even done the hard part yet.
>
> - Mark
>
> On Sat, Nov 2, 2019 at 10:24 PM Mark Miller <markrmil...@gmail.com> wrote:
>
>> I mean, the reality is - why do we not have just a single watcher per node
>> pulling in state? Why are we not tracking and minimizing state transfers
>> and changes? Why are we not measuring the time it takes to round trip a
>> state.json and adjusting? Looking at load to adjust overseer-ish duties
>> and leader election? A million other smart things?
>>
>> Because it's too hard. It's too hard and we all gave up long ago on
>> figuring out what to do about it. Because we are programming in assembly
>> in an abyss when we should be doing Java in the clouds.
>>
>> Everyone knows the SolrCloud DNA one way or another. We all somehow made
>> our peace with it, or not.
>>
>> It's easy when you don't go deep. Hell, that's easy to forget even if you
>> do.
>>
>> But I'm looping on it now, have to eject.
>>
>> - Mark
>>
>> On Sat, Nov 2, 2019 at 10:15 PM Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Not much. Something you can understand. How about tests < 10 seconds,
>>> fail or not? Good logging, and as a backup, good debug logging. Docs on
>>> how things are designed to work? Tracking of all important operations and
>>> how long they take, with tight cutoffs? Proper response to interruption
>>> 100% of the time? The idea of a cluster start and stop? Of a cluster
>>> install to ZK initially? Drop all legacyCloud support, stateformat=1
>>> support, maybe a few other things.
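The "single watcher per node pulling in state" idea above can be sketched roughly as follows. This is a hypothetical illustration, not existing Solr code: the ZooKeeper client is stubbed out entirely, and all names (`NodeStateCache`, `addListener`, `onStateChanged`) are invented. The point is only the shape - many consumers share one cache, and one watch per path feeds all of them, instead of each consumer setting its own watch on state.json.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/**
 * Hypothetical sketch of "a single watcher per node pulling in state":
 * consumers register with one shared per-node cache, and a single watch per
 * path fans changes out to all of them. The actual ZK calls are elided.
 */
public class NodeStateCache {
    private final Map<String, List<Consumer<byte[]>>> listeners = new ConcurrentHashMap<>();
    final AtomicInteger watchesSet = new AtomicInteger(); // exposed for the demo below

    /** Register interest in a path; only the first registration sets a watch. */
    public void addListener(String path, Consumer<byte[]> listener) {
        listeners.computeIfAbsent(path, p -> {
            watchesSet.incrementAndGet(); // real code: zk.getData(p, watcher, ...)
            return new CopyOnWriteArrayList<>();
        }).add(listener);
    }

    /** Invoked by the single watch when the node's data changes; fans out. */
    public void onStateChanged(String path, byte[] newData) {
        listeners.getOrDefault(path, List.of()).forEach(l -> l.accept(newData));
    }
}
```

However many consumers register for the same state.json path, only one watch ever gets set - which is exactly the state-transfer minimization the email is asking about.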
>>>
>>> I've got some stuff. I'm gonna pull it out as fast as I sensibly can,
>>> given many setbacks and too little sleep for a long time.
>>>
>>> I'm not here to do all of the lift for everyone, but unless I get sick
>>> in the next week or two, or my 10 backup methods and git pushes and
>>> backup branches fail, or I just burn the hell out, I have a solid refuge
>>> that we can knock out and then build on with confidence.
>>>
>>> - Mark
>>>
>>> On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dragonsi...@gmail.com> wrote:
>>>
>>>> Very much agreed. I've been trying to figure out for a long time what
>>>> is the point of having a replica DOWN state that has to be toggled (DOWN
>>>> and then UP!) every time a node restarts, considering that we could just
>>>> combine ACTIVE and `live_nodes` to understand whether a replica is
>>>> available. It's not even foolproof, since kill -9 on a Solr node won't
>>>> mark all the replicas DOWN -- that doesn't happen until the node comes
>>>> back up (perversely).
>>>>
>>>> What would it take to get to a state where restarting a node would
>>>> require a minimal amount of ZK work in most cases?
>>>>
>>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>>> Give me a short bit to follow up and I will lay out my case and
>>>>> proposal.
>>>>>
>>>>> Everyone is then free to decide that we need to do something drastic,
>>>>> or that I'm wrong and we should just continue down the same road. If
>>>>> that's the case, a lot of your work will get a lot easier and less
>>>>> impeded by me, and we will still all be happier. Win win.
>>>>>
>>>>> If we can just not make drastic changes for just a brief week-or-so
>>>>> window, I'll say what I have to say, and you guys can judge and do
>>>>> whatever you please.
>>>>>
>>>>> - mark
>>>>>
>>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <markrmil...@gmail.com> wrote:
>>>>>
>>>>>> Hey all Solr devs,
>>>>>>
>>>>>> SolrCloud is sick right now.
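Scott's point - that a replica's recorded state alone can lie (kill -9 leaves it ACTIVE in state.json while the ephemeral `live_nodes` entry vanishes), so availability should be the intersection of the two - can be sketched like this. The types are simplified stand-ins for illustration, not the real SolrJ API:

```java
import java.util.Set;

/**
 * Sketch of "combine ACTIVE and live_nodes": a replica is available only if
 * its recorded state says ACTIVE *and* its host node is actually live.
 * Relying on either signal alone gives wrong answers after a hard kill.
 */
public class ReplicaAvailability {
    enum State { ACTIVE, DOWN, RECOVERING }

    static boolean isAvailable(State state, String nodeName, Set<String> liveNodes) {
        return state == State.ACTIVE && liveNodes.contains(nodeName);
    }
}
```

Under this model the DOWN/UP toggle on every restart buys nothing: the ephemeral `live_nodes` entry already disappears the moment the node's ZK session dies, which is precisely the signal a hard kill can't fake.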
>>>>>> The way low-level ZooKeeper is handled -
>>>>>> the Overseer - is a mix and a mess of exception handling and super
>>>>>> slow startup and shutdown, adding new things all the time with no
>>>>>> concern for performance or proper ordering (which is harder to tell
>>>>>> than you think).
>>>>>>
>>>>>> Our class dependency graph doesn't even work - we just force it. Sort
>>>>>> of. If the whole system doesn't block and choke its way to a start
>>>>>> slow enough, lots of things fail.
>>>>>>
>>>>>> This thing coughs up: you toss stuff into the storm and, a good chunk
>>>>>> of the time, what you want eventually comes back without causing too
>>>>>> much damage.
>>>>>>
>>>>>> There are so many things that are off or just plain wrong, and the
>>>>>> list is growing and growing. No one is following this, or if you are,
>>>>>> please back me up. This thing will collapse under its own weight.
>>>>>>
>>>>>> So if you want to add yet another state format cluster state or some
>>>>>> other optimization on this junk heap, you can expect me to push back.
>>>>>>
>>>>>> We should all be embarrassed by the state of things.
>>>>>>
>>>>>> I've got some ideas for addressing them that I'll share soon, but god,
>>>>>> don't keep optimizing a turd in non-backcompat, Overseer-loving ways.
>>>>>> That Overseer is an atrocity.
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://about.me/markrmiller

--
- Mark

http://about.me/markrmiller
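The recurring asks in the thread - "tracking of all important operations and how long they take, with tight cutoffs" and "proper response to interruption 100% of the time" - could be made concrete with a wrapper like the one below. All names here are invented for illustration; this is not an existing Solr utility:

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch: wrap each important operation, measure it, and flag
 * it loudly when it blows its time budget instead of letting slowness hide.
 */
public class TimedOp {
    /** Runs op and returns true iff it finished within budgetMs. */
    static boolean withinBudget(String name, long budgetMs, Runnable op) {
        long start = System.nanoTime();
        op.run();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsedMs > budgetMs) {
            // real code: log at WARN and feed a metrics registry
            System.err.println(name + " exceeded budget: " + elapsedMs + "ms > " + budgetMs + "ms");
            return false;
        }
        return true;
    }

    /** "Proper response to interruption 100% of the time": never swallow it. */
    static void interruptibleWait(Object lock, long ms) throws InterruptedException {
        synchronized (lock) {
            lock.wait(ms); // let InterruptedException propagate; don't catch-and-ignore
        }
    }
}
```

The design point is that the budget is declared at the call site, so a test or an overseer task that used to silently take 30 seconds becomes an immediate, attributable failure.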