Things are also counterintuitive. The more you fix and the faster things
work, the more things fail. It's like rings of hell.

Mark

On Sat, Nov 2, 2019 at 10:29 PM Mark Miller <markrmil...@gmail.com> wrote:

> And it didn't get any easier. What I did about it was kill myself multiple
> times over two years, for weeks on end, torturing my wife. And I found a
> million problems, a million bugs, a million terrible inefficiencies. And I
> fixed and lost countless of them, friggin' twice. And lost tons of the
> work as well. And so it's not easy to get out of this. It's not easy at all.
> And I haven't even done the hard part yet.
>
> - Mark
>
> On Sat, Nov 2, 2019 at 10:24 PM Mark Miller <markrmil...@gmail.com> wrote:
>
>> I mean, the reality is: why do we not have just a single watcher per node
>> pulling in state? Why are we not tracking and minimizing state transfers and
>> changes? Why are we not measuring the time it takes to round trip a
>> state.json and adjusting? Looking at load to adjust Overseer-ish duties and
>> leader election? A million other smart things?
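>>
>> For illustration only, here is a minimal sketch of the single-watcher idea
>> against the plain ZooKeeper API (the class name and the exact path are
>> assumptions for the example, not anything that exists today): one watcher
>> per node re-registers itself on every change, pulls state.json, and records
>> how long the round trip takes so something could actually adjust to it.
>>
>> import org.apache.zookeeper.KeeperException;
>> import org.apache.zookeeper.WatchedEvent;
>> import org.apache.zookeeper.Watcher;
>> import org.apache.zookeeper.ZooKeeper;
>> import org.apache.zookeeper.data.Stat;
>>
>> // Hypothetical: one watcher per node that owns all state.json reads.
>> public class SingleStateWatcher implements Watcher {
>>   private final ZooKeeper zk;
>>   private final String statePath; // e.g. /collections/<name>/state.json
>>
>>   public SingleStateWatcher(ZooKeeper zk, String statePath) {
>>     this.zk = zk;
>>     this.statePath = statePath;
>>   }
>>
>>   public void start() throws KeeperException, InterruptedException {
>>     fetchState(); // the initial read also registers the watch
>>   }
>>
>>   @Override
>>   public void process(WatchedEvent event) {
>>     if (event.getType() == Event.EventType.NodeDataChanged) {
>>       try {
>>         fetchState(); // re-register the watch and pull the new state
>>       } catch (InterruptedException e) {
>>         Thread.currentThread().interrupt(); // respond to interruption, always
>>       } catch (KeeperException e) {
>>         // A real node would retry or surface this, not swallow it.
>>       }
>>     }
>>   }
>>
>>   private void fetchState() throws KeeperException, InterruptedException {
>>     long start = System.nanoTime();
>>     Stat stat = new Stat();
>>     byte[] data = zk.getData(statePath, this, stat); // the one watch this node holds
>>     long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>>     // Measure the state.json round trip so the cluster could adjust to it.
>>     System.out.printf("state.json v%d, %d bytes, fetched in %d ms%n",
>>         stat.getVersion(), data == null ? 0 : data.length, elapsedMs);
>>   }
>> }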
>>
>> Because it's too hard. It's too hard, and we all gave up long ago on
>> figuring out what to do about it. Because we are programming in assembly in
>> an abyss when we should be doing Java in the clouds.
>>
>> Everyone knows the SolrCloud DNA one way or another. We all somehow made
>> our peace with it, or not.
>>
>> It's easy when you don't go deep. Hell, that's easy to forget even if you do.
>>
>> But I'm looping on it now; I have to eject.
>>
>> - Mark
>>
>> On Sat, Nov 2, 2019 at 10:15 PM Mark Miller <markrmil...@gmail.com>
>> wrote:
>>
>>> Not much. Something you can understand. How about tests that take under 10
>>> seconds, whether they fail or not. Good logging, and good debug logging as a
>>> backup. Docs on how things are designed to work? Tracking of all important
>>> operations and how long they take, with tight cutoffs? Proper response to
>>> interruption 100% of the time? The idea of a cluster start and stop? Of an
>>> initial cluster install to ZK? Drop all legacyCloud support, stateformat=1
>>> support, maybe a few other things.
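>>>
>>> Just to sketch the "tight cutoffs" and interruption points (the helper name
>>> and numbers are made up for the example, not existing code): every important
>>> operation gets timed, bounded by a hard cutoff, and never swallows an
>>> interrupt.
>>>
>>> import java.util.concurrent.Callable;
>>> import java.util.concurrent.ExecutorService;
>>> import java.util.concurrent.Executors;
>>> import java.util.concurrent.Future;
>>> import java.util.concurrent.TimeUnit;
>>> import java.util.concurrent.TimeoutException;
>>>
>>> // Hypothetical helper: time an operation, enforce a hard cutoff, keep
>>> // interruption intact instead of eating it.
>>> public final class TrackedOp {
>>>   private static final ExecutorService POOL = Executors.newCachedThreadPool();
>>>
>>>   public static <T> T run(String name, long cutoffMs, Callable<T> op) throws Exception {
>>>     long start = System.nanoTime();
>>>     Future<T> future = POOL.submit(op);
>>>     try {
>>>       T result = future.get(cutoffMs, TimeUnit.MILLISECONDS); // tight cutoff
>>>       long tookMs = (System.nanoTime() - start) / 1_000_000;
>>>       System.out.printf("%s took %d ms (cutoff %d ms)%n", name, tookMs, cutoffMs);
>>>       return result;
>>>     } catch (TimeoutException e) {
>>>       future.cancel(true); // interrupt the stuck operation
>>>       throw new RuntimeException(name + " exceeded cutoff of " + cutoffMs + " ms", e);
>>>     } catch (InterruptedException e) {
>>>       future.cancel(true);
>>>       Thread.currentThread().interrupt(); // proper response to interruption, every time
>>>       throw e;
>>>     }
>>>   }
>>> }
>>>
>>> Something like TrackedOp.run("publish state", 2000, () -> publisher.publish())
>>> at each call site, with the cutoffs coming from config rather than literals.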
>>>
>>> I've got some stuff I'm gonna pull out as fast as I sensibly can, given
>>> many setbacks and too little sleep for a long time.
>>>
>>> I'm not here to do all of the lift for everyone, but unless I get
>>> sick in the next week or two, or my 10 backup methods and git pushes and
>>> backup branches fail, or I just burn the hell out, I have a solid refuge
>>> that we can knock out and then build on with confidence.
>>>
>>> - Mark
>>>
>>> On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dragonsi...@gmail.com> wrote:
>>>
>>>> Very much agreed. I've been trying to figure out for a long time what
>>>> the point is of having a replica DOWN state that has to be toggled (DOWN
>>>> and then UP!) every time a node restarts, considering that we could just
>>>> combine ACTIVE and `live_nodes` to understand whether a replica is
>>>> available. It's not even foolproof, since kill -9 on a Solr node won't mark
>>>> all the replicas DOWN -- that doesn't happen until the node comes back up
>>>> (perversely).
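>>>>
>>>> To make that concrete, a sketch of just the check (illustrative names, not
>>>> a proposal for where it would live): availability falls out of the last
>>>> published state plus live_nodes, with no separate DOWN toggle on restart.
>>>>
>>>> import java.util.Set;
>>>>
>>>> // Sketch only: a replica is usable when it last published ACTIVE *and* its
>>>> // node is currently in live_nodes. A kill -9 leaves the published state
>>>> // ACTIVE, but the node drops out of live_nodes right away, so the combined
>>>> // check still comes out false.
>>>> public final class ReplicaAvailability {
>>>>
>>>>   public static boolean isAvailable(String publishedState, String nodeName,
>>>>                                     Set<String> liveNodes) {
>>>>     return "active".equals(publishedState) && liveNodes.contains(nodeName);
>>>>   }
>>>> }
>>>>
>>>> The point is just that nothing has to be written back to ZK on restart for
>>>> that check to come out right.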
>>>>
>>>> What would it take to get to a state where restarting a node would
>>>> require a minimal amount of ZK work in most cases?
>>>>
>>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <markrmil...@gmail.com>
>>>> wrote:
>>>>
>>>>> Give me a short bit to follow up and I will lay out my case and
>>>>> proposal.
>>>>>
>>>>> Everyone is then free to decide that we need to do something drastic,
>>>>> or that I'm wrong and we should just continue down the same road. If that's
>>>>> the case, a lot of your work will get a lot easier and less impeded by me,
>>>>> and we will still all be happier. Win-win.
>>>>>
>>>>> If we can just not make drastic changes for just a brief week-or-so
>>>>> window, I'll say what I have to say, and you guys can judge and do whatever
>>>>> you please.
>>>>>
>>>>> - mark
>>>>>
>>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <markrmil...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey all Solr devs,
>>>>>>
>>>>>> SolrCloud is sick right now. The way low-level ZooKeeper is handled,
>>>>>> the Overseer, is a mix and a mess of improper exception handling and super
>>>>>> slow startup and shutdown, with new things added all the time with no concern
>>>>>> for performance or proper ordering (which is harder to tell than you think).
>>>>>>
>>>>>> Our class dependency graph doesn't even work - we just force it. Sort
>>>>>> of. Unless the whole system blocks and chokes its way to a slow enough
>>>>>> start, lots of things fail.
>>>>>>
>>>>>> You toss stuff into the storm and, a good chunk of the time, this thing
>>>>>> eventually coughs up what you want without causing too much damage.
>>>>>>
>>>>>> There are so many things that are off or just plain wrong, and the list
>>>>>> is growing and growing. No one is following this, or if you are, please back
>>>>>> me up. This thing will collapse under its own weight.
>>>>>>
>>>>>> So if you want to add yet another cluster state format or some
>>>>>> other optimization on this junk heap, you can expect me to push back.
>>>>>>
>>>>>> We should all be embarrassed by the state of things.
>>>>>>
>>>>>> I've got some ideas for addressing them that I'll share soon, but god,
>>>>>> don't keep optimizing a turd in non-backcompat, Overseer-loving ways.
>>>>>> That Overseer is an atrocity.
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://about.me/markrmiller
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://about.me/markrmiller
>>>>>
>>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>>
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
>
>
> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller
