Things are also counterintuitive: the more you fix and the faster things work, the more things fail. It's like rings of hell.
Mark

On Sat, Nov 2, 2019 at 10:29 PM Mark Miller <markrmil...@gmail.com> wrote:

> And it didn't get any easier. What I did about it is kill myself multiple
> times over two years, for weeks on end, torturing my wife. And I found a
> million problems, a million bugs, a million terrible inefficiencies. And I
> fixed and lost countless of them, friggen twice. And lost tons of the
> work as well. And so it's not easy to get out of this. It's not easy at
> all. And I haven't even done the hard part yet.
>
> - Mark
>
> On Sat, Nov 2, 2019 at 10:24 PM Mark Miller <markrmil...@gmail.com> wrote:
>
>> I mean, the reality is - why do we not have just a single watcher per node
>> pulling in state? Why are we not tracking and minimizing state transfers
>> and changes? Why are we not measuring the time it takes to round trip a
>> state.json and adjusting? Looking at load to adjust overseer-ish duties
>> and leader election? A million other smart things?
>>
>> Because it's too hard. It's too hard and we all gave up long ago on
>> figuring out what to do about it. Because we are programming in assembly
>> in an abyss when we should be doing Java in the clouds.
>>
>> Everyone knows the SolrCloud DNA one way or another. We all somehow made
>> our peace with it, or not.
>>
>> It's easy when you don't go deep. Hell, that's easy to forget even if you
>> do.
>>
>> But I'm looping on it now, have to eject.
>>
>> - Mark
>>
>> On Sat, Nov 2, 2019 at 10:15 PM Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Not much. Something you can understand. How about tests < 10 seconds,
>>> fail or not? Good logging, and as a backup, good debug logging. Docs on
>>> how things are designed to work? Tracking of all important operations and
>>> how long they take, with tight cutoffs? Proper response to interruption
>>> 100% of the time? The idea of a cluster start and stop? Of a cluster
>>> install to ZK initially? Drop all legacyCloud support, stateformat=1
>>> support, maybe a few other things.
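The "single watcher per node pulling in state" idea above can be sketched roughly as follows. This is a hypothetical illustration, not existing Solr code: the ZooKeeper client is stubbed out entirely, and all names (`NodeStateCache`, `addListener`, `onStateChanged`) are invented. The point is only the shape - many consumers share one cache, and one watch per path feeds all of them, instead of each consumer setting its own watch on state.json.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/**
 * Hypothetical sketch of "a single watcher per node pulling in state":
 * consumers register with one shared per-node cache, and a single watch per
 * path fans changes out to all of them. The actual ZK calls are elided.
 */
public class NodeStateCache {
    private final Map<String, List<Consumer<byte[]>>> listeners = new ConcurrentHashMap<>();
    final AtomicInteger watchesSet = new AtomicInteger(); // exposed for the demo below

    /** Register interest in a path; only the first registration sets a watch. */
    public void addListener(String path, Consumer<byte[]> listener) {
        listeners.computeIfAbsent(path, p -> {
            watchesSet.incrementAndGet(); // real code: zk.getData(p, watcher, ...)
            return new CopyOnWriteArrayList<>();
        }).add(listener);
    }

    /** Invoked by the single watch when the node's data changes; fans out. */
    public void onStateChanged(String path, byte[] newData) {
        listeners.getOrDefault(path, List.of()).forEach(l -> l.accept(newData));
    }
}
```

However many consumers register for the same state.json path, only one watch ever gets set - which is exactly the state-transfer minimization the email is asking about.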
>>>
>>> I've got some stuff. I'm gonna pull it out as fast as I sensibly can,
>>> given many setbacks and too little sleep for a long time.
>>>
>>> I'm not here to do all of the lift for everyone, but unless I get sick
>>> in the next week or two, or my 10 backup methods and git pushes and
>>> backup branches fail, or I just burn the hell out, I have a solid refuge
>>> that we can knock out and then build on with confidence.
>>>
>>> - Mark
>>>
>>> On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dragonsi...@gmail.com> wrote:
>>>
>>>> Very much agreed. I've been trying to figure out for a long time what
>>>> is the point of having a replica DOWN state that has to be toggled (DOWN
>>>> and then UP!) every time a node restarts, considering that we could just
>>>> combine ACTIVE and `live_nodes` to understand whether a replica is
>>>> available. It's not even foolproof, since kill -9 on a Solr node won't
>>>> mark all the replicas DOWN -- that doesn't happen until the node comes
>>>> back up (perversely).
>>>>
>>>> What would it take to get to a state where restarting a node would
>>>> require a minimal amount of ZK work in most cases?
>>>>
>>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>>> Give me a short bit to follow up and I will lay out my case and
>>>>> proposal.
>>>>>
>>>>> Everyone is then free to decide that we need to do something drastic,
>>>>> or that I'm wrong and we should just continue down the same road. If
>>>>> that's the case, a lot of your work will get a lot easier and less
>>>>> impeded by me, and we will still all be happier. Win win.
>>>>>
>>>>> If we can just not make drastic changes for just a brief week-or-so
>>>>> window, I'll say what I have to say, and you guys can judge and do
>>>>> whatever you please.
>>>>>
>>>>> - mark
>>>>>
>>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <markrmil...@gmail.com> wrote:
>>>>>
>>>>>> Hey all Solr devs,
>>>>>>
>>>>>> SolrCloud is sick right now.
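Scott's point - that a replica's recorded state alone can lie (kill -9 leaves it ACTIVE in state.json while the ephemeral `live_nodes` entry vanishes), so availability should be the intersection of the two - can be sketched like this. The types are simplified stand-ins for illustration, not the real SolrJ API:

```java
import java.util.Set;

/**
 * Sketch of "combine ACTIVE and live_nodes": a replica is available only if
 * its recorded state says ACTIVE *and* its host node is actually live.
 * Relying on either signal alone gives wrong answers after a hard kill.
 */
public class ReplicaAvailability {
    enum State { ACTIVE, DOWN, RECOVERING }

    static boolean isAvailable(State state, String nodeName, Set<String> liveNodes) {
        return state == State.ACTIVE && liveNodes.contains(nodeName);
    }
}
```

Under this model the DOWN/UP toggle on every restart buys nothing: the ephemeral `live_nodes` entry already disappears the moment the node's ZK session dies, which is precisely the signal a hard kill can't fake.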
>>>>>> The way low-level ZooKeeper is handled -
>>>>>> the Overseer - is a mix and a mess of exception handling and super
>>>>>> slow startup and shutdown, adding new things all the time with no
>>>>>> concern for performance or proper ordering (which is harder to tell
>>>>>> than you think).
>>>>>>
>>>>>> Our class dependency graph doesn't even work - we just force it. Sort
>>>>>> of. If the whole system doesn't block and choke its way to a start
>>>>>> slow enough, lots of things fail.
>>>>>>
>>>>>> This thing coughs up: you toss stuff into the storm and, a good chunk
>>>>>> of the time, what you want eventually comes back without causing too
>>>>>> much damage.
>>>>>>
>>>>>> There are so many things that are off or just plain wrong, and the
>>>>>> list is growing and growing. No one is following this, or if you are,
>>>>>> please back me up. This thing will collapse under its own weight.
>>>>>>
>>>>>> So if you want to add yet another state format cluster state or some
>>>>>> other optimization on this junk heap, you can expect me to push back.
>>>>>>
>>>>>> We should all be embarrassed by the state of things.
>>>>>>
>>>>>> I've got some ideas for addressing them that I'll share soon, but god,
>>>>>> don't keep optimizing a turd in non-backcompat, Overseer-loving ways.
>>>>>> That Overseer is an atrocity.
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://about.me/markrmiller

--
- Mark

http://about.me/markrmiller
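The recurring asks in the thread - "tracking of all important operations and how long they take, with tight cutoffs" and "proper response to interruption 100% of the time" - could be made concrete with a wrapper like the one below. All names here are invented for illustration; this is not an existing Solr utility:

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch: wrap each important operation, measure it, and flag
 * it loudly when it blows its time budget instead of letting slowness hide.
 */
public class TimedOp {
    /** Runs op and returns true iff it finished within budgetMs. */
    static boolean withinBudget(String name, long budgetMs, Runnable op) {
        long start = System.nanoTime();
        op.run();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsedMs > budgetMs) {
            // real code: log at WARN and feed a metrics registry
            System.err.println(name + " exceeded budget: " + elapsedMs + "ms > " + budgetMs + "ms");
            return false;
        }
        return true;
    }

    /** "Proper response to interruption 100% of the time": never swallow it. */
    static void interruptibleWait(Object lock, long ms) throws InterruptedException {
        synchronized (lock) {
            lock.wait(ms); // let InterruptedException propagate; don't catch-and-ignore
        }
    }
}
```

The design point is that the budget is declared at the call site, so a test or an overseer task that used to silently take 30 seconds becomes an immediate, attributable failure.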