I think the other thing is that many devs like to understand what they are
doing and why at a high level, rather than reach down into the mud and feel
around.

You will find a lot of devs who will spend a tremendous amount of time
working to solve problems with what they have learned twiddling gc
parameters.
You find far fewer who will do the unknown work of looking into GC problems
caused by the system itself and addressing terrible behavior or leaks -
unless some work incident presents some obvious whale. If it's bad enough,
a customer and your employer will set up the mission and the parameters -
if it's bad but not quite that bad, it can live almost forever. I pointed
out a pretty nasty long survivor to Noble, one of the survivors that also
had teeth, so it was easy to remember and relate. The nice thing about
Noble, of course, is no fear. Sometimes we miss the communication link, but
if he catches on to what I'm trying to say, he's like oh man, and he
addresses it. I checked on that one just today; he got it. I've related a
variety of things to him where I didn't know if they really registered or
not - for most others they don't - and he went and got them. Nice but rare.

So all that Cloud low-level stuff: data loss, naive startup, unnecessary
waits, multiple waits, unnecessary syncs, broken and silly return syncs,
unreliable leadership, unreliable ZK ...

It's not something you sit there and reason all out at a high level. Maybe
a work incident occasionally points you to a whale, but even that is a high
bar for doing more than very targeted soldering - maybe you do more harm
than good in the small window of your strike mission - so conservative is
your friend.

So the system can be fairly silly at the core, and as long as it's not just
flatly dead in your face - much like most any given test - there is gc that
could be tuned. Review nits that could be found. High-level feature
improvements that could be made. Which is why 'you' even running through
and taking notes and spotting issues at a more fundamental level, while it
sounds silly to say, is pretty unusual and a bit of fresh air, and that's
why I often point it out. There is plenty you can spot. There is plenty
that is beyond complexity's mercy, and you have to pull it out. I found a
great hammer for that, but it requires something. In any case, if you look,
there are enough items, enough complexity, enough unknown requirements to
getting things improved, that most don't even want to risk diving in for an
item like improving how we time out on startup and ditch data. It went in
on day one. I'm sure it's gotten its dusting once since or something - new
timeout param name, new default timeout, something - but touching too hard
is not safe. It's not warm the way gc tuning and high-level api refining are.
There is major fear, because in isolated endeavors there is little time for
understanding and exploration. Sometimes there are more fearless moments;
that is what it took for LIR. But that brings no guarantees either. The
first LIR caused some of the most in-your-face problems users saw - rivaling
the problems it addressed, which would be rarely noticed - and it lingered
for some time, due to how long it took for upgrades to move beyond it and
how long it took to repeatedly attempt mitigations and release them. The
second LIR was a fantastic improvement, but many years on, it sits well
below even its basic finished state, promise and potential. Isolated
mission. Conservative. Forgotten. You don't set up a new strike team to go
after the terrorist you mostly got, I guess. The remaining targets don't
have the same potential, and there is comfortable, understandable,
customer-driven stuff to do that is much more amenable. Anyway, it's a
culture thing, a more common thing, an employer thing; the rare outliers
are just that, and their value tends to keep them from setting up camp to
dig on a system that is hard to pull satisfaction from.

So anyway, another thing you could look at is ConnectionManager.java.

File:
/mnt/s1/solr3/solr/solrj/src/java/org/apache/solr/common/cloud/ConnectionManager.java

  // Track the likely expired state
  private static class LikelyExpiredState {
    private static LikelyExpiredState NOT_EXPIRED = new LikelyExpiredState(StateType.NOT_EXPIRED, 0);
    private static LikelyExpiredState EXPIRED = new LikelyExpiredState(StateType.EXPIRED, 0);

    public enum StateType {
      NOT_EXPIRED,    // definitely not expired
      EXPIRED,        // definitely expired
      TRACKING_TIME   // not sure, tracking time of last disconnect
    }

    private StateType stateType;
    private long lastDisconnectTime;
    public LikelyExpiredState(StateType stateType, long lastDisconnectTime) {
      this.stateType = stateType;
      this.lastDisconnectTime = lastDisconnectTime;
    }

    public boolean isLikelyExpired(long timeToExpire) {
      return stateType == StateType.EXPIRED
          || (stateType == StateType.TRACKING_TIME
              && (System.nanoTime() - lastDisconnectTime
                  > TimeUnit.NANOSECONDS.convert(timeToExpire, TimeUnit.MILLISECONDS)));
    }
  }

This is how we track 'likelyExpired'. Usually the issue faced is that the
machine is a bit overloaded: dealing with gc pauses that are too long, too
many threads and updates. It's not that a meteor hit ZK server 1 and so
server 2 is taking over - that is pretty rare in comparison, and even that
case probably does not favor this behavior. The system is having ZK
connection problems, and our strategy is basically to ask: how long do you
think we can ignore it? And the thing is, ignoring it is not often going to
end up so great even in the best of cases - we need to call out to ZK in
some surprising places. There is even a spot in the update chain, with a
great comment of shame, that calls ZK directly. But it's also the opposite
of what ZK tells you is the right idea, and they are correct: back off on
connection issues - chill out - let it come back - then continue. That is,
among other reasons, why retrying the way we do with ZkCmdExecutor is also
not a good idea. If you back off instead, intermittent problems tend to
resolve much faster rather than spiraling down, and because you just wait a
bit instead of kicking exceptions and failures back to the user right away,
the system is much, much more stable and reliable.
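
To make the back-off idea concrete, here is a minimal sketch - my
illustration, not Solr's ZkCmdExecutor; the ConnectionGate interface is a
hypothetical stand-in for something like ConnectionManager.waitForConnected:

  // Sketch only: on connection trouble, wait (bounded) for the client session
  // to recover instead of retrying in a tight loop or failing the caller right away.
  import java.util.concurrent.Callable;
  import java.util.concurrent.TimeUnit;
  import org.apache.zookeeper.KeeperException;

  public final class BackoffZkCall {

    /** Hypothetical stand-in for something like ConnectionManager.waitForConnected(ms). */
    public interface ConnectionGate {
      void waitForConnected(long waitForMillis) throws InterruptedException;
    }

    public static <T> T run(Callable<T> op, ConnectionGate gate, long maxWaitMs) throws Exception {
      long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
      while (true) {
        try {
          return op.call();
        } catch (KeeperException.SessionExpiredException e) {
          throw e; // a new session is required; retrying the same call won't help
        } catch (KeeperException.ConnectionLossException e) {
          // Transient: the client usually reconnects on the same session. Chill out.
          long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
          if (remainingMs <= 0) {
            throw e; // only give up once the wait budget is spent
          }
          gate.waitForConnected(remainingMs);
        }
      }
    }
  }

The point is the shape of it: no immediate re-issue, no immediate exception
to the caller, just a bounded pause that lets the connection come back.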

Next, if you dig through the process method of the watcher:

File:
/mnt/s1/solr3/solr/solrj/src/java/org/apache/solr/common/cloud/ConnectionManager.java

  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == AuthFailed || event.getState() == Disconnected || event.getState() == Expired) {
      log.warn("Watcher {} name: {} got event {} path: {} type: {}", this, name, event, event.getPath(), event.getType());
    } else {
      if (log.isDebugEnabled()) {
        log.debug("Watcher {} name: {} got event {} path: {} type: {}", this, name, event, event.getPath(), event.getType());
      }
    }

    if (isClosed()) {
      log.debug("Client->ZooKeeper status change trigger but we are already closed");
      return;
    }

    KeeperState state = event.getState();

    if (state == KeeperState.SyncConnected) {
      log.info("zkClient has connected");
      connected();
      connectionStrategy.connected();
    } else if (state == Expired) {
      if (isClosed()) {
        return;
      }
      // we don't call disconnected here, because we know we are expired
      connected = false;
      likelyExpiredState = LikelyExpiredState.EXPIRED;

      log.warn("Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...");

      if (beforeReconnect != null) {
        try {
          beforeReconnect.command();
        } catch (Exception e) {
          log.warn("Exception running beforeReconnect command", e);
        }
      }

Look at everything we do inline in that process method. Here and there we
have some very small-window synchronization or whatever.

Now, you normally don't have to worry about what you do in a watcher's
process method. We can say that because every watcher has its notification
fired on a thread from a big fat executor, not the ZK event thread. This is
not really typical ZK, but it kind of lets you mitigate things and not have
to worry so much about what you do in that process loop. It also has plenty
of downsides in terms of resource management. The result is chaos that is a
bit tough to manage: those watcher events can now come out of order. Or you
limit the executor to one thread and they come in order, but serially.
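
Roughly, the dispatch pattern being described looks like this - a sketch of
the general technique, not Solr's actual wrapper class:

  import java.util.concurrent.ExecutorService;
  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;

  // Sketch: hand watcher notifications to an executor so the ZooKeeper event
  // thread returns immediately and the real work happens on a pool thread.
  public final class ExecutorWatcher implements Watcher {
    private final Watcher delegate;
    private final ExecutorService executor;

    public ExecutorWatcher(Watcher delegate, ExecutorService executor) {
      this.delegate = delegate;
      this.executor = executor;
    }

    @Override
    public void process(WatchedEvent event) {
      // With more than one pool thread, two events for the same watcher can be
      // processed out of order; with a single thread, order is preserved but
      // everything serializes behind whatever the current notification is doing.
      executor.execute(() -> delegate.process(event));
    }
  }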

Anyway, the ConnectionManager - this class - uses a separate executor with
one thread. So basically, for everything we do in that process method, we
have locked off further ZK connection event notifications while it runs.
All I can say is that may not be the best situation for producing ideal
behavior.

This stuff is tricky in the best of cases - on very old code stood up on
very old ideas of what was reasonable, tricky would be an enjoyment.
Especially since, for as much complexity and poor behavior as you can find
in each of these classes and distinct functions and implementations, they
all tie together into a complexity multiplication party. Which is why I try
to balance the fact that I know it can be addressed against also knowing
that, in many cases, that is likely poor information if taken the wrong
way. Curator ;)
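
For contrast, here is a minimal Curator sketch - my illustration, nothing
Solr ships - showing how the back-off-and-recover behavior above is baked
into the client through retry policies and connection-state listeners:

  import org.apache.curator.framework.CuratorFramework;
  import org.apache.curator.framework.CuratorFrameworkFactory;
  import org.apache.curator.framework.state.ConnectionState;
  import org.apache.curator.retry.ExponentialBackoffRetry;
  import org.apache.zookeeper.data.Stat;

  public class CuratorSketch {
    public static void main(String[] args) throws Exception {
      // Retry failed operations with exponential back-off: ~1s, 2s, 4s, up to 3 retries.
      CuratorFramework client = CuratorFrameworkFactory.newClient(
          "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));

      client.getConnectionStateListenable().addListener((c, newState) -> {
        // SUSPENDED: disconnected, but the session may still be alive.
        // LOST: the session is gone; ephemeral nodes and watches must be rebuilt.
        if (newState == ConnectionState.LOST) {
          // re-register watches, re-create ephemeral nodes, etc.
        }
      });

      client.start();
      client.blockUntilConnected();

      // Operations are retried under the policy on recoverable connection errors.
      Stat stat = client.checkExists().forPath("/example");
      System.out.println("/example exists: " + (stat != null));

      client.close();
    }
  }

The point is not this exact snippet; it's that the connection lifecycle and
retry discipline live in one place instead of being re-derived at every
call site.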

You will also see that waitForConnected method I mentioned - I can't
remember if it wants improvement or is fine close to as-is, but if you
look, a couple of random places already use it this way. One is in
ZkShardTerms:

File:
/mnt/s1/solr3/solr/core/src/java/org/apache/solr/cloud/ZkShardTerms.java

  private void retryRegisterWatcher() {
    while (!isClosed.get()) {
      try {
        registerWatcher();
        return;
      } catch (KeeperException.SessionExpiredException | KeeperException.AuthFailedException e) {
        isClosed.set(true);
        log.error("Failed watching shard term for collection: {} due to unrecoverable exception", collection, e);
        return;
      } catch (KeeperException e) {
        log.warn("Failed watching shard term for collection: {}, retrying!", collection, e);
        try {
          zkClient.getConnectionManager().waitForConnected(zkClient.getZkClientTimeout());
        } catch (TimeoutException te) {
          if (Thread.interrupted()) {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                "Error watching shard term for collection: " + collection, te);
          }
        }
      }
    }
  }

That Dat is sharp as hell; he was on the right path about things left and
right, surely without enough deference or time to devote to it. I tried to
get him back in the game, but I'd already given up my clout, and he was
location-tricky.

But that is the move.

This ConnectionManager, the Leader Election, the Leader Sync, the
SolrZkClient, the Overseer's distaste and disregard for the rest of the
system - these are the core, long-lived, fundamental issues. You can find a
lot of issues in a code base of this size and complexity that has battled
off so many for so long, but a lot of that is much more livable. The fact
that the heart of the system is these heavily flawed and neglected pieces
is where the real meat is. And there is more than I can just pull off the
top of my head taking a quick gander. I didn't find great items by just
staring at code for 30 seconds and using some pre-gained knowledge to
turbocharge a quick glance analysis. I spent a tremendous amount of time
with a system that exposed issues to me - issues I'd never have reasoned
out or guessed or turbo-analyzed.

And I won't pretend that I have the best answers for your specific needs
and situation and desires and coworkers and community future. When I say
these things can work, and to the degree they can work, and when I say
Curator can work, or something else I tried on another go can work, that's
meant to be additional information, helpful information, my experience that
I bring back for ill or good. In many cases, I personally would not travel
many of the roads I've said can be made to work. Replacement, design
around, simplify, scale down - I'd look at the whole toolset depending on
all the constraints. To know that many of these things can largely work as
is, or with some specified different component or direction, is from my end
simply more data points in the cap. One of the frequent, bat-to-the-head
takeaways I repeatedly got when working through making something work was
how it was a terrible, trappy situation to begin with, even in the best
case. The first time I really started getting to the bottom of stuff, I
stopped thinking so hard about what could work and how well, and I started
really focusing on what the problems are and why, and what to do to combat
those issues in a group of disparate developers of different levels and
code familiarity - much more so than what you needed to do to combat the
system.

But the group of developers around the surface and edges of the cloud core
were even more indignant and outraged at that angle than at the 'it's
pretty damn poor and can be pretty damn good' angle. Those closer to the
core were easy, but already few and already filtering off and out, or
locked up in various ways.

But that is of course the same calculus today as then. If you can develop
and plan defensively, with all of the current mishaps and silliness that
goes on, and if looking through it seeds a bunch of better ideas, you can
set yourself up a lot better than by simply fixing and improving the
clearly poor bits into faster, prettier bits. Much more impactful than
their current quality is the story of them, and how a new story might end
up with different results.

Which is to say, break things :) Change things. Do things differently. If I
tell you this design can fly and be solid, it's because that is what I have
to tell you. Flying is like one quadrant of four; flying solid a bit more.
Just parts of the puzzle - the ones I settled into and could still enjoy
after a certain point, once the other pieces became too unenjoyable.
