[Evergreen-dev] Re: When to go to 4.0

Mike Rylander via Evergreen-dev Wed, 09 Jul 2025 09:45:23 -0700

FWIW, I'm -1 on calling the next release 4.0 as of today, because the
biggest planned change is probably the breaking-est -- the merge of
OpenSRF and the xmpp-to-redis change -- and it's just not ready yet.


I'll say up front that if we /don't/ merge OpenSRF into EG before the
next release (and IMO we should not, based on the state of things
today), and therefore don't force Redis, but we still want to call it
4.0 for other big reasons, I would definitely soften my -1 to -0.5 or
less.

If you don't care much about the details of the Redis stuff, that -^
is my top line thought on the "should we call it 4.0" question, and
you can ignore the rest of my rant! ;)

-------

I've been working on the opensrf-on-redis infrastructure for the last
month or so with the goal of bringing back the HA and LB functionality
that we got for free with XMPP.

TL;DR: I'm close, but because of inherent foundational differences in
the design and purpose of XMPP vs Redis, our code will simply have to
be more complicated going forward.

IMO, the major Redis-related issues in origin/main of the opensrf repo
(and the state of my changes relative to it) are:

* It's extremely complicated and labor intensive (and maybe
impossible, but I only tried to make it work for a couple days) to
configure multiple, separate but interacting OpenSRF domains across
different Redis servers.  At the other end of the spectrum, it's also
impossible to configure multi-tenant redis servers.
    -- This is mainly a /configuration capabilities/ issue, not
primarily a code issue, because Bill did add OpenSRF usernames and
domains (xmpp domains, before; hosts that run redis, now) to the redis
keys used by EG.  The structure of the keys is not future-proof and
doesn't follow redis key space pattern recommendations (at least WRT
planning for Redis-level clustering, HA, and LB), but since it exists
today we should be able to change the key structure later at a
breaking upgrade event (or, whenever we want, if OpenSRF is merged
into EG).  However, having the "bus" account configuration duplicated
externally, and configured using a single static file, is not tenable.
    ++ I've addressed this by adjusting the redis config requirements
a little, and providing three new configuration modes, targeting use
cases of different complexity/need:
      1) Instead of leaving the redis server open and unprotected by
default and trying to find the password in the "bus accounts" file,
the Redis "requirepass" setting is used to supply the password for the
"default" (admin/root/whatever) user.
      2) osrf_control can receive that password from
        a) the REDISCLI_AUTH env variable -- generally securable from outside.
        b) a dedicated file's content -- at least the file can be
locked down to a specific unix user.
        c) a command line option -- meh, handy for manual use, but
shows up in `ps`.
        d) extracted from the "bus accounts file" from before, for back-compat.
      3) Made configuring Redis users/ACLs more flexible:
        a) the existing "bus accounts file" mechanism continues to
exist, but since the same file is applied to every domain and is not
domain- or user-aware, it's not safe for an HA/LB env.
        b) a TT2 template can be supplied; it is processed for each
domain separately, so complicated setups can be encoded in the
template -- this is intended to provide an HA/LB-safe version of (a).
        c) osrf_control can dynamically create the necessary ACLs for
the router, service, client, and gateway users and keys specific to
each domain -- this is the mechanism that has the broadest set of use
cases, I think.
        d) OpenSRF can be told that Redis' built in ACL infrastructure
(the "aclfile" Redis config file setting, and friends) will just
handle it, and a bus reset request just issues an "ACL LOAD" command
to tell redis to refresh ACLs in its native way -- this mechanism
provides the most logical separation, and I think will be useful in
highly controlled/automated environments that want to make use of the
Redis-developer-intended tools for ACL config.
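
To make item 2 above concrete, here is a minimal Python sketch of
picking the admin password from the four sources listed.  Only the
REDISCLI_AUTH variable name comes from the text; the function name, its
parameters, and especially the precedence order (command line, then
env, then file, then legacy value) are assumptions for illustration,
not a description of what osrf_control actually does.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_redis_password(
    cli_password: Optional[str] = None,
    password_file: Optional[str] = None,
    legacy_password: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first password found, or None if no source has one.

    Hypothetical helper: the precedence used here is an assumption.
    """
    env = os.environ if env is None else env

    # (c) explicit command line option: handy, but visible in `ps`
    if cli_password:
        return cli_password

    # (a) REDISCLI_AUTH environment variable
    if env.get("REDISCLI_AUTH"):
        return env["REDISCLI_AUTH"]

    # (b) dedicated file, which can be locked down to one unix user
    if password_file and Path(password_file).is_file():
        content = Path(password_file).read_text().strip()
        if content:
            return content

    # (d) back-compat: a value already extracted from the old
    # "bus accounts" file by the caller (parsing not shown here)
    return legacy_password or None
```

Passing `env` explicitly keeps the lookup easy to exercise without
touching the real process environment.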

 * LB (cross-registration of OpenSRF domains) does not work
    -- The register and unregister commands add additional instances
to an internal list of endpoints for each service, but the router
always uses the first entry in the list.  The effect is that all
traffic gets shoveled to the first-registered instance (not
necessarily the local one, mind) until that instance actively
deregisters, then it moves to the next one that registered.
    ++ I've added list rotation. That works and is an obvious fix, of
course, but it points out that the code is definitely not fully baked
or feature-tested, and it's lacking existing fault tolerance at an
infrastructure level.
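
The difference between "always use the first entry" and list rotation
can be sketched in a few lines of Python.  This is only an in-memory
model of the behavior described above; the class, method names, and
endpoint strings are hypothetical and do not mirror the OpenSRF router
code.

```python
from collections import deque
from typing import Deque, Dict


class ServiceRegistry:
    """Toy model of a router's per-service endpoint list."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, Deque[str]] = {}

    def register(self, service: str, endpoint: str) -> None:
        queue = self._endpoints.setdefault(service, deque())
        if endpoint not in queue:
            queue.append(endpoint)

    def unregister(self, service: str, endpoint: str) -> None:
        queue = self._endpoints.get(service)
        if queue and endpoint in queue:
            queue.remove(endpoint)

    def next_endpoint(self, service: str) -> str:
        """Return the head of the list, then rotate it to the back."""
        queue = self._endpoints.get(service)
        if not queue:
            raise LookupError(f"no endpoints registered for {service}")
        endpoint = queue[0]
        queue.rotate(-1)  # without this line, the head is used forever
        return endpoint
```

With two registered endpoints, successive calls to `next_endpoint`
alternate between them instead of sending everything to whichever one
registered first.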

 * HA does not work, and LB (when fixed as above) is not safe
    -- Even after addressing the LB part of the cross-registration
functionality, there is no way to detect that a service instance
previously registered is no longer available and should be removed
from the delivery list.  Because we're using redis LISTs to stand in
for (effectively) stateful TCP sockets and receive buffers, we end up
just tossing requests into the void and hoping that someone comes
along to service them.  Put another way, if a listener dies, we have
no way of detecting that at the OpenSRF level and accounting for the
failure.  This makes LB /more/ dangerous (think something akin to
split-brain DNS problems), because we can't trust either our internal
state or the message delivery information from redis.  This is also
something that we got 100% for free in XMPP, because message delivery
to an actual endpoint was verified and we got an error when that
failed, so we could resend to another service instance.  Now the
message just falls into the void on a LIST key that nobody is looking
at.
    ++ I'm working on moving from LISTs to STREAMs for router and
service keys. Other than the slight difference in surface-level
commands, it's no harder to use streams than lists.  What this will
allow us to do is recheck the state of previously sent messages, and
if 1) they're "stale" and 2) no service instance has claimed them for
processing, we can retract the message from the stream, deregister the
service instance behind the redis key on which the message went stale,
and send it to another service instance.  I have the baseline change
from LISTs to STREAMs working now, modulo some debug-logging cleanup
and chasing down a couple possible leaks and corner cases, but the
redis docs are fighting me at every step. (Just ask separately if you
want to hear more about that.)  I also have a proof of concept version
of the message retraction and resend code, but I really want to
rewrite that using what I've learned (*sad face*) in the last few
weeks about redis.
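
The recheck-and-resend idea described above can be modeled without
Redis at all.  The Python sketch below is a pure in-memory
illustration: a message counts as recoverable when it is both stale
and unclaimed, at which point the instance behind its key is
deregistered and the message is redirected.  All names, the data
layout, and the "pick the first remaining instance" choice are
assumptions; none of this is the actual proof-of-concept code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SentMessage:
    msg_id: str
    service: str
    instance: str          # key of the instance the message was sent to
    sent_at: float         # timestamp of the (re)send
    claimed: bool = False  # True once some service instance picked it up


def recover_stale(
    sent: List[SentMessage],
    instances: Dict[str, List[str]],
    now: float,
    stale_after: float,
) -> List[Tuple[str, str, Optional[str]]]:
    """Redirect stale, unclaimed messages; return (id, old, new) actions.

    `instances` maps service name -> registered instance keys and is
    mutated: an instance whose message went stale is deregistered.
    `new` is None when no other instance is left to resend to.
    """
    actions = []
    for msg in sent:
        if msg.claimed or now - msg.sent_at < stale_after:
            continue  # either being processed or still fresh
        registered = instances.get(msg.service, [])
        old = msg.instance
        if old in registered:
            registered.remove(old)  # deregister the silent instance
        new = registered[0] if registered else None
        if new is not None:
            msg.instance = new      # "retract and resend"
            msg.sent_at = now
        actions.append((msg.msg_id, old, new))
    return actions
```

Running this periodically over the list of outstanding messages gives
the caller a record of which instances were dropped and where each
message went next.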

 * Infrastructure-level clustering isn't possible
    -- Whether ejabberd or Redis, infrastructure clustering (transparent
HA at the infrastructure level) isn't "easy", and the hard parts have
to live somewhere... In the XMPP world, that was mostly ejabberd's
problem and it handled it well.  Redis has the concept of clustering,
but (so far) we've chosen to not only ignore that, but to construct
things in such a way that the redis cluster stuff /cannot be used
effectively/.  I have no proof-of-concept code to address this, yet.
We may never have the option to configure things to be as
transparently robust in the redis world as we do today with ejabberd.
That may not matter to most people most of the time, but it's a point
I feel compelled to raise because it's definitely a loss to admins of
large, complex, heavily automated installations (even if they're not
aware of that loss).
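
For background on why key structure matters to Redis-level clustering
at all: Redis Cluster maps every key to one of 16384 hash slots using
CRC16 of the key, and when a key contains a non-empty {...} section
only that section is hashed, so keys sharing a tag land in the same
slot.  The Python sketch below reimplements that documented rule
locally.  The key names are hypothetical, not the keys OpenSRF uses,
and the message above does not say which cluster constraints are the
actual obstacle.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: str) -> int:
    """Hash slot for a key, honoring the first non-empty {tag}."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]  # hash only the tag contents
    return crc16_xmodem(key.encode()) % 16384


# Hypothetical keys: both share the "{domain-a}" tag, so they map to
# the same slot regardless of the rest of the key.
same_slot = (
    key_hash_slot("{domain-a}:router")
    == key_hash_slot("{domain-a}:service:example")
)
```

Keys without a shared tag are spread across slots independently, which
is what makes multi-key operations across them awkward in a cluster.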

I'll be pushing up a branch covering the first two points this week or
next, and hopefully be able to follow up with the HA fixes ASAP.

Thanks for following my rant this far... :)

--
Mike Rylander
Research and Development Manager
Equinox Open Library Initiative
1-877-OPEN-ILS (673-6457)
work: mi...@equinoxoli.org
personal: mrylan...@gmail.com
https://equinoxOLI.org

On Tue, Jul 8, 2025 at 7:22 PM Jeff Davis via Evergreen-dev
<evergreen-dev@list.evergreen-ils.org> wrote:
>
> We've been talking about calling our next major release Evergreen 4.0, rather 
> than 3.16.
>
> Is there a list of features that we want to include in a 4.0 release? Should 
> we hold off on bumping the version number to 4.0 until those features are 
> ready?
>
> Some candidates for "features that warrant going to 4.0":
> - Making Angular circ the standard circ UI, rather than experimental. My 
> understanding is that we don't expect that to happen in the next release.
> - Merging OpenSRF into Evergreen (LP#2032835). We were waiting to replace 
> ejabberd with Redis before doing that; Redis is now supported in Evergreen, 
> but I don't know if anyone has revisited merging OpenSRF into EG since then.
> - There are a number of bugs targeted to "4.0-beta" in Launchpad, but AFAIK 
> they are just targeting the next major release, whether it's called 4.0 or 
> not.
>
> Any opinions? I would prefer to reserve "4.0" for a release that is somehow 
> "more" than just the next major release, but I recognize that version 
> numbering is basically arbitrary.
> --
> Jeff Davis
> BC Libraries Cooperative
> _______________________________________________
> Evergreen-dev mailing list -- evergreen-dev@list.evergreen-ils.org
> To unsubscribe send an email to evergreen-dev-le...@list.evergreen-ils.org
