FWIW, I'm -1 on calling the next release 4.0 as of today, because the biggest planned change is probably the breaking-est -- the merge of OpenSRF and the xmpp-to-redis change -- and it's just not ready yet.
I'll say up front that if we /don't/ merge OpenSRF into EG before the next release (and IMO we should not, based on the state of things today), and therefore don't force Redis, but we still want to call it 4.0 for other big reasons, I would definitely soften my -1 to -0.5 or less. If you don't care much about the details of the Redis stuff, that -^ is my top-line thought on the "should we call it 4.0" question, and you can ignore the rest of my rant! ;)

-------

I've been working on the opensrf-on-redis infrastructure for the last month or so with the goal of bringing back the HA and LB functionality that we got for free with XMPP.

TL;DR: I'm close, but because of inherent foundational differences in the design and purpose of XMPP vs Redis, our code will simply have to be more complicated going forward.

IMO, the major issues in (and the state of my changes compared to) origin/main of the opensrf repo, re redis are:

* It's extremely complicated and labor-intensive (and maybe impossible, but I only tried to make it work for a couple of days) to configure multiple, separate but interacting OpenSRF domains across different Redis servers. At the other end of the spectrum, it's also impossible to configure multi-tenant redis servers.

-- This is mainly a /configuration capabilities/ issue, not primarily a code issue, because Bill did add OpenSRF usernames and domains (xmpp domains, before; hosts that run redis, now) to the redis keys used by EG. The structure of the keys is not future-proof and doesn't follow redis key space pattern recommendations (at least WRT planning for Redis-level clustering, HA, and LB), but since it exists today we should be able to change the key structure later at a breaking upgrade event (or, whenever we want, if OpenSRF is merged into EG). However, having the "bus" account configuration duplicated externally, and configured using a single static file, is not tenable.
++ I've addressed this by adjusting the redis config requirements a little, and providing three new configuration modes, targeting use cases of different complexity/need:

  1) Instead of leaving the redis server open and unprotected by default and trying to find the password in the "bus accounts" file, the Redis "requirepass" setting is used to supply the password for the "default" (admin/root/whatever) user.

  2) osrf_control can receive that password from:
     a) the REDISCLI_AUTH env variable -- generally securable from outside.
     b) a dedicated file's content -- at least the file can be locked down to a specific unix user.
     c) a command line option -- meh, handy for manual use, but shows up in `ps`.
     d) the "bus accounts file" from before, extracted for back-compat.

  3) Made configuring Redis users/ACLs more flexible:
     a) the existing "bus accounts file" mechanism continues to exist, but because the same file is applied to each domain it's not safe for an HA/LB env because it's not domain- or user-aware.
     b) a TT2 template can be supplied; it is processed for each domain separately, so complicated setups can be encoded in the template -- this is intended to provide an HA/LB-safe version of (a).
     c) osrf_control can dynamically create the necessary ACLs for the router, service, client, and gateway users and keys specific to each domain -- this is the mechanism that has the broadest set of use cases, I think.
     d) OpenSRF can be told that Redis' built-in ACL infrastructure (the "aclfile" Redis config file setting, and friends) will just handle it, and a bus reset request just issues an "ACL LOAD" command to tell redis to refresh ACLs in its native way -- this mechanism provides the most logical separation, and I think will be useful in highly controlled/automated environments that want to make use of the Redis-developer-intended tools for ACL config.
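To make the four password sources in (2) concrete, here is a rough sketch of how such a lookup could be chained. It is not the osrf_control code: the function name, its parameters, and above all the lookup order are assumptions made for illustration, since the order in which the sources are consulted is not stated above. REDISCLI_AUTH is the same environment variable that redis-cli itself honors.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_redis_password(
    cli_password: Optional[str] = None,
    password_file: Optional[str] = None,
    bus_accounts_password: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the first password found, or None if no source supplies one.

    The order used here (command line, environment, dedicated file,
    legacy bus accounts file) is an assumption made for this sketch.
    """
    # (c) An explicit command-line value, if one was given.
    if cli_password:
        return cli_password
    # (a) The REDISCLI_AUTH environment variable.
    if env.get("REDISCLI_AUTH"):
        return env["REDISCLI_AUTH"]
    # (b) A dedicated file whose whole content is the password.
    if password_file and Path(password_file).is_file():
        content = Path(password_file).read_text().strip()
        if content:
            return content
    # (d) Back-compat: a value already extracted from the bus accounts file.
    return bus_accounts_password or None
```

Passing `env` explicitly keeps the function easy to test; a caller that wants the real environment can simply omit it.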
* LB (cross-registration of OpenSRF domains) does not work

-- The register and unregister commands add additional instances to an internal list of endpoints for each service, but the router always uses the first entry in the list. The effect is that all traffic gets shoveled to the first-registered instance (not necessarily the local one, mind) until that instance actively deregisters, then it moves to the next one that registered.

++ I've added list rotation. That works and is an obvious fix, of course, but it points out that the code is definitely not fully baked or feature-tested, and it's lacking the fault tolerance we have today at an infrastructure level.

* HA does not work, and LB (when fixed as above) is not safe

-- Even after addressing the LB part of the cross-registration functionality, there is no way to detect that a previously registered service instance is no longer available and should be removed from the delivery list. Because we're using redis LISTs to stand in for (effectively) stateful TCP sockets and receive buffers, we end up just tossing requests into the void and hoping that someone comes along to service them. Put another way, if a listener dies, we have no way of detecting that at the OpenSRF level and accounting for the failure. This makes LB /more/ dangerous, because we can't trust either our internal state or the message delivery information from redis: think something akin to split-brain DNS problems. This is also something that we got 100% for free in XMPP, because message delivery to an actual endpoint was verified and we got an error when that failed, so we could resend to another service instance. Now the message just falls into the void on a LIST key that nobody is looking at.

++ I'm working on moving from LISTs to STREAMs for router and service keys. Other than the slight difference in surface-level commands, it's no harder to use streams than lists.
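For anyone who wants to picture the list rotation fix from the LB item above, a minimal sketch follows. It is an in-memory model only, not the router code: the class, its method names, and the endpoint strings are hypothetical.

```python
from collections import deque
from typing import Deque, Dict


class ServiceRegistry:
    """Hypothetical stand-in for the router's per-service endpoint lists."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, Deque[str]] = {}

    def register(self, service: str, endpoint: str) -> None:
        # Append a newly registered instance unless it is already listed.
        instances = self._endpoints.setdefault(service, deque())
        if endpoint not in instances:
            instances.append(endpoint)

    def unregister(self, service: str, endpoint: str) -> None:
        # Drop an instance that actively deregistered.
        instances = self._endpoints.get(service)
        if instances and endpoint in instances:
            instances.remove(endpoint)

    def next_endpoint(self, service: str) -> str:
        # Take the head of the list, then rotate so the next call
        # picks a different instance instead of always the first one.
        instances = self._endpoints.get(service)
        if not instances:
            raise LookupError(f"no registered instances for {service}")
        endpoint = instances[0]
        instances.rotate(-1)
        return endpoint
```

With two instances registered for a service, successive `next_endpoint()` calls alternate between them. Note that rotation alone still hands out an instance that has died without deregistering, which is the HA problem described above.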
What the move to STREAMs will allow us to do is recheck the state of previously sent messages, and if 1) they're "stale" and 2) no service instance has claimed them for processing, we can retract the message from the stream, deregister the service instance behind the redis key on which the message went stale, and send it to another service instance.

I have the baseline change from LISTs to STREAMs working now, modulo some debug-logging cleanup and chasing down a couple of possible leaks and corner cases, but the redis docs are fighting me at every step. (Just ask separately if you want to hear more about that.)

I also have a proof-of-concept version of the message retraction and resend code, but I really want to rewrite that using what I've learned (*sad face*) in the last few weeks about redis.

* Infrastructure-level clustering isn't possible

-- Whether ejabberd or Redis, infrastructure clustering (transparent HA at the infrastructure level) isn't "easy", and the hard parts have to live somewhere... In the XMPP world, that was mostly ejabberd's problem and it handled it well. Redis has the concept of clustering, but (so far) we've chosen to not only ignore that, but to construct things in such a way that the redis cluster stuff /cannot be used effectively/. I have no proof-of-concept code to address this, yet. We may never have the option to configure things to be as transparently robust in the redis world as we do today with ejabberd. That may not matter to most people most of the time, but it's a point I feel compelled to raise because it's definitely a loss to admins of large, complex, heavily automated installations (even if they're not aware of that loss).

I'll be pushing up a branch covering the first two points this week or next, and hopefully be able to follow up with the HA fixes ASAP.

Thanks for following my rant this far...
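As an illustration of the stale-message handling described above (recheck, retract, deregister, resend), here is a small in-memory model of the decision logic. It does not talk to Redis and is not the proof-of-concept code mentioned above; the data structure, the function, the key names, and the idea of a single staleness threshold are all assumptions for the sake of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SentMessage:
    msg_id: str
    service_key: str       # key of the instance the message was sent to
    sent_at: float         # seconds, on the same clock as `now`
    claimed: bool = False  # True once an instance has picked it up


def sweep_stale(
    sent: List[SentMessage],
    live_keys: List[str],
    now: float,
    stale_after: float,
) -> List[Tuple[str, Optional[str]]]:
    """Reassign stale, unclaimed messages to another instance.

    Mutates `live_keys` (deregistering keys whose messages went stale)
    and `sent`. Returns (msg_id, new_key) pairs; new_key is None when
    no instance is left to take the message.
    """
    moves: List[Tuple[str, Optional[str]]] = []
    for msg in sent:
        # Skip anything already claimed or not yet old enough.
        if msg.claimed or now - msg.sent_at < stale_after:
            continue
        # Deregister the instance behind the key the message went stale on.
        if msg.service_key in live_keys:
            live_keys.remove(msg.service_key)
        # Resend to the first remaining instance, if there is one.
        new_key = live_keys[0] if live_keys else None
        moves.append((msg.msg_id, new_key))
        if new_key is not None:
            msg.service_key = new_key
            msg.sent_at = now
    return moves
```

In a real implementation the "retract" step would have to be an actual removal from the stream before the resend, so that a late-waking instance cannot process the same request twice; the model above only records where the message should go next.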
:)

--
Mike Rylander
Research and Development Manager
Equinox Open Library Initiative
1-877-OPEN-ILS (673-6457)
work: mi...@equinoxoli.org
personal: mrylan...@gmail.com
https://equinoxOLI.org

On Tue, Jul 8, 2025 at 7:22 PM Jeff Davis via Evergreen-dev <evergreen-dev@list.evergreen-ils.org> wrote:
>
> We've been talking about calling our next major release Evergreen 4.0, rather
> than 3.16.
>
> Is there a list of features that we want to include in a 4.0 release? Should
> we hold off on bumping the version number to 4.0 until those features are
> ready?
>
> Some candidates for "features that warrant going to 4.0":
> - Making Angular circ the standard circ UI, rather than experimental. My
> understanding is that we don't expect that to happen in the next release.
> - Merging OpenSRF into Evergreen (LP#2032835). We were waiting to replace
> ejabberd with Redis before doing that; Redis is now supported in Evergreen,
> but I don't know if anyone has revisited merging OpenSRF into EG since then.
> - There are a number of bugs targeted to "4.0-beta" in Launchpad, but AFAIK
> they are just targeting the next major release, whether it's called 4.0 or
> not.
>
> Any opinions? I would prefer to reserve "4.0" for a release that is somehow
> "more" than just the next major release, but I recognize that version
> numbering is basically arbitrary.
> --
> Jeff Davis
> BC Libraries Cooperative
> _______________________________________________
> Evergreen-dev mailing list -- evergreen-dev@list.evergreen-ils.org
> To unsubscribe send an email to evergreen-dev-le...@list.evergreen-ils.org