I have two servers in AWS. One is a live production server (a multisite
WordPress installation with hundreds of sites and about 5,000 users) and
the other is a clone of prod that is being configured as a test server.
The live one has four array servers and an Elastic Load Balancer, and is
connected to a large RDS in AWS. Until yesterday, I naively thought
our caching was being handled via APC and a WordPress plugin here and
there. But no. Turns out someone here had added AWS's ElastiCache to our
live server. For those not in the cloud, ElastiCache is essentially
memcache.
Anyway, we tried to enable caching on our test server two days ago and it
introduced a really strange bug: a redirect mysteriously appeared on our
live site's main admin dashboard and sent visitors to our test server. So
once we realized the bug was most likely related to a caching system we
didn't know we had, we disabled caching. As it turned out, when we enabled
caching on our test server, it used the same ElastiCache server our live
server was using (because test was a clone of live). We disabled caching
by removing/renaming the object-cache.php file.
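
For reference, the kind of separation I think we were missing between test and live looks roughly like the fragment below. This is only a sketch: it assumes our object-cache.php drop-in prefixes its keys with the WP_CACHE_KEY_SALT constant (several memcached drop-ins do, but I have not confirmed ours does), and the salt value is a made-up placeholder.

```php
// wp-config.php on the TEST server only.
// Assumption: the object-cache.php drop-in prepends WP_CACHE_KEY_SALT to
// every cache key, so test and live stop reading each other's entries.
// 'test-clone:' is a hypothetical value; any string unique to this box works.
define( 'WP_CACHE_KEY_SALT', 'test-clone:' );
```

Pointing the test server at its own cache endpoint would be the cleaner separation; the salt only keeps the keys apart on a shared one.
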
Disabling it solved our redirect issue, but suddenly, many (not all) of our
5,000 users could no longer log in to their individual sites. For some
reason, the password values stored in our database were not working for a
good percentage of users, forcing them to reset their passwords instead.
Obviously, this is huge with 5,000 users in the mix. So we re-enabled
caching on our live instance and decided to fix our cached redirect with WP
configuration changes instead (we added define('RELOCATE',true); to the
config to override the redirection to our test server).
One of the things we noticed with memcache was that it kept updating our
wp_options table with the domain for the test server in place of our live
one. In fact, it's still doing it: whenever I run a query to find the
string for the test domain and update it to the live domain, the caching
changes it back within a few minutes. Scary. But it looks like our
configuration change forces an override for now. The really concerning
thing about all this is that memcache seems to be drawing from its own
key:value pairs for the user passwords instead of directly from the
database. I mean, with caching enabled, the users can get in. Without it,
many users are forced to reset their passwords.
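
One check that might confirm or rule this out is to compare the hash held in the cache with the one in the users table for an affected account. The snippet below is a rough sketch, not something we have run in production: the function name and the user ID 123 are placeholders, and it assumes the drop-in stores user rows in the standard 'users' cache group the way WordPress core does.

```php
<?php
// Run inside WordPress, e.g. with WP-CLI: wp eval-file compare-user-cache.php
// Hypothetical helper: compares the password hash held in the object cache
// with the hash stored in the users table for one user ID.
function compare_cached_user_hash( $user_id ) {
    global $wpdb;

    // WordPress core caches user rows in the 'users' group, keyed by user ID.
    $cached = wp_cache_get( $user_id, 'users' );

    // Read the hash straight from MySQL, bypassing the object cache.
    $db_hash = $wpdb->get_var(
        $wpdb->prepare( "SELECT user_pass FROM {$wpdb->users} WHERE ID = %d", $user_id )
    );

    if ( ! is_object( $cached ) || ! isset( $cached->user_pass ) ) {
        return 'not cached';
    }

    return hash_equals( (string) $db_hash, (string) $cached->user_pass )
        ? 'match'
        : 'MISMATCH';
}

echo compare_cached_user_hash( 123 ), "\n"; // 123 is a placeholder user ID
```

A MISMATCH for users who can only log in with caching enabled would suggest the cache is holding hashes that never made it into (or were later overwritten in) the database.
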
Does anyone have any ideas on how to understand what's going on with
memcache in this case, and how to fix it so the database gets written to
appropriately and password info isn't just being held in the cache? To my
thinking, it's a ticking time bomb. All it would take is one flush_all
command to make life very, very painful for most of my users.
We are on Nginx with MySQL on the RDS.