On Tuesday, October 22, 2002, at 07:49  AM, Matt Simerson wrote:

Apologies, I meant to click "SAVE" instead of send as I left work yesterday. Anyway, appended to this email is "The Rest of The Story".

On Monday, October 21, 2002, at 06:24 PM, Matt Simerson wrote:

OK, this will be fun to explain.

I have a large vpopmail install that I'm running this on so the system is quite busy. I recently converted this system from .cdb to MySQL authentication. In so doing, we've happened upon a few anomalies, the strangest one being that qmailadmin functions only worked intermittently.

We use MySQL replication and have a MySQL master running on a quad Xeon that is more than adequate hardware wise. We replicate to each mail server via MySQL replication. Vpopmail is then configured to write updates to the master and read from the slave on localhost. This works quite well for vpopmail.

So, upon noticing that qmailadmin wasn't working reliably, we started poking around and saw that every time we made a change in qmailadmin, we'd get two updates written to the MySQL master (as observed in binlog). The first would have all the values we wanted, the second would end up resetting those values to exactly what they were before the attempted change was made.

So, a little visit to the qmailadmin source code revealed the culprit in user.c.
If you read through user.c, you'll find a subroutine "modusergo". In there, is the guts of what happens when you click on "modify user" and then "submit". Reading through that, it appears that if we get a new password, we make a call to vpasswd. This function, I would guess is against libvpopmail.a and connects to the MySQL update server and updates the password, in addition to running crypt() against the supplied password. At this point, all is well.

However, in the very next function, the problem becomes manifest. We check for the value of gecos and if it exists (which it always does because it's value is in the HTML form) then it first makes a call to vauth_getpw. Testing has revealed that as expected, it does the database read from the MySQL read server on localhost. Unfortunately, it's making the query a tiny fraction of a second after writing the new password into MySQL update server. I'm sure Bill and Ken both know the problem now.

If you weren't using replication, this wouldn't be a problem (and isn't on my test servers) but what ends up happening is one SQL query arrives at the update server and sets the values that you changed appropriately. The next query arrives right behind it and resets the values to exactly what they were before the first query updated them.

So, the bandaid I currently applied is putting a 5 second sleep in there so that the first change gets a chance to replicate before we run vauth_getpw. However, this is a distinctly BAD way to fix it. There's no way to guarantee that a change made to the MySQL master has propagated to the slave. In 5 seconds it's a 99% probability but I'd argue that isn't good enough.

I can only think of one guaranteed way to fix it. That is to make sure that when qmailadmin does it's vauth_getpw call, it reads from the MySQL master. This is the only guaranteed way to assure that a successful database update written a fraction of a second before gets read back.

This could be a even bigger problem for people that don't manage their MySQL replication well. Imagine having a farm of replicated servers where one is out of sync and the admins don't notice. A (l)user hits the server that's database isn't replicated and see's their old gecos field that they updated last week. The update it again and it resets their password back to their old one which they had updated the week before. That's not good manners.

So, anyone else have suggestions on the best way to fix this?
After only getting the first half of the story yesterday, I duplicated your process to see if I could figure it out. The strange thing to me was that there are I think 3 select queries and 2 update queries that occur before modusergo is finished. This seems excessive. There really should just be one select, and one update. All the error checking and modifications to the vpw structure should happen in qmailadmin, with a single update at the end. I think that refining this will be the best solution to the problem. However, the following select during moduser (called by the result page after modifying the user) might still get old data. I'm not sure how long replication takes. Did you experiment with sleeps less than 5 seconds? I vaguely recall someone else having to put a sleep in either qmailadmin or vqadmin for this same reason. Maybe another solution would be to refresh back to the main menu, or user list after modifying. That would force a delay before moduser() does the next select for the result page.

I'll look at it today and see if I can improve it.

Regards,

Bill Shupp




Reply via email to