Hi! >>> Any suggestions for how this is supposed to work? Or is it just broken >>> by design, but saved because normally slave threads do not need to >>> access SHOW STATUS or system variables?
<cut> > Hm. So I went through all of the code that references LOCK_active_mi to try > and understand this. It seems there are two main uses: > > 1. As you say, to serialise admin commands. I think the list is: end_slave() > (server shutdown); CHANGE MASTER; START/STOP SLAVE; START/STOP ALL SLAVES; > RESET SLAVE; FLUSH SLAVE. > > 2. To protect access to master_info_index, eg. to prevent a Master_info from > disappearing while it is accessed. When reading about this, I think it would be better to have a counter in Master_info if someone is using it and only delete if if the counter is zero. Would be trivial to add an increment of the counter in get_master_info(). The counter could either be protected by the LOCK_active_mi or one of the mutexes in Master_info like data_lock. > There is a comment in rpl_rli.h that LOCK_active_mi protects > Relay_log_info::inited, but I did not understand it - maybe it is wrong? Setting inited is indirectly protected by LOCK_active_mi, as init_all_master_info() which calls init_master_info(), is protected by LOCK_active_mi. Looking at the code, I didn't however see any need to protect inited as this is an internal flag that is always 1 after start. It's main usage is to avoid some re-initialization of relay logs on retries. > So the idea would be to make a new lock for (1), and keep LOCK_active_mi for > (2). > > Actually, using a lock for (1) seems a bad idea - STOP SLAVE can take a long > time to run, and a mutex makes it unkillable. A condition variable might be > better (to make STOP SLAVE killable). But I guess that is a separate issue, > it would not affect the possibility of deadlocks. Agree that a condition variable would be better, but not critical as one has already a problem if stop_slave doesn't work. > Both LOCK_active_mi and the new LOCK_serialize_replication_admin_commands > must by themselves prevent master_info_index from having elements added or > removed. This is needed for start_all_slaves() and stop_all_slaves() to work > correctly, I think. Why both? Isn't it enough to first take LOCK_serialize_replication_admin_commands and then take LOCK_active_mi if one needs to access master_info_index ? > Something will need to be done for remove_master_info(). Currently, it also > tries to stop slave threads (from within free_key_master_info()) while > holding LOCK_active_mi. It seems that threads should be stopped before > attempting to remove their mi from master_info_index, probably? To do that we would need to: - Add a flag in Master_info that it's not usable anymore and change this flag only under LOCK_active_mi. get_master_info() could wait (on a condition) if this flag is set. - Add a counter that Master_info is in use. - Add a function to release Master_info. - Call terminate_slave_threads() outside of free_key_master_info() > Actually, there would still be the deadlock problem after introducing the > new lock, because it is possible for a slave thread to run CHANGE MASTER! > (and I think START SLAVE / RESET SLAVE as well). But that is probably a bug, > does not seem good that this is possible. I guess these need a check to > throw an ER_LOCK_OR_ACTIVE_TRANSACTION error, like STOP SLAVE. Wouldn't the new lock LOCK_serialize_replication_admin_commands fix that? > I think I also found another lock problem while reading the code, for > MASTER_POS_WAIT() og SHOW BINLOG EVENTS. They both grab a Master_info under > LOCK_active_mi, but then proceed to access it after releasing the lock. I do > not see anything that prevents the Master_info to disappear under their feet > if RESET SLAVE ALL is run in another thread? But again, that's a separate > issue, I suppose. Probably there are some other tricky deadlock issues > lurking here. I checked the function mysql_show_binlog_events() but could not find any access to Master_info after mysql_mutex_unlock(&LOCK_active_mi) was released. > I don't know. It seems splitting into a new > LOCK_serialize_replication_admin_commands could solve the deadlock issue, > maybe, if the other problems mentioned above are addressed. It seems very > hard to convince oneself that this would not introduce new problems > somewhere, the locking in replication is really complex and feels very > fragile :-/ It doesn't really feel like something one would want to do in > 10.1 (certainly not 10.0), but maybe 10.2? > > It's kind of a very edge-case problem (and seems to have been there for a > _long_ time). Still, hanging the mysqld process hard, requiring kill -9, is > also not nice... Adding two new flags, one if master_info is in use and one if it's unusable, shouldn't be that hard to make reasonable safe in 10.1 I am now back in Finland and working. Feel free to call me to discuss this any time. Regards, Monty _______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp