Re: [Maria-developers] Review of MDEV-4506, parallel replication, part 1

Michael Widenius Sun, 22 Sep 2013 07:20:22 -0700

Hi!

>>>>> "Kristian" == Kristian Nielsen <[email protected]> writes:

Kristian> Monty,
Kristian> Thanks for the comments so far.

Kristian> I have pushed fixes to 10.0-knielsen, and some specific replies are 
below.

Thanks. I will review that today.

Kristian>  - Kristian.

Kristian> Michael Widenius <[email protected]> writes:

>> + --slave-parallel-threads=# 
>> + If non-zero, number of threads to spawn to apply in
>> + parallel events on the slave that were group-committed on
>> + the master or were logged with GTID in different
>> + replication domains.
>> 
>> What happens if --slave-parallel-threads is 0 ?
>> Does this mean that there will be no parallel threads?
>> Is this the same as 1 ?
>> 
>> If same as 1, why not have the option to take values starting from 1 ?

Kristian> I think it is logical that 0 means disabled and >0 means enabled with 
that
Kristian> many threads. But I don't mind using 1 as disabled and needing >=2 for
Kristian> enabled. Though the ability to use a pool with just 1 thread might be 
useful
Kristian> for testing. It doesn't seem critical, maybe we can revisit this if 
time
Kristian> allows when the basic stuff is done. 

I come up with one case where 0 is better
- In multi-source 0 would mean that we have one thread per connection.

You can't get that behaviour if we change the meaning of the
slave-parallel-thread variable.

>> === modified file 
>> 'mysql-test/suite/perfschema/r/dml_setup_instruments.result'
>> --- mysql-test/suite/perfschema/r/dml_setup_instruments.result       
>> 2013-03-26 09:35:34 +0000
>> +++ mysql-test/suite/perfschema/r/dml_setup_instruments.result       
>> 2013-08-29 08:18:22 +0000
>> @@ -38,14 +38,14 @@ order by name limit 10;
>> NAME ENABLED TIMED
>> wait/synch/cond/sql/COND_flush_thread_cache  YES     YES
>> wait/synch/cond/sql/COND_manager     YES     YES
>> +wait/synch/cond/sql/COND_parallel_entry     YES     YES
>> +wait/synch/cond/sql/COND_prepare_ordered    YES     YES
>> wait/synch/cond/sql/COND_queue_state YES     YES
>> wait/synch/cond/sql/COND_rpl_status  YES     YES
>> +wait/synch/cond/sql/COND_rpl_thread YES     YES
>> +wait/synch/cond/sql/COND_rpl_thread_pool    YES     YES
>> wait/synch/cond/sql/COND_server_started      YES     YES
>> wait/synch/cond/sql/COND_thread_cache        YES     YES
>> -wait/synch/cond/sql/COND_thread_count       YES     YES
>> -wait/synch/cond/sql/Delayed_insert::cond    YES     YES
>> -wait/synch/cond/sql/Delayed_insert::cond_client     YES     YES
>> -wait/synch/cond/sql/Event_scheduler::COND_state     YES     YES
>> 
>> Any idea why the above was deleted?
>> Don't see anything in your patch that could affect that.

Kristian> I believe it does a SELECT with a LIMIT - so when new stuff is added 
before
Kristian> the limit, stuff below gets pushed out.

Kristian> But if you ask why the test is that way, I cannot answer. The purpose 
of some
Kristian> of those perfschema tests is somewhat of a mystery, they seem to 
serve little
Kristian> other purpose than add maintenance cost for result file update 
whenever
Kristian> something changes ...

ok, I missed the 'limit part'. I can only say 'stupid test' about this...

<cut>

>> @@ -6541,44 +6543,201 @@ MYSQL_BIN_LOG::write_transaction_to_binl
>> }
>> 
>> Could you please document in detail the arguments and return value
>> for the following function.

Kristian> I added the following comment:

Kristian> /*
Kristian>   Put a transaction that is ready to commit in the group commit queue.
Kristian>   The transaction is identified by the ENTRY object passed into this 
function.

Kristian>   To facilitate group commit for the binlog, we first queue up 
ourselves in
Kristian>   this function. Then later the first thread to enter the queue waits 
for
Kristian>   the LOCK_log mutex, and commits for everyone in the queue once it 
gets the
Kristian>   lock. Any other threads in the queue just wait for the first one to 
finish
Kristian>   the commit and wake them up. This way, all transactions in the 
queue get
Kristian>   committed in a single disk operation.

Kristian>   The return value of this function is TRUE if queued as the first 
entry in
Kristian>   the queue (meaning this is the leader), FALSE otherwise.

Kristian>   The main work in this function is when the commit in one 
transaction has
Kristian>   been marked to wait for the commit of another transaction to happen
Kristian>   first. This is used to support in-order parallel replication, where
Kristian>   transactions can execute out-of-order but need to be committed 
in-order with
Kristian>   how they happened on the master. The waiting of one commit on 
another needs
Kristian>   to be integrated with the group commit queue, to ensure that the 
waiting
Kristian>   transaction can participate in the same group commit as the 
waited-for
Kristian>   transaction.

Kristian>   So when we put a transaction in the queue, we check if there were 
other
Kristian>   transactions already prepared to commit but just waiting for the 
first one
Kristian>   to commit. If so, we add those to the queue as well, transitively 
for all
Kristian>   waiters.
Kristian> */
Kristian> bool
Kristian> MYSQL_BIN_LOG::queue_for_group_commit(group_commit_entry *entry)

Kristian> I also moved some code from the caller into this function, which 
clarifies
Kristian> things by keeping related code together. And I added more comments 
about what
Kristian> is going on with the different list pointers, and who is doing what.

Thanks. That will help a lot in making thins easier to understand.

<cut>

>> I did find it confusing that here you threat next_subsequent_commit as
>> a way to avoid recursion, but you use this list for other things in
>> register_wait_for_prior_commit().

Kristian> I added this comment:

Kristian>     We keep a list of all the waiters that need to be processed in 
`list',
Kristian>     linked through the next_subsequent_commit pointer. Initially this 
list
Kristian>     contains only the entry passed into this function.

Kristian>     We process entries in the list one by one. The element currently 
being
Kristian>     processed is pointed to by `cur`, and the element at the end of 
the list
Kristian>     is pointed to by `last` (we do not use NULL to terminate the 
list).

Kristian>     As we process an element, it is first added to the 
group_commit_queue.
Kristian>     Then any waiters for that element are added at the end of the 
list, to
Kristian>     be processed in subsequent iterations. This continues until the 
list
Kristian>     is exhausted, with all elements ever added eventually processed.

Kristian>     The end result is a breath-first traversal of the tree of waiters,
Kristian>     re-using the next_subsequent_commit pointers in place of extra 
stack
Kristian>     space in a recursive traversal.

Great. The one thing I am missing was:

The temporary list created in next_subsequent_commit is not
used by the caller or any other function.

>> +  /* Now we need to clear the wakeup_subsequent_commits_running flags. */
>> +  if (list)
>> +  {
>> +    for (;;)
>> +    {
>> +      if (list->wakeup_subsequent_commits_running)
>> +      {
>> +        mysql_mutex_lock(&list->LOCK_wait_commit);
>> +        list->wakeup_subsequent_commits_running= false;
>> +        mysql_mutex_unlock(&list->LOCK_wait_commit);
>> 
>> Why do we need the mutex above?
>> The only place where we test this variable, as far as I can find, is in
>> wait_for_commit::register_wait_for_prior_commit()

Kristian> Yes, it looks strange. Unfortunately I cannot remember the reason 
100% right
Kristian> now (I wrote this code 9 months ago). I remember that there _was_ a 
reason. I
Kristian> should have written a comment to explain this, shame on me!

Kristian> I think the code basically needs to do the same as
Kristian> wait_for_commit::wakeup_subsequent_commits2(), but integrated with 
the tree
Kristian> traversal.

Kristian> I agree this needs clarification and looks suspicious, but I wanted 
to get the
Kristian> rest of the comments sent to you today, I will try to re-visit this 
later (and
Kristian> the similar code in wakeup_subsequent_commits2()).

ok. Please take a look at this when you are ready with other things.
(Add at least a 'todo tag to this code).

<cut>

>> A much easier way to achive the wait is to have the caller lock
>> mysql_mutex_lock(&rpt->LOCK_rpl_thread) and then just release it
>> when it's time for this thread to continue.
>> - No need for signaling or checking if the thread got the signal
>> - No need to have a cond wait here
>> - No need to have a delay_start flag

Kristian> I consider this method simpler and clearer. But it is not critical 
code, feel
Kristian> free to change it if you like.

I will do this change, probably today.
It will make some of the code a lot simpler. Wait and see...

<cut>

>> Another question: do we really need rpt->stop ?
>> Isn't it enough with using the thd->killed for this ?

Kristian> I think there is a need to generally go through and check for proper 
handling
Kristian> of thd->killed, not just here but in all cases. I did not have time 
to do this
Kristian> yet, and I am in any case a bit unsure about exactly how kill 
semantics should
Kristian> work.

Kristian> Now that I think about it, shouldn't they remain different?

Kristian> I think thd->killed is set if user does an explicit KILL of the 
thread? This
Kristian> should then abort and fail the currently executing transaction. But 
it should
Kristian> not cause the thread to terminate, should it? That could leave us 
with no
Kristian> threads at all in the replication thread pool, which would cause 
replication
Kristian> to hang. Or should the thread exit and be re-spawned?

What should happen if you kill a replication thread is that
replication should stop for that master.

Kristian> This needs more thought, I think ... certainly something looks not 
right.

After looking at the full code, I think that the logical way things
should work is:

'stop' is to be used when you want to nicely take done replication.
This means that the current commit groups should be given time to
finish.

thd->killed should mean that we should stop ASAP.
- All not commited things should abort.

This is needed in a 'panic shutdown' (like out soon-out-of- power) or
when trying to kill the replication thread when one notices that
something went horribly wrong (like ALTER TABLE stopping replication).

>> === modified file 'sql/sql_class.cc'
>> --- sql/sql_class.cc 2013-06-06 15:51:28 +0000
>> +++ sql/sql_class.cc 2013-08-29 08:18:22 +0000
>> 
>> +void
>> +wait_for_commit::wakeup()
>> +{
>> +  /*
>> +    We signal each waiter on their own condition and mutex (rather than 
>> using
>> +    pthread_cond_broadcast() or something like that).
>> +
>> +    Otherwise we would need to somehow ensure that they were done
>> +    waking up before we could allow this THD to be destroyed, which would
>> +    be annoying and unnecessary.
>> +  */
>> +  mysql_mutex_lock(&LOCK_wait_commit);
>> +  waiting_for_commit= false;
>> +  mysql_cond_signal(&COND_wait_commit);
>> +  mysql_mutex_unlock(&LOCK_wait_commit);
>> +}
>> 
>> In this particular case it looks safe to move the cond signal out of
>> the mutex. I don't see how anyone could miss the signal.

Kristian> Probably. Generally I prefer to write it as above, to not have to 
make sure
Kristian> that moving it out is safe.

I agree that this is the way to do in general.
However for cases where we are waiting almost once per query, we need
to make things faster as this extra wakup will take notable resources!

>> +  The waiter needs to lock the waitee to delete itself from the list in
>> +  unregister_wait_for_prior_commit(). Thus wakeup_subsequent_commits() can 
>> not
>> +  hold its own lock while locking waiters, lest we deadlock.
>> 
>> lest ?

Kristian>     https://en.wiktionary.org/wiki/lest

Kristian> But I've changed it to use a more common English wording "as this 
could lead
Kristian> to deadlock".

Thanks. It's easier to read the code when you don't need to consult a
dictionary...

>> +  The reason for separate register and wait calls is that this allows to
>> +  register the wait early, at a point where the waited-for THD is known to
>> +  exist. And then the actual wait can be done much later, where the
>> +  waited-for THD may have been long gone. By registering early, the waitee
>> +  can signal before disappearing.
>> +*/
>> 
>> Instead of taking locks to ensure that THD will disapper, why not keep
>> all THD objects around until all transactions in a group has been
>> executed?

Kristian> I think it is not just about THD disappearing. IIRC, the more 
important
Kristian> problem is that the THD may have moved on to do something different. 
So even
Kristian> if the data in the other THD is not freed, it may still contain 
completely
Kristian> unrelated data (and thus wrong data, for the accessing thread).

Kristian> But this is actually something I consider a very important thing, and 
one that
Kristian> I spent quite some effort on and was quite pleased to get in the 
current
Kristian> state. If arbitrary THD can have pointers into arbitrary other THD, 
the
Kristian> life-time issues can quickly get very nasty, and it is important to 
handle
Kristian> this well to avoid having to understand the entire codebase to avoid 
subtle bugs.

ok. Thanks for the explanation.

Regards,
Monty

_______________________________________________
Mailing list: https://launchpad.net/~maria-developers
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~maria-developers
More help   : https://help.launchpad.net/ListHelp

Re: [Maria-developers] Review of MDEV-4506, parallel replication, part 1

Reply via email to