Sergei Golubchik <s...@mariadb.org> writes:

>> Note that at this point, the group commit leader thread _is_
>> effectively acting as if it was each of the participating threads, the
>> other threads are blocked from running during commit_ordered.
>
> I'm not sure it's good enough. Practically it might work, but if one
> thread has modified some variable under a mutex, you cannot simply read
> it from another thread without a mutex, there's no guarantee that
> another thread will see the new value. You'll need an appropriate memory
> barrier for that.

Sure, I understand that, maybe I should have explained this code in more
detail.

This is the binlog group commit. We have N transactions doing commit at the
same time: T1, T2, ..., TN.

After writing to the binlog, we must run commit_ordered(T_i) for each
transaction in order, i=1, 2, ..., N. The wait_for_prior_commit mechanism
exists for this. Each thread will do:

  wait_for_prior_commit(T_{i-1})
  commit_ordered(T_i)
  wakeup_subsequent_commits()

But this is slow, since it requires thread scheduling between N threads,
in-order:

  https://knielsen-hq.org/w/benchmarking-thread-scheduling-in-group-commit/

Instead, what we do is that T_1 will call commit_ordered() for _all_ the N
transactions. Meanwhile the other threads are waiting to be signalled.

Note that LOCK_wait_commit protects the hand-over from thread T_i to T_1 and
back, so we _do_ have proper mutex protection. This mutex provides the
necessary memory barrier between T_i modifying data and T_1 reading it, or
T_1 modifying and T_i reading.

Does that make it clear?

>> If there is some reason, then the correct fix could be to set
>> current_thd for the duration of run_commit_ordered, which will satisfy
>> the assertion and presumably the actual bug?
>
> Setting current_thd is cheating, it will trick the assert, but won't fix
> the underlying problem as outlined above.

Nono, it's not cheating, it's in fact setting the correct value. While
running commit_ordered(T_i), the current THD _is_ the THD for T_i, not the
THD for T_1, which is unrelated to transaction T_i.

It's already done for a similar case a bit earlier in the code, where T_1 is
writing the binlog for each of the other threads:

  set_current_thd(current->thd);

It's similar to the pool-of-threads. If a THD goes idle for some time, and
then is resumed on a different thread in the thread pool, this does not mean
that every access to the data must now be done under LOCK_thd_data. Rather,
now the THD is owned by another thread, so this thread sets current_thd
accordingly.

The group commit code is similar, the THD is (temporarily) moved to be under
a different thread.

Does that make sense?

> Monty asserted many times that taking an uncontended mutex is
> cheap. Did you think it's expensive or did you benchmark that?

I did, actually! (in the past). Taking a mutex needs to reserve the
associated cache line, which is not free. "Expensive" is relative, of
course.

But here, LOCK_thd_data serves no purpose, and it's done under a very
critical global lock LOCK_commit_ordered. So surely we should not be
taking/releasing N extra mutexes here without any reason to do so?

For the same reason, set_current_thd() was not done in this code, to avoid
it if it serves no purpose. This is documented on the commit_ordered method
in sql/handler.h:

  "Note that commit_ordered() can be called from a different thread than the
   one handling the transaction! So it can not do anything that depends on
   thread local storage"

That is why I asked if you knew what the real problem is that triggered this
assertion. There is no reason to add code in a performance-critical place if
it serves no purpose. Worse, it might hide the real bug if some code was
added to commit_ordered that violates the above requirement.

I think it is fine to add set_current_thd() *if it is actually needed*.
But we should understand first *why* it is needed, other wise we may hide
the real bug.

 - Kristian.
_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org

Reply via email to