[MariaDB developers] Fwd/Re: Starting on review of XA patches in bb-10.6-MDEV-31949

Andrei via developers Mon, 05 Feb 2024 04:51:50 -0800

From: Andrei Elkin <andrei.el...@mariadb.com>
Subject: Re: Starting on review of XA patches in bb-10.6-MDEV-31949
To: Kristian Nielsen <kniel...@knielsen-hq.org>

Cc: developers@lists.mariadb.org, andrei elkin<andrei.el...@mariadb.com>,

  Michael Widenius <michael.widen...@gmail.com>
Date: Mon, 05 Feb 2024 14:11:01 +0200 (36 minutes, 34 seconds ago)
Organization: MariaDB


Kristian Nielsen <kniel...@knielsen-hq.org> writes:

Hi Andrei,

It's not an easy review of the XA patches in bb-10.6-MDEV-31949.
One thing that would have helped is a clear design document that listsall
the different concerns and points that need to be considered around
recovery, backup, replication, possible conflicts, etc. This could thenbereferenced by the reviewer to see if a given concern was considered,and if
so what the resolution is.


I am sorry for discomfort the lack of the above caused.
Unfortunately the IV recovery part has consumed more time than was
expected (which is at the final stage of my local testing).

But I'll try as best I can with what we
have.


If you'll need any online discussion to compensate for that I'd be glad
to do it.

As I'm trying to understand the different issues, I found making testcases
to check my understanding helped. I have pushed some to the branch
knielsen_mdev31949_review.


Thank you for streamlining it this way!

They seem to show some problems in the current
patch:

  mysql-test/suite/rpl/t/rpl_recovery_xa.test
- A trx is left in "XA PREPARED" state after crash recovery, eventhoughthe binlogged event group is incomplete (it should have beenrolled
     back).
   - A trx is committed after crash recovery, even though it is not
     binlogged (it should have been rolled back).
- An incomplete XA event group is output by mysqlbinlog with a"ROLLBACK"
     at the end which gives an error (should probably be XA ROLLBACK).

  mysql-test/suite/rpl/t/rpl_recovery_xa2.test
- An XA PREPAREd transaction is lost after recovery when it wasbinlogged
     in an earlier binlog file. (I think this might be the issue that's
     referenced as a ToDo with Xid_list_log_event in patch?)

  mysql-test/suite/mariabackup/slave_provision_xa.test
- An XA PREPARE does not update the binlog position inside InnoDB.So aslave provisioned with mariabackup from this starting position canfail
     with duplicate apply of XA PREPARE.
Some other initial remarks follow. Since there's a desire to progresswiththis quickly, I thought it would be useful to send these early, so youcanstart looking/discussions before I complete building an understandingof the
whole design and code picture.


Super!

diff --git a/storage/innobase/handler/ha_innodb.cc
b/storage/innobase/handler/ha_innodb.cc
index 17e512bf34e..a4cdc72e18c 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -17144,8 +17144,16 @@ innobase_commit_by_xid(
        }

        if (trx_t* trx = trx_get_trx_by_xid(xid)) {
+               THD *thd = current_thd;
+               if (thd)
+                       thd_binlog_pos(thd, &trx->mysql_log_file_name,
+                                      &trx->mysql_log_offset);

I guess this is necessary to get the correct binlog position storedinside

innodb to match the binlogged XA COMMIT event, right?


Correct.

But will it work if called during crash recovery?


Right. The unpushed part IV makes sure of that.

During crash recovery, we could actually update the position, but onlyifthis is trx is the last event group in the old binlog. Maybe that'sbetterhandled elsewhere (there are currently other cases where the binlogposition
in InnoDB can be wrong immediately after restart until the first new
transaction is committed).
+               if (trx->mysql_log_offset > 0)
+                       trx->flush_log_later = true;
But I'm guessing this is the motivation. If called from within binloggroupcommit, you want to disable fsync in InnoDB. But if called fromrecovery, orwhen binlog is disabled, this condition will be false, and fsync willbe
done.


Indeed.


If so, that's a _really_ obfuscated way to do it :-/


I remember it was a very quick decision for myself to miss this sort of
reflection.

[todo] I'll add comments of course.

It surely needs a very clear comment explaining what is going on.

And it is very fragile. What if thd_binlog_pos() is changed at somepoint,

to slightly change the conditions under which it returns 0 or not? That
could silently break this code.


So you're in this context

 1 innobase_commit_by_xid(
 2 /*===================*/
 3     handlerton*    hton,

4 XID* xid) /*!< in: X/Open XA transaction identification*/

 5 {
 6     DBUG_ASSERT(hton == innodb_hton_ptr);
 7
 8     DBUG_EXECUTE_IF("innobase_xa_fail",
 9             return XAER_RMFAIL;);
10
11     if (high_level_read_only) {
12         return(XAER_RMFAIL);
13     }
14
15     if (trx_t* trx = trx_get_trx_by_xid(xid)) {
16         THD *thd = current_thd;
17         if (thd)
18             thd_binlog_pos(thd, &trx->mysql_log_file_name,
19                        &trx->mysql_log_offset);
20         if (trx->mysql_log_offset > 0)
21             trx->flush_log_later = true;

and actually considering `trx->mysql_log_offset` as the return value?
... I could not grasp your way of thinking in this place.
Could you please expand on?

It would be much better if this could be done similar to how normalcommitsare done. We have the commit_ordered handlerton which does the fastpart of
commit-to-memory. And then the slow part which does the rest in commit
handlerton.

Notice that the *slow* part of the XA commit is the same as in thenormal

case:

trx_commit_complete_for_mysql
*innobase_commit*
commit_one_phase_2
ha_commit_one_phase

The fast parts of the two - the XA and normal - could be unified - into

  innobase_commit_ordered()

but I did not think (had time for) towards that.

So we could have commit_by_xid_ordered() and commit_by_xid(). Theproblem isthat commit_by_xid_ordered() needs to release the xid, so it becomesharderto get hold of the transaction (which the commit handlerton just findsfrom
the thd). Maybe commit_by_xid_ordered() would need to return some trx
handle.
What part of the commit is skipped by this in InnoDB? Is it _only_flushing
the log up to the LSN of the commit record, or are there more things?


So the xa's fast innobase_commit_by_xid() skips a "slow"
  trx_commit_complete_for_mysql()
and the same does the normal's innobase_commit_ordered().

How is another storage engine supposed to handle this? This also needsto be
documented somewhere, eg. in comments on the handlerton calls.
This also means that the --innodb-flush-log-at-trx-commit=3 is notsupportedfor XA COMMIT when it's done from a different session that the XAPREPARE.For example, with --innodb-flush-log-at-trx-commit=3 but--sync-binlog=0,
normal transactions are still durable in InnoDB. But external XA
transactions will not be. This could be solved if the commit_by_xid()was
split into a fast and slow part similar to normal commit.


[todo] Let me process that carefully later.

Probably you'll say that this works for now, and perhaps that's true.But atleast it really _really_ needs some detailed comments explaining whatisgoing on, and what should be kept in mind if modifying/improving thiscodein the future, otherwise it will be impossible for the next developerto
understand.


[todo] Agreed.

diff --git a/sql/log.cc b/sql/log.cc
index e54d4087d46..7e3dbeaef29 100644
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -9365,18 +9381,84 @@ TC_LOG::run_commit_ordered(THD *thd, bool all)
all ? thd->transaction->all.ha_list :thd->transaction->stmt.ha_list;
   mysql_mutex_assert_owner(&LOCK_commit_ordered);
-  for (; ha_info; ha_info= ha_info->next())
+
+  if (!ha_info)
Your patch puts a lot of XA/binlog logic into run_commit_ordered(). But
that's not appropriate. The run_commit_ordered() function has thepurposesolely to call the commit_ordered() handlerton in each engine, and it'salsocalled from TC_LOG_MMAP (that's why it's TC_LOG::, notMYSQL_BIN_LOG::).
Instead make a new function that handles the part of binlog commit that
needs to be done under LOCK_commit_ordered. Be sure to comment it thatthisruns under this highly contended mutex, and needs to do as little workas
possible.
This function can then call run_commit_ordered(). It can also be usedto
share a few bits of code that are now duplicated between the
opt_optimize_thread_scheduling and !opt_optimize_thread_schedulingcases,
like:

    ++num_commits;
    run_commit_ordered(...)
    entry->thd->wakeup_subsequent_commits()
    if (entry->queued_by_other) { ... }

(if it makes sense to do so in the actual code, only if makes things
simpler, that's easier to decide when working with the actual code).

There sesems to be 3 new cases:

 - XA PREPARE
 - XA COMMIT from same session as XA PREPARE
 - XA COMMIT of detached XA transaction
The logic to decide this was already done earlier, when the event groupwasqueued for group commit, eg. to decide which end_event to use. Forexamplethere is no need to call xid_cache_search() here, that can be doneearlier(and it's also wrong to call my_error() here, as we're in the wrongthread
context).
So the logic about what to do should be placed higher up in the callstack,ideally where we already know, eg. binlog_commit_by_xid(). And errorssuchas ER_XAER_NOTA can be thrown already there, and the decision on whatto doput consisely as a flag into the struct group_commit_entry, so thelogic
during binlog group commit becomes as simple as possible.
(It might even make sense to simply store a function pointer to themethodthat should be called to handle this; normal, attached XA COMMIT,detachedXA COMMIT, XA PREPARE. But again it's easier when actually working withthe
code to decide if this is simpler than a couple flags or case on enum
value).


[todo] Let me process the above suggestion block a bit later.

+    if (thd->rgi_slave && thd->rgi_slave->is_parallel_exec)
+    {
+      DBUG_ASSERT(thd->transaction->xid_state.is_dummy_XA());
+
+      auto xs= xid_cache_search(thd, xid);
+      if(!xs)
+      {
+        my_error(ER_XAER_NOTA, MYF(0));
+        return;
+      }
+      else
+      {
+        thd->transaction->xid_state.xid_cache_element= xs;
+      }
+      DBUG_ASSERT(thd->transaction->xid_state.is_recovered());
+    }
+

plugin_foreach(NULL, commit ? xacommit_handlerton :xarollback_handlerton,

                    MYSQL_STORAGE_ENGINE_PLUGIN, &xaop);
     xid_cache_delete(thd);


Why this check for thd->rgi_slave->is_parallel_exec here deep in binlog
group commit? That seems very wrong, and again no comment whatsoever
explaining what is going on, so again I have to try to guess...

I'm guessing that it's something with calling xid_cache_delete() whichneeds

to find thd->transaction->xid_state.xid_cache_element ?


Right.

But this could be done much better in a higher-level part of the codethat's

specific to xa and (parallel) replication, not here deep down in binlog
group commit code.

And why should this be needed only for parallel replication, not for
non-parellel replication or non-replication SQL? I don't understand.


It's a kind of optimization for the slave side, in that `xid` of XAP
(Prepare) gets released for a XAC (Commit) the soonest.



A couple remarks about the tests I pushed to knielsen_mdev31949_review:

I think the failures in rpl.rpl_recovery_xa I linked are from duplicateXIDin XA PREPARE ; XA COMMIT ; XA PREPARE. There comes an ambiguitybetweenwhether the transaction prepared in the engine at recovery is from thefirstXA PREPARE (so it should be committed) or the second (so it should berolled
back). This happens when the second XA PREPARE is in the engine but not
binlogged.
One way to fix could actually be to use the binlog position storedinsideInnoDB. If the XA COMMIT is before this binlog position, then we knowthefirst transaction became durable in InnoDB, so we know we have to rollbackthe current prepared transaction in the engine. Otherwise we shouldcommitit. I see the recovery code already uses binlog positions to decide thefate
of pending XA transactions in the engines, but I did not yet fully
understand what is going on there.
Another way could be to binlog an extra event before the second XAPREPARE
with the duplicate XID. This event would mean "This xid is now durable
committed in the engine", so we would know not to commit any preparedtrx inthe engine with the same xid. This event would only be needed in caseof XAPREPARE with a duplicate xid. This though requires the prior XA COMMITto
fsync inside InnoDB first, I think.


Thanks for this piece of analysis!
[todo] I am processing this a bit later.

Another test failure in rpl.slave_provision_xa I think happens becausetheXA PREPARE does not update the binlog position inside innodb. Maybe theway
to solve this problem is to have the slave be idempotent, and silently
ignore any duplicate XA PREPARE if the xid is already prepared?


[todo] I am processing this a bit later.

About the "ToDo" of using some Xid_list_log_event. I think this is away
during recovery to get the state of the XA PREPARE/COMMIT's in earlier
binlog files before the last binlog checkpoint, if I understood things
correctly?


True.

Another way to do this is to delay the binlog checkpoints until all XA
PREPARE in a binlog file have been XA COMMIT'ted (or XA ROLLBACK'ed),sothat they will all be scanned during recovery. I actually implementedthisin my branch knielsen_mdev32020. The advantage of this is that we mayanywayneed this to provide the XA PREPARE events to a slave set up from alogical
backup (like mysqldump), to avoid purging XA PREPARE events too early.
And
then it's a bit simpler to simply scan an extra file rather than bothscanand read Xid_list_log_event. But I think Xid_list_log_event could workfine
as well, just mentioning it for you to consider.


Xlle can't be generally avoided so we have to take conservatively first.

That's it for now, I will continue trying to understand what is goingon and
complete the review.


Thank you for so prompt piece of feedback!
I'll return with a reply within a work day to [todo] tagged.

Cheers,

Andrei
----------

_______________________________________________
developers mailing list -- developers@lists.mariadb.org
To unsubscribe send an email to developers-le...@lists.mariadb.org

[MariaDB developers] Fwd/Re: Starting on review of XA patches in bb-10.6-MDEV-31949

Reply via email to