22.09.2021 22:19, Vladimir Sementsov-Ogievskiy wrote:
22.09.2021 19:05, Richard Henderson wrote:
On 9/21/21 3:20 AM, Vladimir Sementsov-Ogievskiy wrote:
The following changes since commit 326ff8dd09556fc2e257196c49f35009700794ac:
Merge remote-tracking branch 'remotes/jasowang/tags/net-pull-request' into
staging (2021-09-20 16:17:05 +0100)
are available in the Git repository at:
https://src.openvz.org/scm/~vsementsov/qemu.git tags/pull-jobs-2021-09-21
for you to fetch changes up to c9489c04319cac75c76af8fc27c254f46e10214c:
iotests: Add mirror-ready-cancel-error test (2021-09-21 11:56:11 +0300)
----------------------------------------------------------------
mirror: Handle errors after READY cancel
----------------------------------------------------------------
Hanna Reitz (12):
job: Context changes in job_completed_txn_abort()
mirror: Keep s->synced on error
mirror: Drop s->synced
job: Force-cancel jobs in a failed transaction
job: @force parameter for job_cancel_sync()
jobs: Give Job.force_cancel more meaning
job: Add job_cancel_requested()
mirror: Use job_is_cancelled()
mirror: Check job_is_cancelled() earlier
mirror: Stop active mirroring after force-cancel
mirror: Do not clear .cancelled
iotests: Add mirror-ready-cancel-error test
This fails testing with errors like so:
Running test test-replication
test-replication: ../job.c:186: job_state_transition: Assertion
`JobSTT[s0][s1]' failed.
ERROR test-replication - too few tests run (expected 13, got 8)
make: *** [Makefile.mtest:816: run-test-100] Error 1
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 1
https://gitlab.com/qemu-project/qemu/-/pipelines/375324015/failures
Interesting :(
I've reproduced, starting test-replication in several parallel loops. (it
doesn't reproduce for me if just start in one loop). So, that's some racy bug..
Hmm, and seems it doesn't reproduce so simple on master. I'll try to bisect the
series tomorrow.
====
(gdb) bt
#0 0x00007f034a3d09d5 in raise () from /lib64/libc.so.6
#1 0x00007f034a3b9954 in abort () from /lib64/libc.so.6
#2 0x00007f034a3b9789 in __assert_fail_base.cold () from /lib64/libc.so.6
#3 0x00007f034a3c9026 in __assert_fail () from /lib64/libc.so.6
#4 0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020,
s1=JOB_STATUS_CONCLUDED) at ../job.c:186
#5 0x000055d3b503e7c2 in job_conclude (job=0x55d3b5e67020) at ../job.c:652
#6 0x000055d3b503eaa1 in job_finalize_single (job=0x55d3b5e67020) at
../job.c:722
#7 0x000055d3b503ecd1 in job_completed_txn_abort (job=0x55d3b5e67020) at
../job.c:801
#8 0x000055d3b503f2ea in job_cancel (job=0x55d3b5e67020, force=false) at
../job.c:973
#9 0x000055d3b503f360 in job_cancel_err (job=0x55d3b5e67020,
errp=0x7fffcc997a80) at ../job.c:992
#10 0x000055d3b503f576 in job_finish_sync (job=0x55d3b5e67020, finish=0x55d3b503f33f
<job_cancel_err>, errp=0x0) at ../job.c:1054
#11 0x000055d3b503f3d0 in job_cancel_sync (job=0x55d3b5e67020, force=false) at
../job.c:1008
#12 0x000055d3b4ff14a3 in replication_close (bs=0x55d3b5e6ef80) at
../block/replication.c:152
#13 0x000055d3b50277fc in bdrv_close (bs=0x55d3b5e6ef80) at ../block.c:4677
#14 0x000055d3b50286cf in bdrv_delete (bs=0x55d3b5e6ef80) at ../block.c:5100
#15 0x000055d3b502ae3a in bdrv_unref (bs=0x55d3b5e6ef80) at ../block.c:6495
#16 0x000055d3b5023a38 in bdrv_root_unref_child (child=0x55d3b5e4c690) at
../block.c:3010
#17 0x000055d3b5047998 in blk_remove_bs (blk=0x55d3b5e73b40) at
../block/block-backend.c:845
#18 0x000055d3b5046e38 in blk_delete (blk=0x55d3b5e73b40) at
../block/block-backend.c:461
#19 0x000055d3b50470dc in blk_unref (blk=0x55d3b5e73b40) at
../block/block-backend.c:516
#20 0x000055d3b4fdb20a in teardown_secondary () at
../tests/unit/test-replication.c:367
#21 0x000055d3b4fdb632 in test_secondary_continuous_replication () at
../tests/unit/test-replication.c:504
#22 0x00007f034b26979e in g_test_run_suite_internal () from
/lib64/libglib-2.0.so.0
#23 0x00007f034b26959b in g_test_run_suite_internal () from
/lib64/libglib-2.0.so.0
#24 0x00007f034b26959b in g_test_run_suite_internal () from
/lib64/libglib-2.0.so.0
#25 0x00007f034b269c8a in g_test_run_suite () from /lib64/libglib-2.0.so.0
#26 0x00007f034b269ca5 in g_test_run () from /lib64/libglib-2.0.so.0
#27 0x000055d3b4fdb9c0 in main (argc=1, argv=0x7fffcc998138) at
../tests/unit/test-replication.c:613
(gdb) fr 4
#4 0x000055d3b503d670 in job_state_transition (job=0x55d3b5e67020,
s1=JOB_STATUS_CONCLUDED) at ../job.c:186
186 assert(JobSTT[s0][s1]);
(gdb) list
181 JobStatus s0 = job->status;
182 assert(s1 >= 0 && s1 < JOB_STATUS__MAX);
183 trace_job_state_transition(job, job->ret,
184 JobSTT[s0][s1] ? "allowed" :
"disallowed",
185 JobStatus_str(s0), JobStatus_str(s1));
186 assert(JobSTT[s0][s1]);
187 job->status = s1;
188
189 if (!job_is_internal(job) && s1 != s0) {
190 qapi_event_send_job_status_change(job->id, job->status);
(gdb) p s0
$1 = JOB_STATUS_NULL
(gdb) p s1
$2 = JOB_STATUS_CONCLUDED
bisect points to "job: Add job_cancel_requested()"
And "bisecting" within this commit shows that the following helps:
diff --git a/job.c b/job.c
index be878ca5fc..bb52a1b58f 100644
--- a/job.c
+++ b/job.c
@@ -655,7 +655,7 @@ static void job_conclude(Job *job)
static void job_update_rc(Job *job)
{
- if (!job->ret && job_is_cancelled(job)) {
+ if (!job->ret && job_cancel_requested(job)) {
job->ret = -ECANCELED;
}
if (job->ret) {
- this returns job_update_rc to pre-patch behavior.
But why, I don't know:) More investigation is needed. probably replication code
is doing something wrong..
--
Best regards,
Vladimir