I can't reproduce this, adding the bug reporter on CC.  More below ...

On Wed, May 08, 2024 at 10:16:55AM -0500, Eric Blake wrote:
> On Wed, May 08, 2024 at 09:16:19AM +0100, Richard W.M. Jones wrote:
> > Hi Eric,
> >
> > ----- Forwarded message from not...@aur.archlinux.org -----
> >
> > solsticedhiver [1] added the following comment to nbdkit [2]:
> >
> > There is still one test failing now: test-blocksize-sharding.sh.  I
> > don't know if it is expected or not.
> > The full test-suite.log: https://0x0.st/X8TI.log
> >
> > [1] https://aur.archlinux.org/account/solsticedhiver/
> > [2] https://aur.archlinux.org/pkgbase/nbdkit/
> >
> > ----- End forwarded message -----
> >
> > I've attached the fragment of the failure log that they sent.
> > However, I don't understand the test well enough to see where the
> > error happens.
> >
> > Additionally, I think there's a problem with the various 'print'
> > statements in the Python code.  They probably need to flush stdout,
> > otherwise they don't appear coherent with the error messages.
> > Something like this throughout:
> >
> > -print("sanity check")
> > +print("sanity check", flush=True)
>
> That sounds like a useful change no matter what.  It may also be
> worth adding some more timestamps into the sanity check portion, to
> make sure that the plugin is responding as fast as we expect, and
> not under heavy load.
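For what it's worth, the kind of flushed, timestamped progress print being
suggested would look something like this.  (The `progress` helper and its
output format here are just illustrative, not what the test currently uses.)

```python
import time

start_t = time.time()

def progress(msg):
    # Prefix each message with the elapsed time, and flush stdout
    # immediately so that the output interleaves coherently with the
    # libnbd debug messages written to stderr.
    print("%.3f: %s" % (time.time() - start_t, msg), flush=True)

progress("sanity check")
```

Without flush=True, stdout is block-buffered when redirected to a log file,
so prints can appear long after the debug messages they relate to.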
> > # First pass: check that two aligned operations work in parallel
> > # Total time should be closer to 2 seconds, rather than 4 if serialized
> > print("sanity check")
> > ba1 = bytearray(b"1"*16)
> > ba2 = bytearray(b"2"*16)
> > buf1 = nbd.Buffer.from_bytearray(ba1)
> > buf2 = nbd.Buffer.from_bytearray(ba2)
> > touch(witness)
> > start_t = time.time()
> > h.aio_pwrite(buf1, 0)
> > h.aio_pwrite(buf2, 0)
> >
> > while h.aio_in_flight() > 0:
> >     h.poll(-1)
> > end_t = time.time()
> > os.unlink(witness)
> >
> > out = h.pread(16,0)
> > print(out)
> > t = end_t - start_t
> > print(t)
> > assert out in [b"1"*16, b"2"*16]
> > assert t >= 2 and t <= 3
>
> Based on the later message, it looks like t is higher than expected,
> and that this is the assertion that is failing...
>
> > libnbd: debug: nbdsh: nbd_connect_uri: leave: ret=0
> > libnbd: debug: nbdsh: nbd_aio_pwrite: enter: buf="
> > 0000: 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 |1111111111111111|
> > " count=16 offset=0 completion=NULL flags=0x0
> > libnbd: debug: nbdsh: nbd_aio_pwrite: leave: ret=1
> > libnbd: debug: nbdsh: nbd_aio_pwrite: enter: buf="
> > 0000: 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 |2222222222222222|
> > " count=16 offset=0 completion=NULL flags=0x0
> > libnbd: debug: nbdsh: nbd_aio_pwrite: leave: ret=2
> > libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> > libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=2
>
> We definitely kicked off two parallel aio_pwrite commands...
> > libnbd: debug: nbdsh: nbd_poll: enter: timeout=-1
> > libnbd: debug: nbdsh: nbd_poll: poll start: events=1
> > libnbd: debug: nbdsh: nbd_poll: poll end: r=1 revents=1
> > libnbd: debug: nbdsh: nbd_poll: leave: ret=1
> > libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> > libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=1
> > libnbd: debug: nbdsh: nbd_poll: enter: timeout=-1
> > libnbd: debug: nbdsh: nbd_poll: poll start: events=1
> > libnbd: debug: nbdsh: nbd_poll: poll end: r=1 revents=1
> > libnbd: debug: nbdsh: nbd_poll: leave: ret=1
> > libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> > libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=0
>
> ...and we were able to poll until both completed, but lack timestamps
> to see how long we polled each time...
>
> > libnbd: debug: nbdsh: nbd_pread: enter: buf=<buf> count=16 offset=0
> > flags=0x0
> > libnbd: debug: nbdsh: nbd_pread: poll start: events=1
> > libnbd: debug: nbdsh: nbd_pread: poll end: r=1 revents=1
> > libnbd: debug: nbdsh: nbd_pread: leave: ret=0
> > Traceback (most recent call last):
> >   File "<frozen runpy>", line 198, in _run_module_as_main
> >   File "<frozen runpy>", line 88, in _run_code
> >   File "/usr/lib/python3.12/site-packages/nbd.py", line 3986, in <module>
> >     nbdsh.shell()
> >   File "/usr/lib/python3.12/site-packages/nbdsh.py", line 149, in shell
> >     do_snippet[act](arg)
> >   File "/usr/lib/python3.12/site-packages/nbdsh.py", line 140, in <lambda>
> >     "command": lambda arg: exec(arg, d, d),
> >                            ^^^^^^^^^^^^^^^
> >   File "<string>", line 32, in <module>
> > AssertionError
> > sanity check
> > bytearray(b'2222222222222222')
> > 5.224142789840698
>
> ...we then checked the contents (it looks like the write with contents
> "2"*16 landed last), and overall we spent 5.2 seconds doing all this
> work, when we expected between 2 and 3 seconds.
> Given that the server is set up with the delay filter adding 2
> seconds before each pwrite, it seems like we would expect > 2 seconds
> no matter what, but that we would be > 4 seconds only if the two
> writes are serialized or else heavy load gets in the way of fast
> execution (we should have < 4 seconds if the two writes are done in
> parallel).
>
> The assert that t <= 3 is probably stricter than necessary; checking
> whether t < 4 is going to add a bit more leeway.
>
> So how about re-running the test with these enhancements, to see if
> we can learn more about the situation?
>
> diff --git i/tests/test-blocksize-sharding.sh w/tests/test-blocksize-sharding.sh
> index 7e926ad1..3711c49c 100755
> --- i/tests/test-blocksize-sharding.sh
> +++ w/tests/test-blocksize-sharding.sh
> @@ -80,9 +80,13 @@ witness = os.getenv("witness")
>  def touch(path):
>      open(path, "a").close()
>
> +def progress(prefix, start):
> +    t = time.time() - start
> +    print("%s: %g" % (prefix, t), flush=True)
> +
>  # First pass: check that two aligned operations work in parallel
>  # Total time should be closer to 2 seconds, rather than 4 if serialized
> -print("sanity check")
> +print("sanity check", flush=True)
>  ba1 = bytearray(b"1"*16)
>  ba2 = bytearray(b"2"*16)
>  buf1 = nbd.Buffer.from_bytearray(ba1)
> @@ -91,21 +95,24 @@ touch(witness)
>  start_t = time.time()
>  h.aio_pwrite(buf1, 0)
>  h.aio_pwrite(buf2, 0)
> +progress("after writes", start_t)
>
>  while h.aio_in_flight() > 0:
>      h.poll(-1)
> +    progress("after poll", start_t)
>  end_t = time.time()
>  os.unlink(witness)
>
>  out = h.pread(16,0)
> -print(out)
> +progress("after read", start_t)
> +print(out, flush=True)
>  t = end_t - start_t
> -print(t)
> +print("elapsed write time: %g" % t, flush=True)
>  assert out in [b"1"*16, b"2"*16]
>  assert t >= 2 and t <= 3
>
>  # Next pass: try to kick off unaligned first
> -print("unaligned first")
> +print("unaligned first", flush=True)
>  h.zero(16, 0)
>  ba3 = bytearray(b"3"*12)
>  ba4 = bytearray(b"4"*16)
> @@ -122,14 +129,14 @@ end_t = time.time()
>  os.unlink(witness)
>
>  out = h.pread(16,0)
> -print(out)
> +print(out, flush=True)
>  t = end_t - start_t
> -print(t)
> +print(t, flush=True)
>  assert out in [b"4"*4 + b"3"*12, b"4"*16]
>  assert t >= 8
>
>  # Next pass: try to kick off aligned first
> -print("aligned first")
> +print("aligned first", flush=True)
>  ba5 = bytearray(b"5"*16)
>  ba6 = bytearray(b"6"*12)
>  buf5 = nbd.Buffer.from_bytearray(ba5)
> @@ -146,9 +153,9 @@ end_t = time.time()
>  os.unlink(witness)
>
>  out = h.pread(16,0)
> -print(out)
> +print(out, flush=True)
>  t = end_t - start_t
> -print(t)
> +print(t, flush=True)
>  assert out in [b"5"*4 + b"6"*12, b"5"*16]
>  assert t >= 8
>  '
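To make the timing argument concrete, the parallel-vs-serialized
distinction the test relies on can be shown with a standalone toy
simulation.  This does not use nbdkit or libnbd at all: plain threads
stand in for the two in-flight writes, and the delay filter's 2 second
per-write delay is scaled down to 0.2 seconds so it runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2   # stand-in for the delay filter's 2 second per-write delay

def delayed_write():
    # Simulate a pwrite that the delay filter holds up for DELAY seconds.
    time.sleep(DELAY)

# Serialized: the second write cannot start until the first finishes,
# so the total elapsed time is roughly 2 * DELAY.
start = time.time()
delayed_write()
delayed_write()
serial_t = time.time() - start

# Parallel: both writes sleep at the same time, so the total elapsed
# time is roughly DELAY.
start = time.time()
with ThreadPoolExecutor(max_workers=2) as ex:
    futures = [ex.submit(delayed_write) for _ in range(2)]
    for f in futures:
        f.result()
parallel_t = time.time() - start

print("serialized: %.2fs" % serial_t)
print("parallel:   %.2fs" % parallel_t)
```

This is why the observed 5.2 seconds is suspicious: it is above even
the serialized bound of ~4 seconds, suggesting load or scheduling
delays on top of serialization.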
I think we should add this patch upstream anyway, especially the
flushes.  More debugging is better in case we can't reproduce this now
and it turns up again in future.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

_______________________________________________
Libguestfs mailing list -- guestfs@lists.libguestfs.org
To unsubscribe send an email to guestfs-le...@lists.libguestfs.org