I can't reproduce this, adding the bug reporter on CC.  More below ...

On Wed, May 08, 2024 at 10:16:55AM -0500, Eric Blake wrote:
> On Wed, May 08, 2024 at 09:16:19AM +0100, Richard W.M. Jones wrote:
> > Hi Eric,
> >
> > ----- Forwarded message from not...@aur.archlinux.org -----
> >
> > solsticedhiver [1] added the following comment to nbdkit [2]:
> >
> > There is still one test failing now: test-blocksize-sharding.sh.  I
> > don't know if it is expected or not.
> > The full test-suite.log: https://0x0.st/X8TI.log
> >
> > [1] https://aur.archlinux.org/account/solsticedhiver/
> > [2] https://aur.archlinux.org/pkgbase/nbdkit/
> >
> > ----- End forwarded message -----
> >
> > I've attached the fragment of the failure log that they sent.
> > However, I don't understand the test well enough to see where the
> > error happens.
> >
> > Additionally, I think there's a problem with the various 'print'
> > statements in the Python code.  They probably need to flush stdout,
> > otherwise they don't appear coherent with the error messages.
> > Something like this throughout:
> >
> > -print("sanity check")
> > +print("sanity check", flush=True)
>
> That sounds like a useful change no matter what.  It may also be
> worth adding some more timestamps into the sanity check portion, to
> make sure that the plugin is responding as fast as we expect, and
> not under heavy load.
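For what it's worth, the kind of flushed, timestamped progress print being
suggested would look something like this.  (The `progress` helper and its
output format here are just illustrative, not what the test currently uses.)

```python
import time

start_t = time.time()

def progress(msg):
    # Prefix each message with the elapsed time, and flush stdout
    # immediately so that the output interleaves coherently with the
    # libnbd debug messages written to stderr.
    print("%.3f: %s" % (time.time() - start_t, msg), flush=True)

progress("sanity check")
```

Without flush=True, stdout is block-buffered when redirected to a log file,
so prints can appear long after the debug messages they relate to.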
> > # First pass: check that two aligned operations work in parallel
> > # Total time should be closer to 2 seconds, rather than 4 if serialized
> > print("sanity check")
> > ba1 = bytearray(b"1"*16)
> > ba2 = bytearray(b"2"*16)
> > buf1 = nbd.Buffer.from_bytearray(ba1)
> > buf2 = nbd.Buffer.from_bytearray(ba2)
> > touch(witness)
> > start_t = time.time()
> > h.aio_pwrite(buf1, 0)
> > h.aio_pwrite(buf2, 0)
> >
> > while h.aio_in_flight() > 0:
> >     h.poll(-1)
> > end_t = time.time()
> > os.unlink(witness)
> >
> > out = h.pread(16,0)
> > print(out)
> > t = end_t - start_t
> > print(t)
> > assert out in [b"1"*16, b"2"*16]
> > assert t >= 2 and t <= 3
>
> Based on the later message, it looks like t is higher than expected,
> and that this is the assertion that is failing...
>
> > libnbd: debug: nbdsh: nbd_connect_uri: leave: ret=0
> > libnbd: debug: nbdsh: nbd_aio_pwrite: enter: buf="
> > 0000: 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 |1111111111111111|
> > " count=16 offset=0 completion=NULL flags=0x0
> > libnbd: debug: nbdsh: nbd_aio_pwrite: leave: ret=1
> > libnbd: debug: nbdsh: nbd_aio_pwrite: enter: buf="
> > 0000: 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 |2222222222222222|
> > " count=16 offset=0 completion=NULL flags=0x0
> > libnbd: debug: nbdsh: nbd_aio_pwrite: leave: ret=2
> > libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> > libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=2
>
> We definitely kicked off two parallel aio_pwrite commands...
> > libnbd: debug: nbdsh: nbd_poll: enter: timeout=-1
> > libnbd: debug: nbdsh: nbd_poll: poll start: events=1
> > libnbd: debug: nbdsh: nbd_poll: poll end: r=1 revents=1
> > libnbd: debug: nbdsh: nbd_poll: leave: ret=1
> > libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> > libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=1
> > libnbd: debug: nbdsh: nbd_poll: enter: timeout=-1
> > libnbd: debug: nbdsh: nbd_poll: poll start: events=1
> > libnbd: debug: nbdsh: nbd_poll: poll end: r=1 revents=1
> > libnbd: debug: nbdsh: nbd_poll: leave: ret=1
> > libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> > libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=0
>
> ...and we were able to poll until both completed, but lack timestamps
> to see how long we polled each time...
>
> > libnbd: debug: nbdsh: nbd_pread: enter: buf=<buf> count=16 offset=0
> > flags=0x0
> > libnbd: debug: nbdsh: nbd_pread: poll start: events=1
> > libnbd: debug: nbdsh: nbd_pread: poll end: r=1 revents=1
> > libnbd: debug: nbdsh: nbd_pread: leave: ret=0
> > Traceback (most recent call last):
> >   File "<frozen runpy>", line 198, in _run_module_as_main
> >   File "<frozen runpy>", line 88, in _run_code
> >   File "/usr/lib/python3.12/site-packages/nbd.py", line 3986, in <module>
> >     nbdsh.shell()
> >   File "/usr/lib/python3.12/site-packages/nbdsh.py", line 149, in shell
> >     do_snippet[act](arg)
> >   File "/usr/lib/python3.12/site-packages/nbdsh.py", line 140, in <lambda>
> >     "command": lambda arg: exec(arg, d, d),
> >                            ^^^^^^^^^^^^^^^
> >   File "<string>", line 32, in <module>
> > AssertionError
> > sanity check
> > bytearray(b'2222222222222222')
> > 5.224142789840698
>
> ...we then checked the contents (it looks like the write with contents
> "2"*16 landed last), and overall we spent 5.2 seconds doing all this
> work, when we expected between 2 and 3 seconds.
> Given that the server is set up with the delay filter adding 2
> seconds before each pwrite, it seems like we would expect > 2 seconds
> no matter what, but that we would be > 4 seconds only if the two
> writes are serialized or else heavy load gets in the way of fast
> execution (we should have < 4 seconds if the two writes are done in
> parallel).
>
> The assert that t <= 3 is probably stricter than necessary; checking
> whether t < 4 is going to add a bit more leeway.
>
> So how about re-running the test with these enhancements, to see if
> we can learn more about the situation?
>
> diff --git i/tests/test-blocksize-sharding.sh w/tests/test-blocksize-sharding.sh
> index 7e926ad1..3711c49c 100755
> --- i/tests/test-blocksize-sharding.sh
> +++ w/tests/test-blocksize-sharding.sh
> @@ -80,9 +80,13 @@ witness = os.getenv("witness")
>  def touch(path):
>      open(path, "a").close()
>
> +def progress(prefix, start):
> +    t = time.time() - start
> +    print("%s: %g" % (prefix, t), flush=True)
> +
>  # First pass: check that two aligned operations work in parallel
>  # Total time should be closer to 2 seconds, rather than 4 if serialized
> -print("sanity check")
> +print("sanity check", flush=True)
>  ba1 = bytearray(b"1"*16)
>  ba2 = bytearray(b"2"*16)
>  buf1 = nbd.Buffer.from_bytearray(ba1)
> @@ -91,21 +95,24 @@ touch(witness)
>  start_t = time.time()
>  h.aio_pwrite(buf1, 0)
>  h.aio_pwrite(buf2, 0)
> +progress("after writes", start_t)
>
>  while h.aio_in_flight() > 0:
>      h.poll(-1)
> +    progress("after poll", start_t)
>  end_t = time.time()
>  os.unlink(witness)
>
>  out = h.pread(16,0)
> -print(out)
> +progress("after read", start_t)
> +print(out, flush=True)
>  t = end_t - start_t
> -print(t)
> +print("elapsed write time: %g" % t, flush=True)
>  assert out in [b"1"*16, b"2"*16]
>  assert t >= 2 and t <= 3
>
>  # Next pass: try to kick off unaligned first
> -print("unaligned first")
> +print("unaligned first", flush=True)
>  h.zero(16, 0)
>  ba3 = bytearray(b"3"*12)
>  ba4 = bytearray(b"4"*16)
> @@ -122,14 +129,14 @@ end_t = time.time()
>  os.unlink(witness)
>
>  out = h.pread(16,0)
> -print(out)
> +print(out, flush=True)
>  t = end_t - start_t
> -print(t)
> +print(t, flush=True)
>  assert out in [b"4"*4 + b"3"*12, b"4"*16]
>  assert t >= 8
>
>  # Next pass: try to kick off aligned first
> -print("aligned first")
> +print("aligned first", flush=True)
>  ba5 = bytearray(b"5"*16)
>  ba6 = bytearray(b"6"*12)
>  buf5 = nbd.Buffer.from_bytearray(ba5)
> @@ -146,9 +153,9 @@ end_t = time.time()
>  os.unlink(witness)
>
>  out = h.pread(16,0)
> -print(out)
> +print(out, flush=True)
>  t = end_t - start_t
> -print(t)
> +print(t, flush=True)
>  assert out in [b"5"*4 + b"6"*12, b"5"*16]
>  assert t >= 8
>  '
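To make the timing argument concrete, the parallel-vs-serialized
distinction the test relies on can be shown with a standalone toy
simulation.  This does not use nbdkit or libnbd at all: plain threads
stand in for the two in-flight writes, and the delay filter's 2 second
per-write delay is scaled down to 0.2 seconds so it runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2   # stand-in for the delay filter's 2 second per-write delay

def delayed_write():
    # Simulate a pwrite that the delay filter holds up for DELAY seconds.
    time.sleep(DELAY)

# Serialized: the second write cannot start until the first finishes,
# so the total elapsed time is roughly 2 * DELAY.
start = time.time()
delayed_write()
delayed_write()
serial_t = time.time() - start

# Parallel: both writes sleep at the same time, so the total elapsed
# time is roughly DELAY.
start = time.time()
with ThreadPoolExecutor(max_workers=2) as ex:
    futures = [ex.submit(delayed_write) for _ in range(2)]
    for f in futures:
        f.result()
parallel_t = time.time() - start

print("serialized: %.2fs" % serial_t)
print("parallel:   %.2fs" % parallel_t)
```

This is why the observed 5.2 seconds is suspicious: it is above even
the serialized bound of ~4 seconds, suggesting load or scheduling
delays on top of serialization.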
I think we should add this patch upstream anyway, especially the
flushes.  More debugging is better in case we can't reproduce this now
and it turns up again in future.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

_______________________________________________
Libguestfs mailing list -- guestfs@lists.libguestfs.org
To unsubscribe send an email to guestfs-le...@lists.libguestfs.org