Hi Wouter,
It is an Ubuntu 16.04 system, with the following package version:
ii nbd-client 1:3.13-1 amd64 Network Block Device protocol - client
Initramfs is 0.122ubuntu8.5
I have removed the close(nbd) and open(nbd) calls from the source code. It
works OK now, except for one small issue: each disconnect/reconnect leaves
an nbd-client process behind. Here is how it looks after 3 such disconnects:
root 355 0.0 0.2 4372 2312 ? SLs 00:30 0:00 @sbin/nbd-client 10.4.104.4 -N root /dev/nbd0 -swap -persist -systemd-mark
root 1816 0.0 0.0 4372 344 ? S 00:33 0:00 @sbin/nbd-client 10.4.104.4 -N root /dev/nbd0 -swap -persist -systemd-mark
root 1842 0.0 0.0 4372 344 ? S 00:34 0:00 @sbin/nbd-client 10.4.104.4 -N root /dev/nbd0 -swap -persist -systemd-mark
root 1843 0.0 0.0 0 0 ? S< 00:34 0:00 [nbd0]
The tcp socket is still connected from the original process, pid 355:
root@host:~# netstat -atnp | grep 10809
tcp 0 0 10.4.104.5:58666 10.4.104.4:10809 ESTABLISHED 355/nbd-client
Running strace on processes 1816 and 1842 shows that both are doing the same
thing in a loop:
open("/sys/block/nbd0/pid", O_RDONLY) = -1 ENOENT (No such file or directory)
nanosleep({0, 100000000}, NULL) = 0
open("/sys/block/nbd0/pid", O_RDONLY) = -1 ENOENT (No such file or directory)
nanosleep({0, 100000000}, NULL) = 0
open("/sys/block/nbd0/pid", O_RDONLY) = -1 ENOENT (No such file or directory)
nanosleep({0, 100000000}, NULL) = 0
open("/sys/block/nbd0/pid", O_RDONLY) = -1 ENOENT (No such file or directory)
nanosleep({0, 100000000}, NULL) = 0
which seems to be related to the following code:
    while(check_conn(nbddev, 0)) {
        nanosleep(&req, NULL);
    }
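
To illustrate why these processes never exit, here is a minimal sketch. It
assumes that check_conn() keeps returning nonzero for as long as
/sys/block/nbd0/pid cannot be opened, which matches the ENOENT in the strace
output but is an assumption about the function's internals. The names
pid_file_present and wait_for_pid_file are hypothetical and not part of
nbd-client; the sketch only shows one possible way to bound such a polling
loop so that it gives up instead of spinning forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if `path` can be opened read-only, else 0. */
static int pid_file_present(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;
    close(fd);
    return 1;
}

/*
 * Hypothetical bounded wait: poll for `path` every 100 ms (the same interval
 * seen in the strace output), but give up after `max_tries` attempts.
 * Returns 0 if the file could be opened, -1 if it never appeared.
 */
static int wait_for_pid_file(const char *path, int max_tries)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = 100000000 };

    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        if (pid_file_present(path))
            return 0;
        nanosleep(&req, NULL);
    }
    return -1;  /* gave up: the file never appeared */
}
```

With something like wait_for_pid_file("/sys/block/nbd0/pid", 50), a helper
process would stop after roughly 5 seconds. Whether a timeout is the right fix
here, or whether the pid file should be appearing in the first place, I
cannot tell.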
On Thu, Nov 24, 2016 at 8:06 PM, Wouter Verhelst <[email protected]> wrote:
> Hi Victor,
>
> Can you give some context here? Which version of nbd, which kernel
> version, which initramfs implementation?
>
> (given what I'm seeing, I'm guessing initramfs-utils on a Debian or
> derivative, but can't be sure)
>
> On Thu, Nov 24, 2016 at 11:37:34AM +0200, Victor wrote:
> > Hello,
> >
> > When nbd-client is started at boot, from initramfs, in order to provide the
> > root filesystem, it is not persistent. The reason itself is pretty strange
> > (cannot open /dev/nbd0). Here is a sequence of steps to observe the behavior:
> >
> > 1. The system is booted ok. nbd-client is active:
> > root 359 0.2 0.2 4372 2212 ? SL 11:16 0:00 @sbin/nbd-client 10.4.104.4 -N root /dev/nbd0 -swap -persist -systemd-mark
> > root 362 0.0 0.0 0 0 ? S< 11:16 0:00 [nbd0]
> >
> > /dev/nbd0 exists:
> > brw-rw---- 1 root disk 43, 0 Nov 24 09:19 /dev/nbd0
> >
> > and the nbd-client process uses it:
> > root@host:~# ls -l /proc/359/fd/
> > total 0
> > lr-x------ 1 root root 64 Nov 24 09:20 0 -> /dev/null
> > lrwx------ 1 root root 64 Nov 24 09:20 1 -> /dev/console (deleted)
> > lrwx------ 1 root root 64 Nov 24 09:20 2 -> /dev/console (deleted)
> > lrwx------ 1 root root 64 Nov 24 09:20 3 -> socket:[9447]
> > lrwx------ 1 root root 64 Nov 24 09:20 4 -> /dev/nbd0
>
> Yes, but it's probably a link to the initramfs /dev, rather than the
> root /dev.
>
> > 2. If I restart the nbd-server, the nbd-client dies/exits. The only way to
> > know what is happening is to strace the nbd-client. This generates a side
> > effect: when strace is attached, the ioctl exits and nbd-client tries to
> > reconnect and dies. So by just stracing the nbd-client, I simulate/force a
> > disconnect/reconnect without any need to restart nbd-server. My guess is
> > that when I restart the nbd-server, the same happens (but I just cannot see
> > it). Please observe the behavior:
> >
> > root@GTSRO-S-123456:~# strace -p 359
> > strace: Process 359 attached
> > getpid() = 359
> > write(2, "nbd,359: Kernel call returned: 1"..., 34) = 34
> > close(3) = 0
> (socket)
> > close(4) = 0
> (nbd device)
> > write(2, " Reconnecting\n", 14) = 14
> > socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE) = 3
> > bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
> > getsockname(3, {sa_family=AF_NETLINK, pid=359, groups=00000000}, [12]) = 0
> > sendto(3, "\24\0\0\0\26\0\1\3Y\2606X\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
> > recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"L\0\0\0\24\0\2\0Y\2606Xg\1\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 256
> > recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"H\0\0\0\24\0\2\0Y\2606Xg\1\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 144
> > recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0Y\2606Xg\1\0\0\0\0\0\0", 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
> > close(3) = 0
> > socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
> > connect(3, {sa_family=AF_INET, sin_port=htons(10809), sin_addr=inet_addr("10.4.104.4")}, 16) = 0
> > setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> (was opennet)
> > open("/dev/nbd0", O_RDWR) = -1 ENOENT (No such file or directory)
>
> and now we don't get anything.
>
> I'm guessing that the failure to open /dev/nbd0 here is due to the fact
> that it used to deal with files on the initramfs, rather than on the
> root filesystem. Halfway through nbd-client's run, the initramfs was
> junked out from under it, and suddenly files pointed to wildly different
> places. That's got to confuse something.
>
> [...]
> > My guess is that this has something to do with the fact that the initial
> > /dev/ filesystem was moved to the /root/dev before initramfs does the chroot,
>
> Yes, that seems very likely.
>
> > but I just have no idea if/how this can be fixed.
>
> It's been a while since I last touched the -persist handling, but at a
> glance, I think we can do without the "close(nbd)" and "nbd =
> open(nbddev, O_RDWR)" calls in lines 1012 and 1023 of current
> nbd-client, and simply reuse the nbd device file descriptor that we
> already had in the reconnect. That way, we don't need to deal with
> opening files on the file system anymore and can therefore not run into
> any ENOENT errors.
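
The point about reusing the descriptor can be illustrated outside nbd-client.
This is a minimal sketch, not the nbd-client patch: it assumes a scratch file
under /tmp can stand in for the device node, and that unlinking it is a fair
stand-in for the path disappearing when /dev is moved. The function name and
the path are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Illustration only: open a scratch file, remove its path, then check that
 * re-opening by path fails with ENOENT while the old descriptor still works.
 * Returns 0 if both observations hold, -1 otherwise.
 */
static int fd_survives_missing_path(void)
{
    char path[] = "/tmp/fd-demo-XXXXXX";  /* hypothetical scratch file */
    char byte = 'x';
    int ok;
    int fd2;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }

    /* The path is gone, so opening it again fails with ENOENT... */
    errno = 0;
    fd2 = open(path, O_RDWR);
    ok = (fd2 < 0 && errno == ENOENT);
    if (fd2 >= 0)
        close(fd2);

    /* ...but the descriptor obtained earlier is still usable. */
    ok = ok && (write(fd, &byte, 1) == 1);

    close(fd);
    return ok ? 0 : -1;
}
```

A descriptor refers to the open file itself rather than to its name, which is
why keeping the existing nbd descriptor avoids the path lookup that fails in
the strace output.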
>
> Could you try if that fixes it?
>
> Regards,
>
> --
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12
>
_______________________________________________
Nbd-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nbd-general