Bug#1063338: [regression 6.1.67] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")

2024-02-07 Thread Alexander Aring
Hi,

On Wed, Feb 7, 2024 at 1:33 PM Jordan Rife wrote:
>
> On Wed, Feb 7, 2024 at 2:39 AM Salvatore Bonaccorso wrote:
> >
> > Hi Valentin, hi all
> >
> > [This is about a regression reported in Debian for 6.1.67]
> >
> > On Tue, Feb 06, 2024 at 01:00:11PM +0100, Valentin Kleibel wrote:
> > > Package: linux-image-amd64
> > > Version: 6.1.76+1
> > > Source: linux
> > > Source-Version: 6.1.76+1
> > > Severity: important
> > > Control: notfound -1 6.6.15-2
> > >
> > > Dear Maintainers,
> > >
> > > We discovered a bug affecting dlm that prevents any tcp communications by
> > > dlm when booted with debian kernel 6.1.76-1.
> > >
> > > Dlm startup works (corosync-cpgtool shows the dlm:controld group with all
> > > expected nodes) but as soon as we try to add a lockspace dmesg shows:
> > > ```
> > > dlm: Using TCP for communications
> > > dlm: cannot start dlm midcomms -97
> > > ```
> > >
> > > It seems that commit "dlm: use kernel_connect() and kernel_bind()"
> > > (e9cdebbe) was merged to 6.1.
> > >
> > > Checking the code it seems that the changed function dlm_tcp_listen_bind()
> > > fails with error code 97 (EAFNOSUPPORT).
> > > It is called from
> > >
> > > dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
> > > dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
> > > dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
> > > dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()
> > >
> > > The error code is returned all the way to threads_start() where the error
> > > message is emitted.
> > >
> > > Booting with the unsigned kernel from testing (6.6.15-2), which also
> > > contains this commit, works without issues.
> > >
> > > I'm not sure what additional changes are required to get this working or if
> > > rolling back this change is an option.
> > >
> > > We'd be happy to test patches that might fix this issue.
> >
> > Thanks for your report. So we have a 6.1.76-specific regression from
> > the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
> > kernel_bind()").
> >
> > Let's loop in the upstream regression list for tracking, and the people
> > involved in the subsystem, to see if the issue can be identified. As
> > it works in 6.6.15, which also includes the backported commit, it
> > may well be that a prerequisite is missing.
> >
> > # annotate regression with 6.1.y specific commit
> > #regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
> > #regzbot link: https://bugs.debian.org/1063338
> >
> > Any ideas?
> >
> > Regards,
> > Salvatore
>
>
> Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
> and 6.6 stable branches,
> it looks like there is a mismatch here with the dlm_local_addr[0] parameter.
>
> 6.1
> 
>
> static int dlm_tcp_listen_bind(struct socket *sock)
> {
> 	int addr_len;
>
> 	/* Bind to our port */
> 	make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> 	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> 			   addr_len);
> }
>
> 6.6
> 
> static int dlm_tcp_listen_bind(struct socket *sock)
> {
> 	int addr_len;
>
> 	/* Bind to our port */
> 	make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
> 	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
> 			   addr_len);
> }
>
> 6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
> changed
>
> static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
>
> to
>
> static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];
>
> It looks like the kernel_bind() call in 6.1 needs to be modified to match,
> passing dlm_local_addr[0] instead of &dlm_local_addr[0].
>

Makes sense. I tried to cherry-pick e9cdebbe23f1 ("dlm: use
kernel_connect() and kernel_bind()") onto v6.1.67, as I don't see it
there. It failed: the commit does not apply cleanly.

Are we talking about a Debian kernel specific backport here? If so,
maybe somebody forgot to adapt the parts you mentioned.

- Alex



Bug#1063338: [regression 6.1.67] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")

2024-02-07 Thread Jordan Rife
On Wed, Feb 7, 2024 at 2:39 AM Salvatore Bonaccorso wrote:
>
> Hi Valentin, hi all
>
> [This is about a regression reported in Debian for 6.1.67]
>
> On Tue, Feb 06, 2024 at 01:00:11PM +0100, Valentin Kleibel wrote:
> > Package: linux-image-amd64
> > Version: 6.1.76+1
> > Source: linux
> > Source-Version: 6.1.76+1
> > Severity: important
> > Control: notfound -1 6.6.15-2
> >
> > Dear Maintainers,
> >
> > We discovered a bug affecting dlm that prevents any tcp communications by
> > dlm when booted with debian kernel 6.1.76-1.
> >
> > Dlm startup works (corosync-cpgtool shows the dlm:controld group with all
> > expected nodes) but as soon as we try to add a lockspace dmesg shows:
> > ```
> > dlm: Using TCP for communications
> > dlm: cannot start dlm midcomms -97
> > ```
> >
> > It seems that commit "dlm: use kernel_connect() and kernel_bind()"
> > (e9cdebbe) was merged to 6.1.
> >
> > Checking the code it seems that the changed function dlm_tcp_listen_bind()
> > fails with error code 97 (EAFNOSUPPORT).
> > It is called from
> >
> > dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
> > dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
> > dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
> > dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()
> >
> > The error code is returned all the way to threads_start() where the error
> > message is emitted.
> >
> > Booting with the unsigned kernel from testing (6.6.15-2), which also
> > contains this commit, works without issues.
> >
> > I'm not sure what additional changes are required to get this working or if
> > rolling back this change is an option.
> >
> > We'd be happy to test patches that might fix this issue.
>
> Thanks for your report. So we have a 6.1.76-specific regression from
> the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
> kernel_bind()").
>
> Let's loop in the upstream regression list for tracking, and the people
> involved in the subsystem, to see if the issue can be identified. As
> it works in 6.6.15, which also includes the backported commit, it
> may well be that a prerequisite is missing.
>
> # annotate regression with 6.1.y specific commit
> #regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
> #regzbot link: https://bugs.debian.org/1063338
>
> Any ideas?
>
> Regards,
> Salvatore


Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
and 6.6 stable branches,
it looks like there is a mismatch here with the dlm_local_addr[0] parameter.

6.1


static int dlm_tcp_listen_bind(struct socket *sock)
{
	int addr_len;

	/* Bind to our port */
	make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
			   addr_len);
}

6.6

static int dlm_tcp_listen_bind(struct socket *sock)
{
	int addr_len;

	/* Bind to our port */
	make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
			   addr_len);
}

6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
changed

static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];

to

static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];

It looks like the kernel_bind() call in 6.1 needs to be modified to match,
passing dlm_local_addr[0] instead of &dlm_local_addr[0].


-Jordan



Bug#1063338: [regression 6.1.67] dlm: cannot start dlm midcomms -97 after backport of e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()")

2024-02-07 Thread Salvatore Bonaccorso
Hi Valentin, hi all

[This is about a regression reported in Debian for 6.1.67]

On Tue, Feb 06, 2024 at 01:00:11PM +0100, Valentin Kleibel wrote:
> Package: linux-image-amd64
> Version: 6.1.76+1
> Source: linux
> Source-Version: 6.1.76+1
> Severity: important
> Control: notfound -1 6.6.15-2
> 
> Dear Maintainers,
> 
> We discovered a bug affecting dlm that prevents any tcp communications by
> dlm when booted with debian kernel 6.1.76-1.
> 
> Dlm startup works (corosync-cpgtool shows the dlm:controld group with all
> expected nodes) but as soon as we try to add a lockspace dmesg shows:
> ```
> dlm: Using TCP for communications
> dlm: cannot start dlm midcomms -97
> ```
> 
> It seems that commit "dlm: use kernel_connect() and kernel_bind()"
> (e9cdebbe) was merged to 6.1.
> 
> Checking the code it seems that the changed function dlm_tcp_listen_bind()
> fails with error code 97 (EAFNOSUPPORT).
> It is called from
> 
> dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
> dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
> dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
> dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()
> 
> The error code is returned all the way to threads_start() where the error
> message is emitted.
> 
> Booting with the unsigned kernel from testing (6.6.15-2), which also
> contains this commit, works without issues.
> 
> I'm not sure what additional changes are required to get this working or if
> rolling back this change is an option.
> 
> We'd be happy to test patches that might fix this issue.

Thanks for your report. So we have a 6.1.76-specific regression from
the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
kernel_bind()").

Let's loop in the upstream regression list for tracking, and the people
involved in the subsystem, to see if the issue can be identified. As
it works in 6.6.15, which also includes the backported commit, it
may well be that a prerequisite is missing.

# annotate regression with 6.1.y specific commit
#regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
#regzbot link: https://bugs.debian.org/1063338

Any ideas?

Regards,
Salvatore