Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2019-06-25 Thread Barry Arndt
A few things:
 
I verified that this problem does not recreate on buster, as Ryan 
mentioned earlier, because of the disabling of lock elision in glibc.
 
I verified that this problem still exists on a Debian 9.8 base with 
upstream kernel 5.1.0-x, which contains up-to-date TM fixes.
 
Our kernel team is actively debugging this problem on upstream kernel 
5.2.0-rc5, and is in favor of disabling TM in Debian until this issue is 
resolved.
 
Thanks.
Barry



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2019-04-13 Thread Ivo De Decker
Hi Ben,

On Sun, Feb 10, 2019 at 05:07:07PM +, Ben Hutchings wrote:
> On Tue, 3 Jul 2018 18:51:49 -0700 Ryan Tandy  wrote:
> > Control: found -1 4.17.3-1
> > 
> > Hi,
> > 
> > On Mon, Jul 02, 2018 at 08:42:45PM -0700, Ryan Tandy wrote:
> > >However it looks like it might be resolved with the kernel in 
> > >unstable. I will run some more iterations and let you know.
> > 
> > The issue still reproduces with stretch userland and the current 
> > unstable kernel.
> > 
> > It cannot be reproduced with buster's glibc because lock elision was 
> > disabled in #878071, citing this issue.
> 
> This bug has been open for over 18 months with no sign of a solution,
> which is really not acceptable for a release architecture.  I propose
> to disable CONFIG_PPC_TRANSACTIONAL_MEMORY in both stretch and
> unstable, until a fix is available for TM.

We're getting pretty close to the release. Are you planning to disable
CONFIG_PPC_TRANSACTIONAL_MEMORY for buster?

Thanks,

Ivo



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2019-02-10 Thread Ben Hutchings
On Tue, 3 Jul 2018 18:51:49 -0700 Ryan Tandy  wrote:
> Control: found -1 4.17.3-1
> 
> Hi,
> 
> On Mon, Jul 02, 2018 at 08:42:45PM -0700, Ryan Tandy wrote:
> >However it looks like it might be resolved with the kernel in 
> >unstable. I will run some more iterations and let you know.
> 
> The issue still reproduces with stretch userland and the current 
> unstable kernel.
> 
> It cannot be reproduced with buster's glibc because lock elision was 
> disabled in #878071, citing this issue.

This bug has been open for over 18 months with no sign of a solution,
which is really not acceptable for a release architecture.  I propose
to disable CONFIG_PPC_TRANSACTIONAL_MEMORY in both stretch and
unstable, until a fix is available for TM.

Ben.
 
-- 
Ben Hutchings
The world is coming to an end.  Please log off.




signature.asc
Description: This is a digitally signed message part


Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2018-07-03 Thread Ryan Tandy

Control: found -1 4.17.3-1

Hi,

On Mon, Jul 02, 2018 at 08:42:45PM -0700, Ryan Tandy wrote:
However it looks like it might be resolved with the kernel in 
unstable. I will run some more iterations and let you know.


The issue still reproduces with stretch userland and the current 
unstable kernel.


It cannot be reproduced with buster's glibc because lock elision was 
disabled in #878071, citing this issue.


Thanks
Ryan



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2018-07-02 Thread Ryan Tandy

Control: found -1 4.9.88-1+deb9u1

Hi,

On Sun, Jul 01, 2018 at 03:42:12AM +0100, Ben Hutchings wrote:

Is this bug still present?


It still reproduces with the current stretch kernel:

debian@debian:~/openldap-2.4.44+dfsg/debian/build/tests$ ./run -b mdb 
test060-mt-hot
Cleaning up test run directory leftover from previous run.
Running ../../../tests/scripts/test060-mt-hot for mdb...
running defines.sh
Running slapadd to build slapd database...
Running slapindex to index slapd database...
Starting slapd on TCP/IP port 9011...
/home/debian/openldap-2.4.44+dfsg/debian/build/tests/../servers/slapd/slapd -s0 
-f /home/debian/openldap-2.4.44+dfsg/debian/build/tests/testrun/slapd.1.conf -h 
ldap://localhost:9011/ -d stats
Testing basic monitor search...
Monitor searches
Testing basic mt-hot search: 1 threads (1 x 5) loops...
./progs/slapd-mtread -H ldap://localhost:9011/ -D cn=Manager,dc=example,dc=com 
-w secret -e cn=Monitor -m 1 -L 1 -l 5
Testing basic mt-hot search: 5 threads (1 x 1) loops...
./progs/slapd-mtread -H ldap://localhost:9011/ -D cn=Manager,dc=example,dc=com 
-w secret -e cn=Monitor -m 5 -L 1 -l 1
Testing basic mt-hot search: 100 threads (5 x 100) loops...
./progs/slapd-mtread -H ldap://localhost:9011/ -D cn=Manager,dc=example,dc=com 
-w secret -e cn=Monitor -m 100 -L 5 -l 100
Random searches
Testing random mt-hot search: 1 threads (1 x 5) loops...
./progs/slapd-mtread -H ldap://localhost:9011/ -D cn=Manager,dc=example,dc=com 
-w secret -e dc=example,dc=com -f (objectclass=*) -m 1 -L 1 -l 5
Testing random mt-hot search: 5 threads (1 x 1) loops...
./progs/slapd-mtread -H ldap://localhost:9011/ -D cn=Manager,dc=example,dc=com 
-w secret -e dc=example,dc=com -f (objectclass=*) -m 5 -L 1 -l 1
slapd-mtread failed (139)!
debian@debian:~/openldap-2.4.44+dfsg/debian/build/tests$ uname -a
Linux debian 4.9.0-6-powerpc64le #1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07) 
ppc64le GNU/Linux

However it looks like it might be resolved with the kernel in unstable. 
I will run some more iterations and let you know.


a7771176 (4.15) looks interesting - it's specifically tagged as fixing 
the commit I isolated as the broken one (dc16b553), and the message 
sounds relevant. If the unstable kernel survives multiple runs I'll see 
what this does on 4.9.


https://patchwork.ozlabs.org/patch/833190/



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2018-07-01 Thread Ryan Tandy

On Sun, Jul 01, 2018 at 03:42:12AM +0100, Ben Hutchings wrote:

Is this bug still present?


Thanks for the ping. The Unicamp Minicloud, where I was granted access 
to a VM for testing, is currently down for maintenance until July 7. I 
will check this again after it comes back.




Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2018-06-30 Thread Ben Hutchings
Control: tag -1 moreinfo

On Fri, 4 Aug 2017 16:14:24 -0500 "Barry Arndt"  wrote:
> On Thu, 27 Jul 2017 10:20:12 -0300 Breno Leitao  
> wrote:
> > Although lack of recent updates, we are still working on this problem.
> > 
> > Barry (on CC) is allocated to work on this issue and should have updates 
> soon.
> > 
> > 
> > 
> 
> The offending line of code that Breno mentioned earlier was part of a 
> larger patch.  Looking at that larger patch, I'm afraid that just removing 
> the offending line of code would make that patch incomplete and could 
> possibly leave other errors behind.
> 
> We have a few patches for other issues in this area of code.  I tested a 
> couple of them, but they did not resolve this issue.  I'm working with 
> another developer (who is currently working on other items related to this 
> one) to gather more debug information to narrow down to the correct fix.

Is this bug still present?

I can see a fix to VSX handling in the kernel that went into mainline
just after this, and then into 4.9.45:

commit 5a69aec945d27e78abac9fd032533d3aaebf7c1e
Author: Benjamin Herrenschmidt 
Date:   Wed Aug 16 16:01:14 2017 +1000

powerpc: Fix VSX enabling/flushing to also test MSR_FP and MSR_VEC

Ben.

-- 
Ben Hutchings
Q.  Which is the greater problem in the world today,
ignorance or apathy?
A.  I don't know and I couldn't care less.



signature.asc
Description: This is a digitally signed message part


Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-08-04 Thread Barry Arndt
On Thu, 27 Jul 2017 10:20:12 -0300 Breno Leitao  
wrote:
> Although lack of recent updates, we are still working on this problem.
> 
> Barry (on CC) is allocated to work on this issue and should have updates 
soon.
> 
> 
> 

The offending line of code that Breno mentioned earlier was part of a 
larger patch.  Looking at that larger patch, I'm afraid that just removing 
the offending line of code would make that patch incomplete and could 
possibly leave other errors behind.

We have a few patches for other issues in this area of code.  I tested a 
couple of them, but they did not resolve this issue.  I'm working with 
another developer (who is currently working on other items related to this 
one) to gather more debug information to narrow down to the correct fix.



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-27 Thread Breno Leitao
Although lack of recent updates, we are still working on this problem.

Barry (on CC) is allocated to work on this issue and should have updates soon.



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-12 Thread Breno Leitao
Some new discover I did today:

1) On function do_random(), the 'values' pointer is being corrupted after the
rand() syscall - In the failure case. If I remove the rand() function, I do
not see corruption.

2) If I get the original 'values' pointer, save it and check it later, it is
corrupted also. This guarantee that the there is no one else changing the
pointer. So, I suspect that this value is being corrupted during the context
switch or vsx unavailable tm (which is being called also).

This is the type of the pointer corruption I see:

>From 0x7fff8970 to 0x7fff88000970.

Since the array pointer changes, the 'r' value will point to r + the
corruption displacement.



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-11 Thread Breno Leitao
hi Ryan,

On 07/11/2017 02:58 AM, Ryan Tandy wrote:
> Today I built Linux 4.12 from upstream source and the test program still
> crashes. I was looking at your fixes to initialize load_{fp,tm,vec} as well
> as someone else fixing the CONFIG_ALIVEC typo but none of those have helped.

Right, I tested it with the pending patches for HTM and the bug is still
there, so, I doubt is has been fixed already.

> I did confirm on this kernel that reverting 613036d9 still stops it from
> crashing. Tomorrow I will try to narrow it down to a specific change. There
> are only 4 hunks after all (the addition of msr_tm_active cannot be reverted
> as there are more calls to it now).

In fact I just did it and I found that the following patch fixes the
problem.  I am not able to understand why yet. Working on it right now.

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 9f3e2c932dcc..21bcb3b19758 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -231,7 +231,7 @@ void enable_kernel_fp(void)
 EXPORT_SYMBOL(enable_kernel_fp);
 
 static int restore_fp(struct task_struct *tsk) {
-   if (tsk->thread.load_fp || msr_tm_active(tsk->thread.regs->msr)) {
+   if (tsk->thread.load_fp) {
load_fp_state(>thread.fp_state);
current->thread.load_fp++;
return 1;

> It turns out it is _not_ compiler dependent. The test program compiled in a
> jessie chroot succeeds in that chroot and then crashes if I run the same
> binary in a stretch chroot. This also means I was wrong about the m{t,f}vsrd
> instructions being related, as gcc-4.9 doesn't emit them (for this particular
> program, at least).

I  understand that glibc might have VSX instructions, so, even if your
application is not using VSX instructions, it might be required
depending on the glibc version you are using, although the problem seems
to be on the float point (FP) side.

> objdump -d libpthread.so.0 output apparently lists some tbegin/tend
> instructions, but I suppose those could be selected at runtime?

Correct. I checked and Debian is enabling HTM[1] to do lock
ellision. It is not a option that you can change on runtime, we would
need to reconfigure/recompile glibc if we want to disable it.  There is
currently an effort to use glibc tunnables to enable/disable lock
elision at runtime, but this is still under development.

Out of curiosity, how did you bisect the kernel to find that commit-id?
Did you do it automatically?

[1] 
https://buildd.debian.org/status/fetch.php?pkg=glibc=ppc64el=2.24-12=1497900384=0
 (Check for --enable-lock-elision)



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-11 Thread Ryan Tandy

Control: tag -1 upstream

Hi Breno,

Today I built Linux 4.12 from upstream source and the test program still 
crashes. I was looking at your fixes to initialize load_{fp,tm,vec} as 
well as someone else fixing the CONFIG_ALIVEC typo but none of those 
have helped.


I did confirm on this kernel that reverting 613036d9 still stops it from 
crashing. Tomorrow I will try to narrow it down to a specific change. 
There are only 4 hunks after all (the addition of msr_tm_active cannot 
be reverted as there are more calls to it now).


It turns out it is _not_ compiler dependent. The test program compiled 
in a jessie chroot succeeds in that chroot and then crashes if I run the 
same binary in a stretch chroot. This also means I was wrong about the 
m{t,f}vsrd instructions being related, as gcc-4.9 doesn't emit them (for 
this particular program, at least).


On Mon, Jul 10, 2017 at 10:23:40AM -0300, Breno Leitao wrote:

Anyway, this is what I am going to investigate now:

1) If glibc's pthread method is using hardware transactional memory by
default.  I remember that upstream enabled it once and then disabled by
default.


objdump -d libpthread.so.0 output apparently lists some tbegin/tend 
instructions, but I suppose those could be selected at runtime?


Thanks
Ryan



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-10 Thread Breno Leitao
hi Ryan,

On Sun, Jul 09, 2017 at 08:56:59AM -0700, Ryan Tandy wrote:
> There seems to be a regression on powerpc64 (both endians) that can corrupt
> the vector-scalar registers (VSRs) in a threaded program.
> 
> I believe the bad commit is this one:
> 
> 4.9.0: 
> https://github.com/torvalds/linux/commit/dc16b553c949e81f37555777dc7bab66d78285a7
> 4.8.6: 
> https://github.com/linux-stable/linux-stable/commit/613036d9e91990f2043130ff8f78fd770432b3de

Nice work!

> My program is not using transactional memory itself, but I don't know what
> libc/libpthread might do internally.

That is interesting. I also found a bug last week related to hardware
transactional memory, and it corrupts vector-scalar register 0 (vs0)
when using hardware transactional memory. In our case, every exception
would cause a wrong reclaim (grab the register values from the
checkpointed area), which will affect later, the recheckpoint (put the
stacked registers into the checkpointed area). This is specific for
hardware transactional memory.

The full description of the bug and a possible patch should be found at:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-July/160117.html

> 4.8.7-1 is the first Debian kernel where I observe the crash. I rebuilt that
> package with the above commit reverted and it went away.

Anyway, this is what I am going to investigate now:

1) If glibc's pthread method is using hardware transactional memory by
default.  I remember that upstream enabled it once and then disabled by
default.

2) Investigate this commit ID and check for a possible corruption
depending on the result above.



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-09 Thread Ryan Tandy

Control: notforwarded -1
Control: reassign -1 src:linux 4.8.7-1
Control: affects -1 src:openldap

Hi debian-kernel,

There seems to be a regression on powerpc64 (both endians) that can 
corrupt the vector-scalar registers (VSRs) in a threaded program.


I believe the bad commit is this one:

4.9.0: 
https://github.com/torvalds/linux/commit/dc16b553c949e81f37555777dc7bab66d78285a7
4.8.6: 
https://github.com/linux-stable/linux-stable/commit/613036d9e91990f2043130ff8f78fd770432b3de

My program is not using transactional memory itself, but I don't 
know what libc/libpthread might do internally.


4.8.7-1 is the first Debian kernel where I observe the crash. I rebuilt 
that package with the above commit reverted and it went away.


The crash still occurs with 4.11.6-1 from unstable. I have not tested 
anything more recent.


I still need to work up a reduced test case. For now the steps to 
reproduce the crash with the openldap package are:


 apt-get build-dep openldap
 apt-get source openldap
 cd openldap-*/
 DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -T build
 cd debian/build/tests
 ./run -b bdb test060-mt-hot

Thanks
Ryan



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-08 Thread Ryan Tandy

On Sat, Jul 08, 2017 at 11:15:48AM -0700, Ryan Tandy wrote:
The bug reproduces consistently when building with stretch's toolchain 
and running stretch's kernel, but I have not seen it at all with 
jessie's kernel *nor* in a jessie chroot on stretch's kernel. I have 
also reproduced it with clang as well as gcc.


The newer compilers generate POWER8 vector instructions e.g. 
mtvsrd/mfvsrd while the older ones do not. A kernel bug related to (or 
triggered by) these instructions would meet the conditions I described. 
Howard suggested we could be looking at a kernel bug around FP/VSR 
save/restore.




Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-08 Thread Ryan Tandy
I spent some more time on this issue and I believe I isolated the 
problem to the following line:


int r = ((double)nvalues)*rand()/(RAND_MAX + 1.0);

http://sources.debian.net/src/openldap/2.4.44%2Bdfsg-7/tests/progs/slapd-mtread.c/#L682

Specifically, during the multi-threaded tests, casting 'nvalues' from 
int to double occasionally comes up with the wrong result.


e.g.:
(double)19 -> 61.00

Note that 'nvalues' is a local variable in the same function. There 
should be zero chance of it being shared with another thread or anything 
like that.


The bug reproduces consistently when building with stretch's toolchain 
and running stretch's kernel, but I have not seen it at all with 
jessie's kernel *nor* in a jessie chroot on stretch's kernel. I have 
also reproduced it with clang as well as gcc.




Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-07 Thread John Paul Adrian Glaubitz
On 07/07/2017 03:12 AM, Breno Leitao wrote:
> (...)
> That said, I understand that there are two bugs:
> 
> 1) we shouldn't iterate over values[] if it is bogus.
> 2) ldap_first_entry() shouldn't return NULL (real problem)
> 
> So, answering your question. My patch *didn't* fix the real issue (2), but the
> consequence of it. That explains why we do not see this on other systems.

Thanks for the explanation. Sounds quite tricky to track down.

> Anyway, I am still investigating the real problem here, and Howard, from
> OpenLdap, is kindly helping.

Great. Thank you so much for fixing this!

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-06 Thread Breno Leitao
Hey Adrian,

On Thu, Jul 06, 2017 at 10:44:21PM +0200, John Paul Adrian Glaubitz wrote:
> Hi Breno!
> 
> On 07/06/2017 09:31 PM, Breno Leitao wrote:
> > I think I found the real case of the problem here. There is an array
> > being allocated, and initialized until 'i'.
> > 
> > Later, we use a random number 'r' to access this array, and it will fail
> > if 'r' is bigger than 'i', since any value bigger than 'i' will not be
> > initialized. It might fail harder if 'r' is being allocated than 'nvalues', 
> > but
> > I didn't get this scenario during my debugs.
> 
> Interesting bug. I wonder how this bug is only triggering on ppc*. Is that
> code that is run on these targets only?

In fact, doing further debugs I understand that the problem is somewhere
else, and what I am seeing is just a consequence of a prior error that was not
prior handled. This is the test case:

 for ( i = 0, e = ldap_first_entry( ld, res ); e != NULL; i++, e = 
ldap_next_entry( ld, e ) )
 {
 values[ i ] = ldap_get_dn( ld, e );
 }
 values[ i ] = NULL;

 ...

 for ( i = 0; i < innerloop; i++ ) {
 int r = ((double)nvalues)*rand()/(RAND_MAX + 1.0);

 do_read( ld, values[ r ],
 srchattrs, noattrs, nobind, 1, maxretries,
 delay, force, chaserefs, idx );
 }


In ppc64el case, ldap_first_entry() is returning NULL, thus values[], other
than values[0], but the code will contain garbage and we will interate over it
later.

That said, I understand that there are two bugs:

1) we shouldn't iterate over values[] if it is bogus.
2) ldap_first_entry() shouldn't return NULL (real problem)

So, answering your question. My patch *didn't* fix the real issue (2), but the
consequence of it. That explains why we do not see this on other systems.

Anyway, I am still investigating the real problem here, and Howard, from
OpenLdap, is kindly helping.



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-06 Thread John Paul Adrian Glaubitz
Hi Breno!

On 07/06/2017 09:31 PM, Breno Leitao wrote:
> I think I found the real case of the problem here. There is an array
> being allocated, and initialized until 'i'.
> 
> Later, we use a random number 'r' to access this array, and it will fail
> if 'r' is bigger than 'i', since any value bigger than 'i' will not be
> initialized. It might fail harder if 'r' is being allocated than 'nvalues', 
> but
> I didn't get this scenario during my debugs.

Interesting bug. I wonder how this bug is only triggering on ppc*. Is that
code that is run on these targets only?

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-06 Thread Breno Leitao
Ryan,

> Nice. I was able to reproduce it and debug it further. The problem seems
> to be related to a invalid branch/jump, the the next address is not
> memory mapped, thus the segfault. The new address is completely random,
> and definitely is wrong. 

I think I found the real case of the problem here. There is an array
being allocated, and initialized until 'i'.

Later, we use a random number 'r' to access this array, and it will fail
if 'r' is bigger than 'i', since any value bigger than 'i' will not be
initialized. It might fail harder if 'r' is being allocated than 'nvalues', but
I didn't get this scenario during my debugs.

That is how I see the array and how we are trying to access it:

0 invalues  RAND_MAX
|-|---|--|
| initialized |   mapped  | unmapped |
|-|---|--|

I understand that forcing the values to be smaller than 'i' is what we want.

This is a patch I've created to fix this problem. The problem seems to happen
upstream also. If this patch is OK for you , I will create a pull request
to send this fix upstream.

---

diff --git a/tests/progs/slapd-mtread.c b/tests/progs/slapd-mtread.c
index c9c872918..15316b360 100644
--- a/tests/progs/slapd-mtread.c
+++ b/tests/progs/slapd-mtread.c
@@ -635,7 +635,7 @@ do_random( LDAP *ld,
int i = 0, do_retry = maxretries;
char*attrs[ 2 ];
int rc = LDAP_SUCCESS;
-   int nvalues = 0;
+   int maxnvalue, nvalues = 0;
char**values = NULL;
LDAPMessage *res = NULL, *e = NULL;
charthrstr[BUFSIZ];
@@ -668,6 +668,7 @@ do_random( LDAP *ld,
values[ i ] = ldap_get_dn( ld, e );
}
values[ i ] = NULL;
+   maxnvalue = i;

ldap_msgfree( res );

@@ -680,6 +681,7 @@ do_random( LDAP *ld,

for ( i = 0; i < innerloop; i++ ) {
int r = ((double)nvalues)*rand()/(RAND_MAX + 1.0);
+   r %= maxnvalue;

do_read( ld, values[ r ],
srchattrs, noattrs, nobind, 1, maxretries,



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-04 Thread Breno Leitao
Hi Ryan,

On Mon, Jul 03, 2017 at 03:39:35PM -0700, Ryan Tandy wrote:
> Hi debian-powerpc,
>
> Would a ppc64(el) porter be able to help me look at #866122? I have
> requested a porterbox account but it's not gone through yet, and I am unable
> to reproduce the issue at all in a qemu VM.

You can also request a VM on the minicloud at
http://openpower.ic.unicamp.br/minicloud/ if you wish. They are usually
quick on creating accounts.

> The openldap test suite is failing on ppc64 and ppc64el in stretch and
> unstable: the slapd-mtread helper program segfaults (exit 139) in a certain
> test case.
>
> It looks like the builds tend to succeed on jessie's kernel and fail on
> stretch's kernel:

In fact, this problem seems to reproduce once in a while, thus, I would
not trust that it might be related to the kernel/gcc combination at this
very beginning. I am suspecting that it might be related to the amount
of threads created and the order, but I do not have a conclusion yet.
Still investigating.

>  apt-get build-dep openldap
>  apt-get source openldap
>  cd openldap-*/
>  DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -T build
>  cd debian/build/tests
>  ./run -b bdb test060-mt-hot

Nice. I was able to reproduce it and debug it further. The problem seems
to be related to a invalid branch/jump, the the next address is not
memory mapped, thus the segfault. The new address is completely random,
and definitely is wrong. The link register (LR), which is register that
shows the return of the branch (similar to call() on amd64) is always
constant when ALSR is disabled. Other than that, I also saw a stack
corruption, which caused the backtrace to be completely bogus.

Anyway, myself and a colleague are still investigating this problem. I will keep
you informed of the progress of this issue.

Thanks,
Breno



Bug#866122: slapd-mtread crash on ppc64{,el} in stretch/sid

2017-07-03 Thread Ryan Tandy

Hi debian-powerpc,

Would a ppc64(el) porter be able to help me look at #866122? I have 
requested a porterbox account but it's not gone through yet, and I am 
unable to reproduce the issue at all in a qemu VM.


The openldap test suite is failing on ppc64 and ppc64el in stretch and 
unstable: the slapd-mtread helper program segfaults (exit 139) in a 
certain test case.


It looks like the builds tend to succeed on jessie's kernel and fail on 
stretch's kernel:


* 2.4.44+dfsg-4 failed on ppc64el-osuosl-01
 - kernel 4.9.0-2-powerpc64le
 - gcc-6_6.3.0-13
 - binutils_2.28-3
* 2.4.44+dfsg-5 succeeded on ppc64el-unicamp-01
 - kernel 3.16.0-4-powerpc64le
 - gcc-6_6.3.0-18
 - binutils_2.28-5
* 2.4.44+dfsg-6 failed on ppc64el-osuosl-01
 - kernel 4.9.0-3-powerpc64le
 - gcc-6_6.3.0-19
 - binutils_2.28-6
* 2.4.44+dfsg-7 failed on ppc64el-unicamp-01
 - kernel 4.9.0-3-powerpc64le
 - gcc-6_6.3.0-19
 - binutils binutils_2.28-6

with one exception:

* 2.4.44+dfsg-5~bpo8+1 succeeded on ppc64el-osuosl-01
 - kernel 4.9.0-3-powerpc64le
 - gcc-4.9_4.9.2-10
 - binutils_2.25-5+deb8u1

The steps to build and run just the failing test are:

 apt-get build-dep openldap
 apt-get source openldap
 cd openldap-*/
 DEB_BUILD_OPTIONS=nocheck dpkg-buildpackage -T build
 cd debian/build/tests
 ./run -b bdb test060-mt-hot

The slapd-mtread output is in debian/build/tests/testrun/mtread.out.
The testrun folder also contains other logs and files from the test run.
The slapd-mtread source is in tests/progs/slapd-mtread.c.

Thanks in advance for any help you can offer.

Ryan