Lemote-3a-itx-a1101 kernel/PMON bug (Was: Bug#858405: xmlto: intermittent Segmentation fault when building manpages for libreswan on mips64el)

2017-03-24 Thread James Cowgill
Hi,

[drop some CCs since this is not directly related to the xmlto bug]

On 24/03/17 09:36, YunQiang Su wrote:
> On Fri, Mar 24, 2017 at 1:06 AM, James Cowgill  wrote:
>> I believe any of the following will fix this (but have not all been tested):
>> - Reduce the stack usage in xsltproc (the upstream bug)
>> - Upgrade the relevant buildds to Linux >= 4.1
>> - Apply d1fd836dcf00 to jessie's kernel
>> - Disable PIE in xsltproc.
>> - Run xsltproc inside setarch -L / setarch -R
>>
> 
> We have some trouble running newer kernels on some Loongson machines,
> as their PMON can only load an initrd of limited size.
> So backporting the patch may be ideal for us for now.

Are you referring to this issue?
https://lists.debian.org/debian-mips/2016/01/msg9.html

This does only affect some Loongson machines. I think all the buildds
can be safely upgraded to 4.9 except for mipsel-manda-01 which has the
buggy PMON.

Looking at the bug again, all the extra .bss space comes from the giant
mem_section array which is always used on Loongson due to
STATIC_SPARSEMEM being enabled. I am wondering if this patch might help
(and if it works for multi-node Loongsons like the 3B). The Loongson
memory initialization code takes a different path to other mips
sub-arches and avoids calling memory_present until the very end, so it
might not need STATIC_SPARSEMEM?

(patch is completely untested)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 7baddfa0e229..3bbb454ab2f5 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2559,7 +2559,7 @@ config ARCH_DISCONTIGMEM_ENABLE

 config ARCH_SPARSEMEM_ENABLE
 	bool
-	select SPARSEMEM_STATIC
+	select SPARSEMEM_STATIC if !MACH_LOONGSON64
 
 config NUMA
 	bool "NUMA Support"
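
For a rough sense of scale, the .bss cost of a statically allocated
mem_section array can be estimated as in the sketch below. The values
passed in for MAX_PHYSMEM_BITS, SECTION_SIZE_BITS and the per-entry size
are assumptions chosen for illustration; they are not taken from this
thread and should be checked against the kernel tree in question.

```python
def static_mem_section_bytes(max_physmem_bits, section_size_bits, entry_bytes):
    """Estimate the size of a static mem_section array.

    With SPARSEMEM_STATIC there is one entry per possible memory section,
    so the entry count is 2 ** (max_physmem_bits - section_size_bits).
    """
    nr_sections = 1 << (max_physmem_bits - section_size_bits)
    return nr_sections * entry_bytes


# Assumed, illustrative values (hypothetical for this example).
size = static_mem_section_bytes(
    max_physmem_bits=48, section_size_bits=28, entry_bytes=16
)
print(f"{size // (1024 * 1024)} MiB")
```

The estimate grows by a factor of two for every extra physical address
bit, which is why a large address space combined with a static array can
matter on firmware with tight load-size limits.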

Thanks,
James





Re: Bug#858405: xmlto: intermittent Segmentation fault when building manpages for libreswan on mips64el

2017-03-24 Thread YunQiang Su
On Fri, Mar 24, 2017 at 1:06 AM, James Cowgill  wrote:
> reassign 858405 xsltproc
> forcemerge 750593 858405
> retitle 750593 xsltproc: bus error on some arches with linux < 4.1
> thanks
>
> Hi,
>
> On 22/03/17 21:01, Daniel Kahn Gillmor wrote:
>> On Wed 2017-03-22 06:22:41 -0400, James Cowgill wrote:
>>> On 22/03/17 01:29, Daniel Kahn Gillmor wrote:
>>>> For debian revisions of 3.20, failures happened on:
>>>>
>>>>   mipsel-manda-02
>>>>   eberlin
>>>>
>>>> Also for revisions of 3.20, successes happened on:
>>>>
>>>>   mipsel-sil-01
>>>>   mipsel-manda-03
>>>>   mipsel-manda-01
>>>
>>> This is a known issue and it only affects Loongson buildds.
>>> Interestingly mipsel-manda-01 is Loongson and didn't fail there so there
>>> may be a random element involved here. I don't think anyone's tracked
>>> down the underlying issue though.
>>
>> thanks, is there a public reference for the known issue that we can
>> point to?
>
> I think #750593 looks a lot like the bug here.
>
> After some investigation, it seems I was being a bit unfair to Loongson.
> This is arguably a non-MIPS-specific bug in Linux < 4.1. It just so
> happens that all the Loongson buildds run jessie's 3.16 kernel and all
> the other buildds run >= 4.7 from backports.
>
> In #750593 there was lots of talk about stack overflows causing this but
> there is actually another element to this. Indeed, when I reduced the
> stack size with ulimit, the segfaults became more frequent, but
> increasing the stack size didn't help at all. After looking at the
> mappings for a failing process, I saw this (taken just after starting
> xsltproc):
>
> [...]
>> fff7f5-fff7f5c000 ---p 4000 fd:00 1060250
>> /usr/lib/mips64el-linux-gnuabi64/libeatmydata.so.1.1.2
>> fff7f5c000-fff7f6 rw-p  fd:00 1060250
>> /usr/lib/mips64el-linux-gnuabi64/libeatmydata.so.1.1.2
>> fff7f6-fff7f88000 r-xp  fd:00 1060375
>> /lib/mips64el-linux-gnuabi64/ld-2.24.so
>> fff7f94000-fff7f98000 rw-p 00024000 fd:00 1060375
>> /lib/mips64el-linux-gnuabi64/ld-2.24.so
>> fff7f98000-fff7fa r-xp  fd:00 947544 
>> /usr/bin/xsltproc
>> fff7fa4000-fff7fac000 rw-p  00:00 0
>> fff7fac000-fff7fb rw-p 4000 fd:00 947544 
>> /usr/bin/xsltproc
>> 1d4000-384000 rwxp  00:00 0  
>> [heap]
>> 9e-a04000 rwxp  00:00 0  
>> [stack]
>> ffc000-100 r-xp  00:00 0 
>> [vdso]
>
> Notice that there is a very small gap between the heap and the stack
> here (at least compared to working xsltproc runs). I think that the heap
> is growing to a point where it limits the maximum size of the stack and
> so increasing the stack size with ulimit doesn't help.
>
> The reason the program and the heap are at these very high addresses is
> that xsltproc is built with PIE and the kernel is treating the
> executable like a mmap and grouping it with all the other libraries. In
> d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR") the behavior
> changed and now the program and its heap will be mapped at a lower
> address so the bug does not affect newer kernels. Using "setarch -L" or
> "setarch -R" is another workaround for this bug because that moves the
> program so that there is a much larger gap between the heap and the stack.
>
> This might affect other applications as well. Effectively it means that
> PIE executables which use lots of stack space might not work properly
> with jessie's kernel. The chance that the bug will be hit seems to vary
> between arches however (depending on what each arch does in
> arch_pick_mmap_layout and arch_randomize_brk) - mips64el seems to be hit
> pretty frequently. In xsltproc's case, PIE was enabled some time ago
> which is why this bug is quite old.
>
> I believe any of the following will fix this (but have not all been tested):
> - Reduce the stack usage in xsltproc (the upstream bug)
> - Upgrade the relevant buildds to Linux >= 4.1
> - Apply d1fd836dcf00 to jessie's kernel
> - Disable PIE in xsltproc.
> - Run xsltproc inside setarch -L / setarch -R
>

We have some trouble running newer kernels on some Loongson machines,
as their PMON can only load an initrd of limited size.
So backporting the patch may be ideal for us for now.

>>> For the moment, I'll rebuild libreswan again and hope a good buildd is
>>> picked.
>>
>> i see 5 mips64el rebuilds now at
>> https://buildd.debian.org/status/logs.php?pkg=libreswan=3.20-6=sid,
>> but none of them have succeeded yet :/
>>
>> 3 of the builds are from mipsel-manda-02, 1 is from eberlin, and one
>> additional new "bad" builder is:
>>
>>   mipsel-aql-01
>
> There are 3 non-Loongson buildds: mipsel-aql-03, mipsel-manda-03 and
> mipsel-sil-01. I expect libreswan will only build on one of those
> buildds at the moment.

Re: Bug#858405: xmlto: intermittent Segmentation fault when building manpages for libreswan on mips64el

2017-03-23 Thread James Cowgill
reassign 858405 xsltproc
forcemerge 750593 858405
retitle 750593 xsltproc: bus error on some arches with linux < 4.1
thanks

Hi,

On 22/03/17 21:01, Daniel Kahn Gillmor wrote:
> On Wed 2017-03-22 06:22:41 -0400, James Cowgill wrote:
>> On 22/03/17 01:29, Daniel Kahn Gillmor wrote:
>>> For debian revisions of 3.20, failures happened on:
>>>
>>>   mipsel-manda-02
>>>   eberlin
>>> 
>>> Also for revisions of 3.20, successes happened on:
>>>
>>>   mipsel-sil-01
>>>   mipsel-manda-03
>>>   mipsel-manda-01
>>
>> This is a known issue and it only affects Loongson buildds.
>> Interestingly mipsel-manda-01 is Loongson and didn't fail there so there
>> may be a random element involved here. I don't think anyone's tracked
>> down the underlying issue though.
> 
> thanks, is there a public reference for the known issue that we can
> point to?

I think #750593 looks a lot like the bug here.

After some investigation, it seems I was being a bit unfair to Loongson.
This is arguably a non-MIPS-specific bug in Linux < 4.1. It just so
happens that all the Loongson buildds run jessie's 3.16 kernel and all
the other buildds run >= 4.7 from backports.

In #750593 there was lots of talk about stack overflows causing this but
there is actually another element to this. Indeed, when I reduced the
stack size with ulimit, the segfaults became more frequent, but
increasing the stack size didn't help at all. After looking at the
mappings for a failing process, I saw this (taken just after starting
xsltproc):

[...]
> fff7f5-fff7f5c000 ---p 4000 fd:00 1060250
> /usr/lib/mips64el-linux-gnuabi64/libeatmydata.so.1.1.2
> fff7f5c000-fff7f6 rw-p  fd:00 1060250
> /usr/lib/mips64el-linux-gnuabi64/libeatmydata.so.1.1.2
> fff7f6-fff7f88000 r-xp  fd:00 1060375
> /lib/mips64el-linux-gnuabi64/ld-2.24.so
> fff7f94000-fff7f98000 rw-p 00024000 fd:00 1060375
> /lib/mips64el-linux-gnuabi64/ld-2.24.so
> fff7f98000-fff7fa r-xp  fd:00 947544 
> /usr/bin/xsltproc
> fff7fa4000-fff7fac000 rw-p  00:00 0
> fff7fac000-fff7fb rw-p 4000 fd:00 947544 
> /usr/bin/xsltproc
> 1d4000-384000 rwxp  00:00 0  
> [heap]
> 9e-a04000 rwxp  00:00 0  
> [stack]
> ffc000-100 r-xp  00:00 0 
> [vdso]

Notice that there is a very small gap between the heap and the stack
here (at least compared to working xsltproc runs). I think that the heap
is growing to a point where it limits the maximum size of the stack and
so increasing the stack size with ulimit doesn't help.
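
To make the comparison between failing and working runs concrete, the
gap can be computed from the text of /proc/<pid>/maps. The following is
a minimal sketch; the function name and the sample addresses are
hypothetical and are not the values from the mapping quoted above.

```python
def heap_stack_gap(maps_text):
    """Return the bytes between the end of [heap] and the start of [stack].

    Returns None if either mapping is missing from the given text.
    """
    heap_end = stack_start = None
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # anonymous mapping without a name
        start, end = (int(x, 16) for x in fields[0].split("-"))
        if fields[5] == "[heap]":
            heap_end = end
        elif fields[5] == "[stack]":
            stack_start = start
    if heap_end is None or stack_start is None:
        return None
    return stack_start - heap_end


# Hypothetical addresses, for illustration only.
SAMPLE = """\
fff1d40000-fff3840000 rwxp 00000000 00:00 0    [heap]
fff9e00000-fffa040000 rwxp 00000000 00:00 0    [stack]
"""
gap = heap_stack_gap(SAMPLE)
print(f"{gap / (1024 * 1024):.1f} MiB between heap and stack")
```

For a live process, the same function can be fed the contents of
/proc/<pid>/maps read shortly after the process starts.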

The reason the program and the heap are at these very high addresses is
that xsltproc is built with PIE and the kernel is treating the
executable like a mmap and grouping it with all the other libraries. In
d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR") the behavior
changed and now the program and its heap will be mapped at a lower
address so the bug does not affect newer kernels. Using "setarch -L" or
"setarch -R" is another workaround for this bug because that moves the
program so that there is a much larger gap between the heap and the stack.
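
One way to see which binaries could be exposed is to check whether an
executable is of ELF type ET_DYN, which is how PIE executables are
marked. This is a sketch only: the helper names are hypothetical, and
ET_DYN is also used for ordinary shared libraries, so a match indicates
a candidate rather than proof of PIE.

```python
import struct

ET_EXEC, ET_DYN = 2, 3  # e_type values from the ELF specification


def elf_type(header):
    """Return e_type from the first 18 (or more) bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian.
    fmt = "<H" if header[5] == 1 else ">H"
    # e_type is the 16-bit field right after the 16-byte e_ident block.
    return struct.unpack_from(fmt, header, 16)[0]


def is_pie_candidate(path):
    """True if the file at path is an ET_DYN ELF object."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == ET_DYN
```

For example, is_pie_candidate("/usr/bin/xsltproc") would be expected to
return True on a system where xsltproc was built with PIE.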

This might affect other applications as well. Effectively it means that
PIE executables which use lots of stack space might not work properly
with jessie's kernel. The chance that the bug will be hit seems to vary
between arches however (depending on what each arch does in
arch_pick_mmap_layout and arch_randomize_brk) - mips64el seems to be hit
pretty frequently. In xsltproc's case, PIE was enabled some time ago
which is why this bug is quite old.

I believe any of the following will fix this (but have not all been tested):
- Reduce the stack usage in xsltproc (the upstream bug)
- Upgrade the relevant buildds to Linux >= 4.1
- Apply d1fd836dcf00 to jessie's kernel
- Disable PIE in xsltproc.
- Run xsltproc inside setarch -L / setarch -R
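
The setarch workaround could be wired into a build script roughly as
sketched below. The helper name is hypothetical, and passing an explicit
architecture name (here "mips64") to setarch is an assumption about the
installed util-linux version; -R disables address randomization and -L
selects the legacy address space layout.

```python
import platform
import shlex


def with_setarch(cmd, flag="-R", arch=None):
    """Prefix cmd so that it runs under setarch with -R or -L."""
    if flag not in ("-R", "-L"):
        raise ValueError("flag must be -R or -L")
    # Fall back to the running machine's architecture name if none given.
    return ["setarch", arch or platform.machine(), flag, *cmd]


argv = with_setarch(["xsltproc", "--version"], arch="mips64")
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run() in place of the
plain xsltproc command line.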

>> For the moment, I'll rebuild libreswan again and hope a good buildd is
>> picked.
> 
> i see 5 mips64el rebuilds now at
> https://buildd.debian.org/status/logs.php?pkg=libreswan=3.20-6=sid,
> but none of them have succeeded yet :/
> 
> 3 of the builds are from mipsel-manda-02, 1 is from eberlin, and one
> additional new "bad" builder is:
> 
>   mipsel-aql-01

There are 3 non-Loongson buildds: mipsel-aql-03, mipsel-manda-03 and
mipsel-sil-01. I expect libreswan will only build on one of those
buildds at the moment.

Thanks,
James


