Lemote-3a-itx-a1101 kernel/PMON bug (Was: Bug#858405: xmlto: intermittent Segmentation fault when building manpages for libreswan on mips64el)
Hi,

[drop some CCs since this is not directly related to the xmlto bug]

On 24/03/17 09:36, YunQiang Su wrote:
> On Fri, Mar 24, 2017 at 1:06 AM, James Cowgill wrote:
>> I believe any of the following will fix this (though not all have been tested):
>> - Reduce the stack usage in xsltproc (the upstream bug)
>> - Upgrade the relevant buildds to Linux >= 4.1
>> - Apply d1fd836dcf00 to jessie's kernel
>> - Disable PIE in xsltproc.
>> - Run xsltproc inside setarch -L / setarch -R
>>
>
> We have some trouble running newer kernels on some Loongson machines,
> as their PMON can only load an initrd of limited size.
> So backporting the patch may be ideal for us for now.

Are you referring to this issue?
https://lists.debian.org/debian-mips/2016/01/msg9.html

This only affects some Loongson machines. I think all the buildds can be
safely upgraded to 4.9 except for mipsel-manda-01, which has the buggy PMON.

Looking at the bug again, all the extra .bss space comes from the giant
mem_section array, which is always used on Loongson because SPARSEMEM_STATIC
is enabled. I am wondering if this patch might help (and if it works for
multi-node Loongsons like the 3B). The Loongson memory initialization code
takes a different path from the other MIPS sub-arches and avoids calling
memory_present until the very end, so it might not need SPARSEMEM_STATIC?

(patch is completely untested)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 7baddfa0e229..3bbb454ab2f5 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2559,7 +2559,7 @@ config ARCH_DISCONTIGMEM_ENABLE

 config ARCH_SPARSEMEM_ENABLE
 	bool
-	select SPARSEMEM_STATIC
+	select SPARSEMEM_STATIC if !MACH_LOONGSON64

 config NUMA
 	bool "NUMA Support"

Thanks,
James
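For a rough feel of why a statically allocated mem_section array can dominate .bss, here is a small back-of-the-envelope sketch. It assumes the static layout has one entry per possible memory section, i.e. 2^(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) entries. The concrete numbers used in the example (48 physical address bits, 28-bit sections, 16 bytes per struct mem_section) are illustrative assumptions, not values taken from this thread or verified against the Loongson configuration.

```python
def mem_section_array_bytes(max_physmem_bits: int,
                            section_size_bits: int,
                            sizeof_mem_section: int) -> int:
    """Estimate the size of a statically allocated mem_section array.

    With a static (non-"extreme") sparsemem layout, the array holds one
    struct mem_section per possible section of physical address space.
    """
    if section_size_bits > max_physmem_bits:
        raise ValueError("section size cannot exceed the physical address width")
    # Number of possible sections across the whole physical address space.
    nr_mem_sections = 1 << (max_physmem_bits - section_size_bits)
    return nr_mem_sections * sizeof_mem_section


# Illustrative assumptions only: 48 physical address bits, 256 MiB
# sections (28 bits), 16-byte struct mem_section.
EXAMPLE_BYTES = mem_section_array_bytes(48, 28, 16)
EXAMPLE_MIB = EXAMPLE_BYTES / (1024 * 1024)
```

Under those assumed parameters the array alone is tens of mebibytes' order of magnitude smaller than RAM but large compared to a typical kernel image, which matters when the bootloader has a size limit. The real figures depend on the kernel configuration in use.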
Re: Bug#858405: xmlto: intermittent Segmentation fault when building manpages for libreswan on mips64el
On Fri, Mar 24, 2017 at 1:06 AM, James Cowgill wrote:
> reassign 858405 xsltproc
> forcemerge 750593 858405
> retitle 750593 xsltproc: bus error on some arches with linux < 4.1
> thanks
>
> Hi,
>
> On 22/03/17 21:01, Daniel Kahn Gillmor wrote:
>> On Wed 2017-03-22 06:22:41 -0400, James Cowgill wrote:
>>> On 22/03/17 01:29, Daniel Kahn Gillmor wrote:
>>>> For debian revisions of 3.20, failures happened on:
>>>>
>>>>   mipsel-manda-02
>>>>   eberlin
>>>>
>>>> Also for revisions of 3.20, successes happened on:
>>>>
>>>>   mipsel-sil-01
>>>>   mipsel-manda-03
>>>>   mipsel-manda-01
>>>
>>> This is a known issue and it only affects Loongson buildds.
>>> Interestingly mipsel-manda-01 is Loongson and didn't fail there so there
>>> may be a random element involved here. I don't think anyone's tracked
>>> down the underlying issue though.
>>
>> thanks, is there a public reference for the known issue that we can
>> point to?
>
> I think #750593 looks a lot like the bug here.
>
> After some investigation, it seems I was being a bit unfair to Loongson.
> This is arguably a non-MIPS-specific bug in Linux < 4.1. It just so
> happens that all the Loongson buildds run jessie's 3.16 kernel and all
> the other buildds run >= 4.7 from backports.
>
> In #750593 there was lots of talk about stack overflows causing this, but
> there is actually another element to this. Indeed, when I reduced the stack
> size with ulimit, the segfaults became more frequent, but
> increasing the stack size didn't help at all. After looking at the
> mappings for a failing process, I saw this (taken just after starting
> xsltproc):
>
> [...]
>> fff7f5-fff7f5c000 ---p 4000 fd:00 1060250 /usr/lib/mips64el-linux-gnuabi64/libeatmydata.so.1.1.2
>> fff7f5c000-fff7f6 rw-p fd:00 1060250 /usr/lib/mips64el-linux-gnuabi64/libeatmydata.so.1.1.2
>> fff7f6-fff7f88000 r-xp fd:00 1060375 /lib/mips64el-linux-gnuabi64/ld-2.24.so
>> fff7f94000-fff7f98000 rw-p 00024000 fd:00 1060375 /lib/mips64el-linux-gnuabi64/ld-2.24.so
>> fff7f98000-fff7fa r-xp fd:00 947544 /usr/bin/xsltproc
>> fff7fa4000-fff7fac000 rw-p 00:00 0
>> fff7fac000-fff7fb rw-p 4000 fd:00 947544 /usr/bin/xsltproc
>> 1d4000-384000 rwxp 00:00 0 [heap]
>> 9e-a04000 rwxp 00:00 0 [stack]
>> ffc000-100 r-xp 00:00 0 [vdso]
>
> Notice that there is a very small gap between the heap and the stack
> here (at least compared to working xsltproc runs). I think that the heap
> is growing to a point where it limits the maximum size of the stack, and
> so increasing the stack size with ulimit doesn't help.
>
> The reason the program and the heap are at these very high addresses is
> that xsltproc is built with PIE and the kernel is treating the
> executable like a mmap and grouping it with all the other libraries. In
> d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR") the behavior
> changed, and now the program and its heap will be mapped at a lower
> address, so the bug does not affect newer kernels. Using "setarch -L" or
> "setarch -R" is another workaround for this bug because that moves the
> program so that there is a much larger gap between the heap and the stack.
>
> This might affect other applications as well. Effectively it means that
> PIE executables which use lots of stack space might not work properly
> with jessie's kernel. The chance that the bug will be hit seems to vary
> between arches, however (depending on what each arch does in
> arch_pick_mmap_layout and arch_randomize_brk) - mips64el seems to be hit
> pretty frequently. In xsltproc's case, PIE was enabled some time ago,
> which is why this bug is quite old.
>
> I believe any of the following will fix this (though not all have been tested):
> - Reduce the stack usage in xsltproc (the upstream bug)
> - Upgrade the relevant buildds to Linux >= 4.1
> - Apply d1fd836dcf00 to jessie's kernel
> - Disable PIE in xsltproc.
> - Run xsltproc inside setarch -L / setarch -R
>

We have some trouble running newer kernels on some Loongson machines,
as their PMON can only load an initrd of limited size.
So backporting the patch may be ideal for us for now.

>>> For the moment, I'll rebuild libreswan again and hope a good buildd is
>>> picked.
>>
>> i see 5 mips64el rebuilds now at
>> https://buildd.debian.org/status/logs.php?pkg=libreswan=3.20-6=sid,
>> but none of them have succeeded yet :/
>>
>> 3 of the builds are from mipsel-manda-02, 1 is from eberlin, and one
>> additional new "bad" builder is:
>>
>>   mipsel-aql-01
>
> There are 3 non-Loongson buildds: mipsel-aql-03, mipsel-manda-03 and
> mipsel-sil-01. I expect libreswan will only build on one of those
> buildds at the
Re: Bug#858405: xmlto: intermittent Segmentation fault when building manpages for libreswan on mips64el
reassign 858405 xsltproc
forcemerge 750593 858405
retitle 750593 xsltproc: bus error on some arches with linux < 4.1
thanks

Hi,

On 22/03/17 21:01, Daniel Kahn Gillmor wrote:
> On Wed 2017-03-22 06:22:41 -0400, James Cowgill wrote:
>> On 22/03/17 01:29, Daniel Kahn Gillmor wrote:
>>> For debian revisions of 3.20, failures happened on:
>>>
>>>   mipsel-manda-02
>>>   eberlin
>>>
>>> Also for revisions of 3.20, successes happened on:
>>>
>>>   mipsel-sil-01
>>>   mipsel-manda-03
>>>   mipsel-manda-01
>>
>> This is a known issue and it only affects Loongson buildds.
>> Interestingly mipsel-manda-01 is Loongson and didn't fail there so there
>> may be a random element involved here. I don't think anyone's tracked
>> down the underlying issue though.
>
> thanks, is there a public reference for the known issue that we can
> point to?

I think #750593 looks a lot like the bug here.

After some investigation, it seems I was being a bit unfair to Loongson.
This is arguably a non-MIPS-specific bug in Linux < 4.1. It just so
happens that all the Loongson buildds run jessie's 3.16 kernel and all
the other buildds run >= 4.7 from backports.

In #750593 there was lots of talk about stack overflows causing this, but
there is actually another element to this. Indeed, when I reduced the stack
size with ulimit, the segfaults became more frequent, but
increasing the stack size didn't help at all. After looking at the
mappings for a failing process, I saw this (taken just after starting
xsltproc):

[...]
> fff7f5-fff7f5c000 ---p 4000 fd:00 1060250 /usr/lib/mips64el-linux-gnuabi64/libeatmydata.so.1.1.2
> fff7f5c000-fff7f6 rw-p fd:00 1060250 /usr/lib/mips64el-linux-gnuabi64/libeatmydata.so.1.1.2
> fff7f6-fff7f88000 r-xp fd:00 1060375 /lib/mips64el-linux-gnuabi64/ld-2.24.so
> fff7f94000-fff7f98000 rw-p 00024000 fd:00 1060375 /lib/mips64el-linux-gnuabi64/ld-2.24.so
> fff7f98000-fff7fa r-xp fd:00 947544 /usr/bin/xsltproc
> fff7fa4000-fff7fac000 rw-p 00:00 0
> fff7fac000-fff7fb rw-p 4000 fd:00 947544 /usr/bin/xsltproc
> 1d4000-384000 rwxp 00:00 0 [heap]
> 9e-a04000 rwxp 00:00 0 [stack]
> ffc000-100 r-xp 00:00 0 [vdso]

Notice that there is a very small gap between the heap and the stack
here (at least compared to working xsltproc runs). I think that the heap
is growing to a point where it limits the maximum size of the stack, and
so increasing the stack size with ulimit doesn't help.

The reason the program and the heap are at these very high addresses is
that xsltproc is built with PIE and the kernel is treating the
executable like a mmap and grouping it with all the other libraries. In
d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR") the behavior
changed, and now the program and its heap will be mapped at a lower
address, so the bug does not affect newer kernels. Using "setarch -L" or
"setarch -R" is another workaround for this bug because that moves the
program so that there is a much larger gap between the heap and the stack.

This might affect other applications as well. Effectively it means that
PIE executables which use lots of stack space might not work properly
with jessie's kernel. The chance that the bug will be hit seems to vary
between arches, however (depending on what each arch does in
arch_pick_mmap_layout and arch_randomize_brk) - mips64el seems to be hit
pretty frequently. In xsltproc's case, PIE was enabled some time ago,
which is why this bug is quite old.
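One way to check for the small heap-to-stack gap described above is to read the [heap] and [stack] entries from a process's /proc/<pid>/maps. The following is a minimal sketch, not a tool used in this thread: the function names are hypothetical, and SAMPLE_MAPS is invented data in the usual maps format, not the output of a real failing run.

```python
import re
from typing import Dict, Optional, Tuple

# Fields of a /proc/<pid>/maps line: address range, perms, offset, dev,
# inode, and an optional pathname such as "[heap]" or "[stack]".
MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+\S+\s+\S+\s+\S+\s+\S+\s*(.*)$"
)


def parse_regions(maps_text: str) -> Dict[str, Tuple[int, int]]:
    """Return the (start, end) addresses of the [heap] and [stack] mappings."""
    regions: Dict[str, Tuple[int, int]] = {}
    for line in maps_text.splitlines():
        match = MAPS_LINE.match(line.strip())
        if not match:
            continue
        name = match.group(3).strip()
        if name in ("[heap]", "[stack]"):
            regions[name] = (int(match.group(1), 16), int(match.group(2), 16))
    return regions


def heap_stack_gap(maps_text: str) -> Optional[int]:
    """Bytes between the end of the heap and the start of the stack.

    The stack grows downward toward the heap, so this gap bounds how far
    it can grow no matter what the stack ulimit says. Returns None if
    either mapping is missing.
    """
    regions = parse_regions(maps_text)
    if "[heap]" not in regions or "[stack]" not in regions:
        return None
    return regions["[stack]"][0] - regions["[heap]"][1]


# Invented sample for illustration only (not taken from the report above).
SAMPLE_MAPS = """\
fff7f98000-fff7fa0000 r-xp 00000000 fd:00 947544 /usr/bin/xsltproc
fff7fb0000-fff8000000 rw-p 00000000 00:00 0 [heap]
fff8800000-fff8a04000 rw-p 00000000 00:00 0 [stack]
"""
```

On a live system the same function could be fed the contents of /proc/<pid>/maps for a running xsltproc and the result compared between failing and working runs.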
I believe any of the following will fix this (though not all have been tested):
- Reduce the stack usage in xsltproc (the upstream bug)
- Upgrade the relevant buildds to Linux >= 4.1
- Apply d1fd836dcf00 to jessie's kernel
- Disable PIE in xsltproc.
- Run xsltproc inside setarch -L / setarch -R

>> For the moment, I'll rebuild libreswan again and hope a good buildd is
>> picked.
>
> i see 5 mips64el rebuilds now at
> https://buildd.debian.org/status/logs.php?pkg=libreswan=3.20-6=sid,
> but none of them have succeeded yet :/
>
> 3 of the builds are from mipsel-manda-02, 1 is from eberlin, and one
> additional new "bad" builder is:
>
>   mipsel-aql-01

There are 3 non-Loongson buildds: mipsel-aql-03, mipsel-manda-03 and
mipsel-sil-01. I expect libreswan will only build on one of those
buildds at the moment.

Thanks,
James
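As an illustration of the setarch workaround from the list above, here is a small sketch of a wrapper that prefixes a command with setarch. The helper names and the file names in the usage note are hypothetical. It assumes util-linux's setarch is installed and that it accepts the machine name reported by platform.machine(); -R (disable address randomization) and -L (legacy address-space layout) are the two flags mentioned in the message.

```python
import platform
import subprocess
from typing import List, Optional, Sequence


def setarch_argv(command: Sequence[str],
                 no_randomize: bool = True,
                 compat_layout: bool = False,
                 arch: Optional[str] = None) -> List[str]:
    """Build an argv that runs `command` under setarch.

    no_randomize adds -R, compat_layout adds -L. `arch` defaults to the
    current machine name (an assumption: setarch must recognise it).
    """
    argv = ["setarch", arch or platform.machine()]
    if no_randomize:
        argv.append("-R")
    if compat_layout:
        argv.append("-L")
    return argv + list(command)


def run_under_setarch(command: Sequence[str], **kwargs) -> int:
    """Run the wrapped command and return its exit status."""
    return subprocess.run(setarch_argv(command, **kwargs), check=False).returncode
```

For example, `run_under_setarch(["xsltproc", "style.xsl", "input.xml"])` would execute `setarch <machine> -R xsltproc style.xsl input.xml`. Whether this avoids the crash on a given buildd still needs to be tested there.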