Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM - SUMMARY
On 21-1-2013 22:20, Michael Haberler wrote:
> On 21.01.2013 at 20:10, Gilles Chanteperdrix wrote:
>> On 01/21/2013 02:32 PM, Michael Haberler wrote:
>>> On 21.01.2013 at 12:56, Gilles Chanteperdrix wrote:
>>>>> question: does a RTC time warp have any possible bearing on Xenomai operations?
>>>>
>>>> No, it should not, Xenomai uses its own clock, which is set only once upon boot, so, is unaffected by Linux wallclock time changes... or should be.
>>>
>>> it might not be Xenomai after all. Uhum.
>>>
>>> the bughunt safari tribe has decided to focus on class 'duh' problems and resolves to shut up until red hands are spotted.
>>
>> I would still put the check in the timer "set_next_event" callback, just in case...
>
> I assume Bas will give the postmortem shortly - he nailed the issue; the RTC boot timewarp makes for a lost DHCP lease midflight and NFS freezing, making it look like a kernel hang.
>
> relieved,
>
> - Michael

Michael said it all, there's not much for me to add. I'll summarize the case for the record ;)

Lesson learned: change only one variable at a time and don't assume anything!

I had been using an NFS-mounted filesystem with the Beaglebone for over a year without problems and had got used to its reliability (as I was used to in a corporate environment in the past). Because the Xenomai software was built with libraries (eabihf) not compatible with my (eabi) system, I switched to the Ubuntu image Michael built, and everything seemed to work fine. Except that the (Xenomai) kernel froze after around 50-60 minutes of uptime.

With the JTAG debugger I could see the kernel still running, but all applications (both text and X via SSH, and console via serial/USB connection) seemed frozen and there was no output indicating what was going on. Of course the Xenomai kernel was the first suspect. But that proved to be a mistake.

With hindsight, knowing the cause of the freeze now, I wonder why I never got the NFS connection time-out message on the console, but for some reason or other it isn't generated in this case.

The underlying problem is that the Beaglebone has no battery-backed real-time clock. This only leads to a serious problem (a freeze) when all of the following apply:
(1) a network-mounted NFS root filesystem,
(2) an initial kernel time lying in the past, and
(3) a DHCP lease time shorter than some multiple (in this case 2x) of the required system uptime.

Ubuntu (and maybe Debian too) systems are obviously not designed to start with a completely wrong real-time clock value. And dhclient (like many other programs) is not designed to handle the large time step that is generated once the clock is set properly sometime during the boot process.

Note that if the filesystem is on local storage (e.g. flash or hard disk), there will only be a short disruption of the network connection and it's likely that the problem won't be noticed at all.

A final solution hasn't been found yet: I prefer a workaround that doesn't change dhclient or some other standard program. I think it would suffice to acquire a new lease right after the time step has been made. This has to be done without giving up the previous lease (which has expired because of the time step), because that would cause the system to freeze again.

Suggestions on how to do this are welcome. I can't spend much more time on this issue this week.

-- Bas
___
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai
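To make the effect of the time step on the lease concrete, here is a minimal Python sketch. It is only a toy model and not dhclient's actual logic: the Lease class, its field names and all the numbers are assumptions chosen for illustration. It compares lease bookkeeping based on wall-clock time with bookkeeping based on uptime when the clock jumps from 1970 to 2013.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Lease:
    """Hypothetical lease record; not a dhclient data structure."""
    obtained_wall: float  # wall-clock seconds since the epoch at grant time
    obtained_mono: float  # uptime in seconds at grant time
    duration: float       # lease length in seconds

    def expired_by_wall_clock(self, now_wall: float) -> bool:
        # Bookkeeping that trusts the wall clock.
        return now_wall >= self.obtained_wall + self.duration

    def expired_by_uptime(self, now_mono: float) -> bool:
        # Bookkeeping that only trusts elapsed uptime.
        return now_mono >= self.obtained_mono + self.duration


# Assumed scenario: the board boots at the epoch (no RTC) and is granted
# a one-hour lease 20 seconds after boot.
lease = Lease(obtained_wall=20.0, obtained_mono=20.0, duration=3600.0)

# 60 seconds after boot the wall clock is stepped to 21 January 2013.
uptime = 60.0
stepped_wall = datetime(2013, 1, 21, tzinfo=timezone.utc).timestamp()

print("expired by wall clock:", lease.expired_by_wall_clock(stepped_wall))
print("expired by uptime:   ", lease.expired_by_uptime(uptime))
```

In this model the lease looks expired as soon as the clock is stepped, although only 40 seconds of it have really been used. That matches the observation above that the previous lease "has expired because of the time-step".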
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 21.01.2013 at 20:10, Gilles Chanteperdrix wrote:
> On 01/21/2013 02:32 PM, Michael Haberler wrote:
>> On 21.01.2013 at 12:56, Gilles Chanteperdrix wrote:
>>>> question: does a RTC time warp have any possible bearing on Xenomai operations?
>>>
>>> No, it should not, Xenomai uses its own clock, which is set only
>>> once upon boot, so, is unaffected by Linux wallclock time
>>> changes... or should be.
>>
>> it might not be Xenomai after all. Uhum.
>>
>> the bughunt safari tribe has decided to focus on class 'duh' problems
>> and resolves to shut up until red hands are spotted.
>
> I would still put the check in the timer "set_next_event" callback, just
> in case...

I assume Bas will give the postmortem shortly - he nailed the issue; the RTC boot timewarp makes for a lost DHCP lease midflight and NFS freezing, making it look like a kernel hang.

relieved,

- Michael
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 01/21/2013 02:32 PM, Michael Haberler wrote:
> On 21.01.2013 at 12:56, Gilles Chanteperdrix wrote:
>> On 01/21/2013 12:43 PM, Michael Haberler wrote:
>>> the suspicion now turned to the DHCP lease setting and RTC time
>>> warp issues - the Beaglebone doesn't have an RTC so it starts up
>>> at 1-1-1970
>>>
>>> the first DHCP lease still has 1970 timestamps, but eventually
>>> the RTC is set with ntpdate and it could be this causes
>>> confusion
>>>
>>> the thing which is hard to believe for me: loss of IP
>>> connectivity - conceivable; kernel hang - why?
>>>
>>> question: does a RTC time warp have any possible bearing on
>>> Xenomai operations?
>>
>> No, it should not, Xenomai uses its own clock, which is set only
>> once upon boot, so, is unaffected by Linux wallclock time
>> changes... or should be.
>
> it might not be Xenomai after all. Uhum.
>
> the bughunt safari tribe has decided to focus on class 'duh' problems
> and resolves to shut up until red hands are spotted.

I would still put the check in the timer "set_next_event" callback, just in case...

--
Gilles.
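The check suggested here (and spelled out earlier in the thread: raise min_delay_ticks, and return -ETIME if the counter already reads 0 after programming) can be modelled in a few lines of Python. This is a toy model of the logic only, not kernel or I-pipe code: the OneShotTimer class, its program_latency_ticks parameter and the function signature are assumptions made up for this sketch.

```python
import errno


class OneShotTimer:
    """Toy down-counter; stands in for a hardware one-shot timer."""

    def __init__(self, program_latency_ticks):
        # Assumed number of ticks that pass between loading the counter
        # and reading it back.
        self.program_latency_ticks = program_latency_ticks
        self.counter = 0

    def load(self, ticks):
        self.counter = ticks

    def read(self):
        # By the time the counter is read back, some ticks have elapsed.
        self.counter = max(0, self.counter - self.program_latency_ticks)
        return self.counter


def set_next_event(timer, delay_ticks, min_delay_ticks=0):
    """Program the timer; report -ETIME if the delay was too short."""
    # First remedy: never program less than the minimum delay.
    delay_ticks = max(delay_ticks, min_delay_ticks)
    timer.load(delay_ticks)
    # Second remedy: if the counter already reads 0, the event may have
    # been missed, so tell the caller instead of waiting forever.
    if timer.read() == 0:
        return -errno.ETIME
    return 0


timer = OneShotTimer(program_latency_ticks=3)
print(set_next_event(timer, 2))                      # too short: negative
print(set_next_event(timer, 2, min_delay_ticks=10))  # clamped: 0
```

The point of the model is that a delay shorter than the programming latency is reported to the caller rather than silently leaving the timer stopped.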
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 21.01.2013 at 12:56, Gilles Chanteperdrix wrote:
> On 01/21/2013 12:43 PM, Michael Haberler wrote:
>> the suspicion now turned to the DHCP lease setting and RTC time warp
>> issues - the Beaglebone doesn't have an RTC so it starts up at
>> 1-1-1970
>>
>> the first DHCP lease still has 1970 timestamps, but eventually the
>> RTC is set with ntpdate and it could be this causes confusion
>>
>> the thing which is hard to believe for me: loss of IP connectivity -
>> conceivable; kernel hang - why?
>>
>> question: does a RTC time warp have any possible bearing on Xenomai
>> operations?
>
> No, it should not, Xenomai uses its own clock, which is set only once
> upon boot, so, is unaffected by Linux wallclock time changes... or
> should be.

it might not be Xenomai after all. Uhum.

the bughunt safari tribe has decided to focus on class 'duh' problems and resolves to shut up until red hands are spotted.

--
btw the upgrade to the ipipe patch in master made all xeno-regression-test problems go away - thanks!

-Michael
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 01/21/2013 12:43 PM, Michael Haberler wrote:
> the suspicion now turned to the DHCP lease setting and RTC time warp
> issues - the Beaglebone doesn't have an RTC so it starts up at
> 1-1-1970
>
> the first DHCP lease still has 1970 timestamps, but eventually the
> RTC is set with ntpdate and it could be this causes confusion
>
> the thing which is hard to believe for me: loss of IP connectivity -
> conceivable; kernel hang - why?
>
> question: does a RTC time warp have any possible bearing on Xenomai
> operations?

No, it should not, Xenomai uses its own clock, which is set only once upon boot, so, is unaffected by Linux wallclock time changes... or should be.

--
Gilles.
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
the suspicion now turned to the DHCP lease setting and RTC time warp issues - the Beaglebone doesn't have an RTC so it starts up at 1-1-1970

the first DHCP lease still has 1970 timestamps, but eventually the RTC is set with ntpdate and it could be this causes confusion

the thing which is hard to believe for me: loss of IP connectivity - conceivable; kernel hang - why?

question: does a RTC time warp have any possible bearing on Xenomai operations?

- Michael
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 01/19/2013 03:09 PM, Michael Haberler wrote:
> On 19.01.2013 at 14:29, Gilles Chanteperdrix wrote:
>> On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
>>> On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
>>>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>>>> On 16.01.2013 at 17:45, Bas Laarhoven wrote:
>>>>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>>>>> ARM work:
>>>>>>>>
>>>>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>>>>
>>>>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed for development; there might be some patches coming from more active BB users than me.
>>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> Are you saying you haven't seen these 'sudden stops' yourself?
>>>>>>
>>>>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I don't have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>>>>>
>>>>> Beginner's error! :-P The power supply is indeed critical, but the stepdown converter on my BeBoPr is dimensioned for at least 2A and hasn't failed me yet.
>>>>>
>>>>> I think that running linuxcnc is mandatory for the lockup. After a dozen runs, it looks like I can reproduce the lockup with 100% certainty within one hour.
>>>>> Using the JTAG interface to attach a debugger to the Bone, I've found that once stalled the kernel is still running. It looks like it won't schedule properly and almost all time is spent in the cpu_idle thread.
>>>>
>>>> This is typical of a tsc emulation or timer issue. On a system without anything running, please let the "tsc -w" command run. It will take some time to run (the wrap time of the hardware timer used for tsc emulation), if it runs correctly, then you need to check whether the timer is still running when the bug happens (cat /proc/xenomai/irq should continue increasing when for instance the latency test is running). If the timer is stopped, it may have been programmed for a too short delay, to avoid that, you can try:
>>>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses a value corresponding to the min_delta_ns member in the clockevent structure);
>>>> - checking after programming the timer (in the set_next_event method) if the timer counter is already 0, in which case you can return a negative value, usually -ETIME.
>>>
>>> Hi Gilles,
>>>
>>> Thanks for the swift reply.
>>>
>>> As far as I can see, tsc -w runs without an error:
>>>
>>> ARM: counter wrap time: 179 seconds
>>> Checking tsc for 6 minute(s)
>>> min: 5, max: 12, avg: 5.04168
>>> ...
>>> min: 5, max: 6, avg: 5.03771
>>> min: 5, max: 28, avg: 5.03989 -> 0.209995 us
>>>
>>> real    6m0.284s
>>>
>>> I've also done the other regression tests and all were successful.
>>>
>>> Problem is that once the bug happens I won't be able to issue the cat command.
>>> I've fixed my debug setup so I don't have to use the System.map to manually translate the debugger addresses : /
>>> Now I'm waiting for another lockup to see what's happening.
>>
>> You may want to have a look at the xeno-regression-test script to put
>> your system under pressure (and likely generate the lockup faster).
>
> running tsc -w and xeno-regression-test in parallel I get errors like so (not
> on every run; no lockup so far):

At this point we know that you do not have any issue with tsc emulation, so running tsc -w in parallel is useless. The point of running xeno-regression-test is to reach the "switchtest + switchtest -s + latency + ltp" point, where the system will be put under stress, and will be more likely to trigger a timer issue if there is one.

So, if the tests before that do not pass, simply comment them out in xeno-regression-test (xeno-regression-test is a shell script).

Also note that if you are running a thumb2 user-space or running the kernel with CONFIG_THUMB2_KERNEL (which the segfault in sigdebug suggests) on a processor with a Cortex-A8 core, you need to enable CONFIG_ERRATA_430973, otherwise you will get random faults due to the corresponding processor erratum.

--
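The "is the timer still running" check quoted above (the counts in /proc/xenomai/irq should keep increasing) comes down to comparing two snapshots of that file. A minimal Python sketch follows. The snapshot text is made up for illustration and the real layout of /proc/xenomai/irq may differ; the function names are hypothetical as well.

```python
import re


def read_counts(text):
    """Map IRQ number to its first count, for lines like ' 68:  1234 ...'."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+):\s+(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


def advancing_irqs(before, after):
    """Return the IRQ numbers whose count grew between two snapshots."""
    old, new = read_counts(before), read_counts(after)
    return sorted(irq for irq in new if new[irq] > old.get(irq, 0))


# Made-up snapshots (assumed format), taken a moment apart.
SNAP_1 = """IRQ         CPU0
 68:       103042         [timer]
1027:           7         [virtual]
"""
SNAP_2 = """IRQ         CPU0
 68:       103042         [timer]
1027:           9         [virtual]
"""

# In this invented example IRQ 68 is absent from the result because its
# count did not move, which is what a stopped timer would look like.
print(advancing_irqs(SNAP_1, SNAP_2))
```

On a live system the two strings would come from reading the file twice with a short sleep in between, while something like the latency test is running.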
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 01/19/2013 03:14 PM, Michael Haberler wrote:
> that was xenomai 2.6.1 as per release tag in the git repo; the rest as
> outlined here:
> http://www.xenomai.org/pipermail/xenomai/2013-January/027164.html

Please upgrade to xenomai master. You are hitting bugs which have already been fixed since 2.6.1.

> [502738.607343] switchtest: page allocation failure: order:4, mode:0xd0

That is an allocation failure. I am afraid you can run xeno-regression-test only once after the system boot (it is supposed to run for several hours anyway).

--
Gilles.
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
Am 19.01.2013 um 15:10 schrieb Gilles Chanteperdrix: > On 01/19/2013 03:09 PM, Michael Haberler wrote: > >> >> Am 19.01.2013 um 14:29 schrieb Gilles Chanteperdrix: >> >>> On 01/17/2013 02:30 PM, Bas Laarhoven wrote: >>> On 17-1-2013 9:53, Gilles Chanteperdrix wrote: > On 01/17/2013 08:59 AM, Bas Laarhoven wrote: > >> On 16-1-2013 20:36, Michael Haberler wrote: >>> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven: >>> On 16-1-2013 15:15, Michael Haberler wrote: > ARM work: > > Several people have been able to get the Beaglebone ubuntu/xenomai > setup working as outlined here: > http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup > I have updated the kernel and rootfs image a few days ago so the > kernel includes ext2/3/4 support compiled in, which should take care > of two failure reports I got. > > Again that xenomai kernel is based on 3.2.21; it works very stable > for me but there have been several reports of 'sudden stops'. The BB > is a bit sensitive to power fluctuations but it might be more than > that. As for that kernel, it works, but it is based on a branch which > will see no further development. It supports most of the stuff needed > to development; there might be some patches coming from more active > BB users than me. Hi Michael, Are you saying you don't have seen these 'sudden stops' yourself? >>> No, never, after swapping to stronger power supplies; I have two of >>> these boards running over NFS all the time. I dont have Linuxcnc >>> running on them though, I'll do that and see if that changes the >>> picture. Maybe keeping the torture test running helps trigger it. >> Beginners error! :-P The power supply is indeed critical, but the >> stepdown converter on my BeBoPr is dimensioned for at least 2A and >> hasn't failed me yet. >> >> I think that running linuxcnc is mandatory for the lockup. After a dozen >> runs, it looks like I can reproduce the lockup with 100% certainty >> within one hour. 
>> Using the JTAG interface to attach a debugger to the Bone, I've found >> that once stalled the kernel is still running. It looks like it won't >> schedule properly and almost all time is spent in the cpu_idle thread. > > This is typical of a tsc emulation or timer issue. On a system without > anything running, please let the "tsc -w" command run. It will take some > time to run (the wrap time of the hardware timer used for tsc > emulation), if it runs correctly, then you need to check whether the > timer is still running when the bug happens (cat /proc/xenomai/irq > should continue increasing when for instance the latency test is > running). If the timer is stopped, it may have been programmed for a too > short delay, to avoid that, you can try: > - increasing the ipipe_timer min_delay_ticks member (by default, it uses > a value corresponding to the min_delta_ns member in the clockevent > structure); > - checking after programming the timer (in the set_next_event method) if > the timer counter is already 0, in which case you can return a negative > value, usually -ETIME. > Hi Gilles, Thanks for the swift reply. As far as I can see, tsc -w runs without an error: ARM: counter wrap time: 179 seconds Checking tsc for 6 minute(s) min: 5, max: 12, avg: 5.04168 ... min: 5, max: 6, avg: 5.03771 min: 5, max: 28, avg: 5.03989 -> 0.209995 us real6m0.284s I've also done the other regression tests and all were successful. Problem is that once the bug happens I won't be able to issue the cat command. I've fixed my debug setup so I don't have to use the System.map to manually translate the debugger addresses : / Now I'm waiting for another lockup to see what's happening. >>> >>> >>> You may want to have a look at the xeno-regression-test script to put >>> your system under pressure (and likely generate the lockup faster). 
>> >> running tsc -w and xeno-regression-test in parallel I get errors like so >> (not on every run; no lockup so far): >> >> ++ /usr/xenomai/bin/mutex-torture-native >> simple_wait >> recursive_wait >> timed_mutex >> mode_switch >> pi_wait >> lock_stealing >> NOTE: lock_stealing mutex_trylock: not supported >> deny_stealing >> simple_condwait >> recursive_condwait >> auto_switchback >> FAILURE: current prio (0) != expected prio (2) >> >> dmesg >> [501963.390598] Xenomai: native: cleaning up mutex "" (ret=0). >> [502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc >> >> on another run, I got a segfault while running sigdebug: >> ++ /usr/xenomai/bin/regression/native/sigdebug >> mayday page starting at 0x4
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 01/19/2013 03:09 PM, Michael Haberler wrote: > > Am 19.01.2013 um 14:29 schrieb Gilles Chanteperdrix: > >> On 01/17/2013 02:30 PM, Bas Laarhoven wrote: >> >>> On 17-1-2013 9:53, Gilles Chanteperdrix wrote: On 01/17/2013 08:59 AM, Bas Laarhoven wrote: > On 16-1-2013 20:36, Michael Haberler wrote: >> Am 16.01.2013 um 17:45 schrieb Bas Laarhoven: >> >>> On 16-1-2013 15:15, Michael Haberler wrote: ARM work: Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got. Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed to development; there might be some patches coming from more active BB users than me. >>> Hi Michael, >>> >>> Are you saying you don't have seen these 'sudden stops' yourself? >> No, never, after swapping to stronger power supplies; I have two of >> these boards running over NFS all the time. I dont have Linuxcnc running >> on them though, I'll do that and see if that changes the picture. Maybe >> keeping the torture test running helps trigger it. > Beginners error! :-P The power supply is indeed critical, but the > stepdown converter on my BeBoPr is dimensioned for at least 2A and > hasn't failed me yet. > > I think that running linuxcnc is mandatory for the lockup. After a dozen > runs, it looks like I can reproduce the lockup with 100% certainty > within one hour. > Using the JTAG interface to attach a debugger to the Bone, I've found > that once stalled the kernel is still running. 
It looks like it won't > schedule properly and almost all time is spent in the cpu_idle thread. This is typical of a tsc emulation or timer issue. On a system without anything running, please let the "tsc -w" command run. It will take some time to run (the wrap time of the hardware timer used for tsc emulation), if it runs correctly, then you need to check whether the timer is still running when the bug happens (cat /proc/xenomai/irq should continue increasing when for instance the latency test is running). If the timer is stopped, it may have been programmed for a too short delay, to avoid that, you can try: - increasing the ipipe_timer min_delay_ticks member (by default, it uses a value corresponding to the min_delta_ns member in the clockevent structure); - checking after programming the timer (in the set_next_event method) if the timer counter is already 0, in which case you can return a negative value, usually -ETIME. >>> >>> Hi Gilles, >>> >>> Thanks for the swift reply. >>> >>> As far as I can see, tsc -w runs without an error: >>> >>> ARM: counter wrap time: 179 seconds >>> Checking tsc for 6 minute(s) >>> min: 5, max: 12, avg: 5.04168 >>> ... >>> min: 5, max: 6, avg: 5.03771 >>> min: 5, max: 28, avg: 5.03989 -> 0.209995 us >>> >>> real6m0.284s >>> >>> I've also done the other regression tests and all were successful. >>> >>> Problem is that once the bug happens I won't be able to issue the cat >>> command. >>> I've fixed my debug setup so I don't have to use the System.map to >>> manually translate the debugger addresses : / >>> Now I'm waiting for another lockup to see what's happening. >> >> >> You may want to have a look at the xeno-regression-test script to put >> your system under pressure (and likely generate the lockup faster). 
> > running tsc -w and xeno-regression-test in parallel I get errors like so (not > on every run; no lockup so far): > > ++ /usr/xenomai/bin/mutex-torture-native > simple_wait > recursive_wait > timed_mutex > mode_switch > pi_wait > lock_stealing > NOTE: lock_stealing mutex_trylock: not supported > deny_stealing > simple_condwait > recursive_condwait > auto_switchback > FAILURE: current prio (0) != expected prio (2) > > dmesg > [501963.390598] Xenomai: native: cleaning up mutex "" (ret=0). > [502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc > > on another run, I got a segfault while running sigdebug: > ++ /usr/xenomai/bin/regression/native/sigdebug > mayday page starting at 0x400eb000 [/dev/rtheap] > mayday code: 0c 00 9f e5 0c 70 9f e5 00 00 00 ef 00 00 a0 e3 00 00 80 e5 2b > 02 00 0a 42 00 0f 00 db d7 ee b8 > mlockall > syscall > signal > relaxed mutex owner >
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 19.01.2013 at 14:29, Gilles Chanteperdrix wrote:
> On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
>> On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
>>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>>> On 16.01.2013 at 17:45, Bas Laarhoven wrote:
>>>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>>>> ARM work:
>>>>>>>
>>>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>>>
>>>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed for development; there might be some patches coming from more active BB users than me.
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> Are you saying you haven't seen these 'sudden stops' yourself?
>>>>>
>>>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I don't have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>>>>
>>>> Beginner's error! :-P The power supply is indeed critical, but the stepdown converter on my BeBoPr is dimensioned for at least 2A and hasn't failed me yet.
>>>>
>>>> I think that running linuxcnc is mandatory for the lockup. After a dozen runs, it looks like I can reproduce the lockup with 100% certainty within one hour.
>>>> Using the JTAG interface to attach a debugger to the Bone, I've found that once stalled the kernel is still running. It looks like it won't schedule properly and almost all time is spent in the cpu_idle thread.
>>>
>>> This is typical of a tsc emulation or timer issue. On a system without
>>> anything running, please let the "tsc -w" command run. It will take some
>>> time to run (the wrap time of the hardware timer used for tsc
>>> emulation), if it runs correctly, then you need to check whether the
>>> timer is still running when the bug happens (cat /proc/xenomai/irq
>>> should continue increasing when for instance the latency test is
>>> running). If the timer is stopped, it may have been programmed for a too
>>> short delay, to avoid that, you can try:
>>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>>> a value corresponding to the min_delta_ns member in the clockevent
>>> structure);
>>> - checking after programming the timer (in the set_next_event method) if
>>> the timer counter is already 0, in which case you can return a negative
>>> value, usually -ETIME.
>>
>> Hi Gilles,
>>
>> Thanks for the swift reply.
>>
>> As far as I can see, tsc -w runs without an error:
>>
>> ARM: counter wrap time: 179 seconds
>> Checking tsc for 6 minute(s)
>> min: 5, max: 12, avg: 5.04168
>> ...
>> min: 5, max: 6, avg: 5.03771
>> min: 5, max: 28, avg: 5.03989 -> 0.209995 us
>>
>> real    6m0.284s
>>
>> I've also done the other regression tests and all were successful.
>>
>> Problem is that once the bug happens I won't be able to issue the cat
>> command.
>> I've fixed my debug setup so I don't have to use the System.map to
>> manually translate the debugger addresses : /
>> Now I'm waiting for another lockup to see what's happening.
>
> You may want to have a look at the xeno-regression-test script to put
> your system under pressure (and likely generate the lockup faster).

running tsc -w and xeno-regression-test in parallel I get errors like so (not on every run; no lockup so far):

++ /usr/xenomai/bin/mutex-torture-native
simple_wait
recursive_wait
timed_mutex
mode_switch
pi_wait
lock_stealing
NOTE: lock_stealing mutex_trylock: not supported
deny_stealing
simple_condwait
recursive_condwait
auto_switchback
FAILURE: current prio (0) != expected prio (2)

dmesg
[501963.390598] Xenomai: native: cleaning up mutex "" (ret=0).
[502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc

on another run, I got a segfault while running sigdebug:

++ /usr/xenomai/bin/regression/native/sigdebug
mayday page starting at 0x400eb000 [/dev/rtheap]
mayday code: 0c 00 9f e5 0c 70 9f e5 00 00 00 ef 00 00 a0 e3 00 00 80 e5 2b 02 00 0a 42 00 0f 00 db d7 ee b8
mlockall
syscall
signal
relaxed mutex owner
page fault
watchdog
./xeno-regression-test: line 53: 4210 Segmentation fault /usr/xenomai/bin/regression/native/sigdebug
root@bb1:/usr/xenomai/bin# dmesg
[502442.312996] Xenomai
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
> On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>> On 16.01.2013 at 17:45, Bas Laarhoven wrote:
>>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>>> ARM work:
>>>>>>
>>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai setup working as outlined here: http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>>> I have updated the kernel and rootfs image a few days ago so the kernel includes ext2/3/4 support compiled in, which should take care of two failure reports I got.
>>>>>>
>>>>>> Again that xenomai kernel is based on 3.2.21; it works very stable for me but there have been several reports of 'sudden stops'. The BB is a bit sensitive to power fluctuations but it might be more than that. As for that kernel, it works, but it is based on a branch which will see no further development. It supports most of the stuff needed for development; there might be some patches coming from more active BB users than me.
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Are you saying you haven't seen these 'sudden stops' yourself?
>>>>
>>>> No, never, after swapping to stronger power supplies; I have two of these boards running over NFS all the time. I don't have Linuxcnc running on them though, I'll do that and see if that changes the picture. Maybe keeping the torture test running helps trigger it.
>>>
>>> Beginner's error! :-P The power supply is indeed critical, but the
>>> stepdown converter on my BeBoPr is dimensioned for at least 2A and
>>> hasn't failed me yet.
>>>
>>> I think that running linuxcnc is mandatory for the lockup. After a dozen
>>> runs, it looks like I can reproduce the lockup with 100% certainty
>>> within one hour.
>>> Using the JTAG interface to attach a debugger to the Bone, I've found
>>> that once stalled the kernel is still running. It looks like it won't
>>> schedule properly and almost all time is spent in the cpu_idle thread.
>>
>> This is typical of a tsc emulation or timer issue. On a system without
>> anything running, please let the "tsc -w" command run. It will take some
>> time to run (the wrap time of the hardware timer used for tsc
>> emulation), if it runs correctly, then you need to check whether the
>> timer is still running when the bug happens (cat /proc/xenomai/irq
>> should continue increasing when for instance the latency test is
>> running). If the timer is stopped, it may have been programmed for a too
>> short delay, to avoid that, you can try:
>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>> a value corresponding to the min_delta_ns member in the clockevent
>> structure);
>> - checking after programming the timer (in the set_next_event method) if
>> the timer counter is already 0, in which case you can return a negative
>> value, usually -ETIME.
>
> Hi Gilles,
>
> Thanks for the swift reply.
>
> As far as I can see, tsc -w runs without an error:
>
> ARM: counter wrap time: 179 seconds
> Checking tsc for 6 minute(s)
> min: 5, max: 12, avg: 5.04168
> ...
> min: 5, max: 6, avg: 5.03771
> min: 5, max: 28, avg: 5.03989 -> 0.209995 us
>
> real    6m0.284s
>
> I've also done the other regression tests and all were successful.
>
> Problem is that once the bug happens I won't be able to issue the cat
> command.
> I've fixed my debug setup so I don't have to use the System.map to
> manually translate the debugger addresses : /
> Now I'm waiting for another lockup to see what's happening.

You may want to have a look at the xeno-regression-test script to put your system under pressure (and likely generate the lockup faster).

--
Gilles.
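As a cross-check of the "tsc -w" figures quoted above, the two reported numbers are consistent with each other. The arithmetic below is only an inference from that output: the 24 MHz rate is derived from the last "avg" line, and the 32-bit counter width is an assumption, not something stated in the thread.

```python
# Values copied from the quoted "tsc -w" output.
avg_ticks = 5.03989        # average cost of one tsc read, in timer ticks
avg_microseconds = 0.209995
reported_wrap_seconds = 179

# Ticks per microsecond is the counter frequency in MHz.
freq_mhz = avg_ticks / avg_microseconds

# Assumption: the hardware counter is 32 bits wide.
COUNTER_BITS = 32
wrap_seconds = 2 ** COUNTER_BITS / (freq_mhz * 1e6)

print(f"inferred counter frequency: {freq_mhz:.3f} MHz")
print(f"computed wrap time:         {wrap_seconds:.2f} s")
print(f"reported wrap time:         {reported_wrap_seconds} s")
```

Under that assumption the inferred frequency comes out very close to 24 MHz and the computed wrap time rounds to the reported 179 seconds, which supports the conclusion that the tsc emulation itself behaves as expected.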
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>> On 16-1-2013 20:36, Michael Haberler wrote:
>>> On 16.01.2013 at 17:45, Bas Laarhoven wrote:
>>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>>> ARM work:
>>>>>
>>>>> Several people have been able to get the Beaglebone ubuntu/xenomai
>>>>> setup working as outlined here:
>>>>> http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>>> I updated the kernel and rootfs image a few days ago so the kernel
>>>>> has ext2/3/4 support compiled in, which should take care of two
>>>>> failure reports I got.
>>>>>
>>>>> Again, that xenomai kernel is based on 3.2.21; it is very stable for
>>>>> me but there have been several reports of 'sudden stops'. The BB is a
>>>>> bit sensitive to power fluctuations but it might be more than that. As
>>>>> for that kernel, it works, but it is based on a branch which will see
>>>>> no further development. It supports most of the stuff needed for
>>>>> development; there might be some patches coming from more active BB
>>>>> users than me.
>>>> Hi Michael,
>>>>
>>>> Are you saying you haven't seen these 'sudden stops' yourself?
>>> No, never, after swapping to stronger power supplies; I have two of
>>> these boards running over NFS all the time. I don't have Linuxcnc
>>> running on them though; I'll do that and see if that changes the
>>> picture. Maybe keeping the torture test running helps trigger it.
>> Beginner's error! :-P The power supply is indeed critical, but the
>> stepdown converter on my BeBoPr is dimensioned for at least 2A and
>> hasn't failed me yet.
>>
>> I think that running linuxcnc is mandatory for the lockup. After a dozen
>> runs, it looks like I can reproduce the lockup with 100% certainty
>> within one hour.
>> Using the JTAG interface to attach a debugger to the Bone, I've found
>> that once stalled the kernel is still running. It looks like it won't
>> schedule properly and almost all time is spent in the cpu_idle thread.
>
> This is typical of a tsc emulation or timer issue. On a system without
> anything running, please let the "tsc -w" command run. It will take some
> time to run (the wrap time of the hardware timer used for tsc
> emulation). If it runs correctly, then you need to check whether the
> timer is still running when the bug happens (the counts shown by
> cat /proc/xenomai/irq should keep increasing when, for instance, the
> latency test is running). If the timer is stopped, it may have been
> programmed for too short a delay. To avoid that, you can try:
> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
> a value corresponding to the min_delta_ns member in the clockevent
> structure);
> - checking after programming the timer (in the set_next_event method) if
> the timer counter is already 0, in which case you can return a negative
> value, usually -ETIME.

Hi Gilles,

Thanks for the swift reply.

As far as I can see, tsc -w runs without an error:

ARM: counter wrap time: 179 seconds
Checking tsc for 6 minute(s)
min: 5, max: 12, avg: 5.04168
...
min: 5, max: 6, avg: 5.03771
min: 5, max: 28, avg: 5.03989 -> 0.209995 us

real 6m0.284s

I've also done the other regression tests and all were successful.

The problem is that once the bug happens I won't be able to issue the
cat command.
I've fixed my debug setup so I don't have to use the System.map to
manually translate the debugger addresses : /
Now I'm waiting for another lockup to see what's happening.

-- Bas
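A quick consistency check on the tsc -w figures reported above, as a minimal sketch. It assumes the emulated tsc is backed by a 32-bit free-running counter; the thread does not state the counter width, so treat that as an assumption. Under it, the reported ticks-to-microseconds conversion implies a counter clock of about 24 MHz, and that frequency reproduces the reported wrap time of roughly 179 seconds.

```python
# Figures copied from the "tsc -w" output quoted above.
avg_ticks = 5.03989      # average cost of one tsc read, in counter ticks
avg_us = 0.209995        # the same cost, in microseconds
reported_wrap_s = 179    # "counter wrap time: 179 seconds"

# ASSUMPTION (not stated in the thread): the counter is 32 bits wide.
COUNTER_BITS = 32


def implied_frequency_hz(ticks: float, microseconds: float) -> float:
    """Counter frequency implied by a ticks-to-microseconds conversion."""
    return ticks / (microseconds * 1e-6)


def wrap_time_s(freq_hz: float, bits: int = COUNTER_BITS) -> float:
    """Seconds until a free-running counter of the given width wraps."""
    return (2 ** bits) / freq_hz


freq = implied_frequency_hz(avg_ticks, avg_us)
wrap = wrap_time_s(freq)
print(f"implied counter frequency: {freq / 1e6:.2f} MHz")
print(f"computed wrap time: {wrap:.1f} s (reported: {reported_wrap_s} s)")
```

If the two numbers agree, the tsc emulation is at least internally consistent over one wrap period, which matches the observation that tsc -w ran for six minutes (two wraps) without an error.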
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 01/17/2013 12:34 PM, Michael Haberler wrote:
> Gilles,
>
> On 17.01.2013 at 09:53, Gilles Chanteperdrix wrote:
>
>> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>>
>>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>>> Are you saying you don't have seen these 'sudden stops' yourself?
>
> ...
>> This is typical of a tsc emulation or timer issue. On a system without
>> anything running, please let the "tsc -w" command run. It will take some
>> time to run (the wrap time of the hardware timer used for tsc
>> emulation), if it runs correctly, then you need to check whether the
>> timer is still running when the bug happens (cat /proc/xenomai/irq
>> should continue increasing when for instance the latency test is
>> running). If the timer is stopped, it may have been programmed for a too
>> short delay, to avoid that, you can try:
>> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
>> a value corresponding to the min_delta_ns member in the clockevent
>> structure);
>> - checking after programming the timer (in the set_next_event method) if
>> the timer counter is already 0, in which case you can return a negative
>> value, usually -ETIME.

Actually, the hardware counter will be 0x when the timer has reached
delay.

--
Gilles.
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
Gilles,

On 17.01.2013 at 09:53, Gilles Chanteperdrix wrote:

> On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
>
>> On 16-1-2013 20:36, Michael Haberler wrote:
>>>> Are you saying you haven't seen these 'sudden stops' yourself?

...
> This is typical of a tsc emulation or timer issue. On a system without
> anything running, please let the "tsc -w" command run. It will take some
> time to run (the wrap time of the hardware timer used for tsc
> emulation). If it runs correctly, then you need to check whether the
> timer is still running when the bug happens (the counts shown by
> cat /proc/xenomai/irq should keep increasing when, for instance, the
> latency test is running). If the timer is stopped, it may have been
> programmed for too short a delay. To avoid that, you can try:
> - increasing the ipipe_timer min_delay_ticks member (by default, it uses
> a value corresponding to the min_delta_ns member in the clockevent
> structure);
> - checking after programming the timer (in the set_next_event method) if
> the timer counter is already 0, in which case you can return a negative
> value, usually -ETIME.

Thanks for that most valuable hint. The bughunt safari is on, debuggers
loaded and JTAGs armed ;)

- Michael

>
> --
> Gilles.
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
> On 16-1-2013 20:36, Michael Haberler wrote:
>> On 16.01.2013 at 17:45, Bas Laarhoven wrote:
>>
>>> On 16-1-2013 15:15, Michael Haberler wrote:
>>>> ARM work:
>>>>
>>>> Several people have been able to get the Beaglebone ubuntu/xenomai
>>>> setup working as outlined here:
>>>> http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
>>>> I updated the kernel and rootfs image a few days ago so the kernel
>>>> has ext2/3/4 support compiled in, which should take care of two
>>>> failure reports I got.
>>>>
>>>> Again, that xenomai kernel is based on 3.2.21; it is very stable for
>>>> me but there have been several reports of 'sudden stops'. The BB is a
>>>> bit sensitive to power fluctuations but it might be more than that. As
>>>> for that kernel, it works, but it is based on a branch which will see
>>>> no further development. It supports most of the stuff needed for
>>>> development; there might be some patches coming from more active BB
>>>> users than me.
>>> Hi Michael,
>>>
>>> Are you saying you haven't seen these 'sudden stops' yourself?
>> No, never, after swapping to stronger power supplies; I have two of these
>> boards running over NFS all the time. I don't have Linuxcnc running on
>> them though; I'll do that and see if that changes the picture. Maybe
>> keeping the torture test running helps trigger it.
>
> Beginner's error! :-P The power supply is indeed critical, but the
> stepdown converter on my BeBoPr is dimensioned for at least 2A and
> hasn't failed me yet.
>
> I think that running linuxcnc is mandatory for the lockup. After a dozen
> runs, it looks like I can reproduce the lockup with 100% certainty
> within one hour.
> Using the JTAG interface to attach a debugger to the Bone, I've found
> that once stalled the kernel is still running. It looks like it won't
> schedule properly and almost all time is spent in the cpu_idle thread.

This is typical of a tsc emulation or timer issue. On a system without
anything running, please let the "tsc -w" command run. It will take some
time to run (the wrap time of the hardware timer used for tsc
emulation). If it runs correctly, then you need to check whether the
timer is still running when the bug happens (the counts shown by
cat /proc/xenomai/irq should keep increasing when, for instance, the
latency test is running). If the timer is stopped, it may have been
programmed for too short a delay. To avoid that, you can try:
- increasing the ipipe_timer min_delay_ticks member (by default, it uses
a value corresponding to the min_delta_ns member in the clockevent
structure);
- checking after programming the timer (in the set_next_event method) if
the timer counter is already 0, in which case you can return a negative
value, usually -ETIME.

--
Gilles.
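The "is the timer still running" check suggested above can be sketched as a comparison of two snapshots of /proc/xenomai/irq. This is only an illustration: the line layout assumed here ("<irq>: <count per CPU> <name>") and the sample snapshots are assumptions, not copied from a real system, so verify the format on your own board before relying on it.

```python
import re
from typing import Dict, List

# ASSUMPTION: each data line looks like "<irq>: <count per CPU>... <name>",
# e.g. " 68:   1000   [timer]". Check this against your own /proc/xenomai/irq.
IRQ_LINE = re.compile(r"^\s*(\d+):((?:\s+\d+)+)\s*(.*)$")


def parse_irq_counts(text: str) -> Dict[str, int]:
    """Map "<irq> <name>" labels to the hit count summed over all CPUs."""
    counts = {}
    for line in text.splitlines():
        m = IRQ_LINE.match(line)
        if m:
            irq, per_cpu, name = m.groups()
            label = f"{irq} {name}".strip()
            counts[label] = sum(int(n) for n in per_cpu.split())
    return counts


def stalled(before: str, after: str) -> List[str]:
    """Labels whose count did not increase between two snapshots."""
    old, new = parse_irq_counts(before), parse_irq_counts(after)
    return sorted(label for label in old if new.get(label, 0) <= old[label])


# Hypothetical snapshots taken a few seconds apart while a load is running.
first = """IRQ         CPU0
 68:       1000         [timer]
 72:         10         hypothetical-device
"""
second = """IRQ         CPU0
 68:       1000         [timer]
 72:         12         hypothetical-device
"""
print(stalled(first, second))
```

On a live system the two snapshots would come from reading /proc/xenomai/irq twice with a short sleep in between, while something like the latency test keeps the timer busy. A timer entry that shows up in the stalled list would point towards the "programmed for too short a delay" scenario; the check has to be running before the lockup occurs to be of any use.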
Re: [Xenomai] [Emc-developers] "new RTOS" status: Scheduler (?) lockup on ARM
On 16-1-2013 20:36, Michael Haberler wrote:
On 16.01.2013 at 17:45, Bas Laarhoven wrote:
On 16-1-2013 15:15, Michael Haberler wrote:

ARM work:

Several people have been able to get the Beaglebone ubuntu/xenomai setup
working as outlined here:
http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
I updated the kernel and rootfs image a few days ago so the kernel has
ext2/3/4 support compiled in, which should take care of two failure
reports I got.

Again, that xenomai kernel is based on 3.2.21; it is very stable for me
but there have been several reports of 'sudden stops'. The BB is a bit
sensitive to power fluctuations but it might be more than that. As for
that kernel, it works, but it is based on a branch which will see no
further development. It supports most of the stuff needed for
development; there might be some patches coming from more active BB
users than me.

Hi Michael,

Are you saying you haven't seen these 'sudden stops' yourself?

No, never, after swapping to stronger power supplies; I have two of these
boards running over NFS all the time. I don't have Linuxcnc running on
them though; I'll do that and see if that changes the picture. Maybe
keeping the torture test running helps trigger it.

Beginner's error! :-P The power supply is indeed critical, but the
stepdown converter on my BeBoPr is dimensioned for at least 2A and
hasn't failed me yet.

I think that running linuxcnc is mandatory for the lockup. After a dozen
runs, it looks like I can reproduce the lockup with 100% certainty
within one hour.
Using the JTAG interface to attach a debugger to the Bone, I've found
that once stalled the kernel is still running. It looks like it won't
schedule properly and almost all time is spent in the cpu_idle thread.

The kernel with extra diagnostics produces these messages:

[ 3480.386342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3480.395913] INFO: task axis:799 blocked for more than 120 seconds.
[ 3480.406643] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3600.408670] INFO: task hal_manualtoolc:788 blocked for more than 120 seconds.

On one run I was able to re-issue a command from the command history
before that console froze too.

Since the x86 version seems to have none of these problems, it might be
ARM specific.

Any suggestions on how to proceed? Are other people working on the ARM
version?

I'm also sending this message to the xenomai mailing list as that might
be a better place to resume this thread.

-- Bas

NB there is an ipipe trace option, but that doesn't help if you can't
talk to the damn thing.

My system has frozen within one hour every time. I'm aware of the power
supply issues, but my configuration has _never_ experienced this problem
over at least half a year of (heavy) use.

just to clarify: you get the lockups only with the Xenomai kernel, I
assume? your other option is some Angström kernel or what exactly (isn't
the list of options bewildering ;-?)

So I dare say that isn't the problem, at least not with the lock-ups I'm
seeing.

Currently I'm debugging the kernel to see what's going on. It looks like
the kernel is idling, but the system is completely frozen (blocked, not
scheduling?). I've built a kernel with symbols and a lot of extra debug
options and am waiting for it to stop again right now. It's been running
axis with the demo for almost an hour, the best result up to now...

Do you have an opinion on what would be the best kernel version for
(future) development? Is Xenomai keeping up with the current kernels? Are
the DT kernels usable on the bone or do we have to wait another couple of
months for that?

again it's a question of matching a Xenomai patch version with a stable
base version, and having the itimer support in it - that's what reduces
the range of options.
There are several base versions one could try; the integration towards
mainline is now targeted at 3.8 and it seems the stock kernel has much of
what is needed including PRUSS. It's also possible that the current
Xenomai work for a 3.5.x base results in a match, I need to look into it.
I was advised to 'forward port the ipipe patch myself' but I chickened
out on that one.

summary: I'm pretty sure there is; I am not aware of tangible results.

I will push the two patches I got from Stephan Kappertz and Sheng Chao
Wong, I don't think they are online.

- Michael

-- Bas

Yes! Frozen Bone after 56 minutes uptime : )
Time to start debugging again!

Charles has done some great work for a high-speed stepgen on the
Beaglebone, and a few folks have reproduced that, but I leave the fanfare
to Charles here;)

I have done no further work on the Raspberry, I do not consider that
platform particularly useful to base work on.

RTAI note: I was pointed to this thread recently, which is interesting to
read for several reasons:
https://mail.rtai.org/pipermail/rtai/2012-December/thread.html
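The hung-task messages quoted in the message above carry enough information to bound when each task stopped making progress. The sketch below is a hedged illustration, not part of the original debugging: it extracts the kernel timestamp, task name and pid from those lines and subtracts the 120-second threshold to get the latest possible time the task became blocked. The regular expression is written for the exact message wording shown in the thread.

```python
import re

# Lines copied from the kernel output quoted in the message above.
LOG = """\
[ 3480.395913] INFO: task axis:799 blocked for more than 120 seconds.
[ 3600.408670] INFO: task hal_manualtoolc:788 blocked for more than 120 seconds.
"""

HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<limit>\d+) seconds"
)


def parse_hung_tasks(log: str) -> list:
    """Return one record per hung-task report found in the log text."""
    reports = []
    for m in HUNG_TASK.finditer(log):
        ts = float(m["ts"])          # seconds since boot when reported
        limit = int(m["limit"])      # hung_task timeout in seconds
        reports.append({
            "task": m["name"],
            "pid": int(m["pid"]),
            "reported_at_s": ts,
            # The task was already blocked for at least `limit` seconds.
            "blocked_no_later_than_s": ts - limit,
        })
    return reports


reports = parse_hung_tasks(LOG)
for r in reports:
    minutes = r["blocked_no_later_than_s"] / 60
    print(f"{r['task']} (pid {r['pid']}): blocked no later than "
          f"{minutes:.1f} min after boot")
```

Both tasks turn out to have been blocked from somewhat under an hour of uptime, which is in line with the roughly one-hour time to failure reported in the thread.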