Hi Alex

On 29.07.2015 at 03:19, Alexander Kroh wrote:
> Hi Robert,
>
> Did you end up getting to the bottom of this issue?
>
No, unfortunately I got somewhat overwhelmed with other stuff :-(

> We still have not fixed it... But the good news is that we have finally been
> able to reproduce it!
> We find the same symptoms on the SabreLite board with a different version
> of U-Boot
>
That would support my theory that U-Boot somehow causes the error.
Do you know if there is a way to just clear any stale pending data abort
conditions during startup prior to dropping the mask?

Robert

> - Alex
>
> ________________________________________
> From: Robert Kaiser [robert.kai...@hs-rm.de]
> Sent: Monday, 23 March 2015 08:38
> To: Alexander Kroh; devel@sel4.systems
> Subject: Re: [seL4] Wandboard Port
>
> Hi Alex,
>
> On 22.03.2015 at 07:08, Alexander Kroh wrote:
>> Hi Robert,
>>
>> Thanks for pointing that out, we will fix those hard coded values ASAP.
> Attached is a patch that should do this. Hope it's OK.
>
>>
>> From section B3.13.4 of the ARMv7 manual, in the case of an async abort, the
>> DFAR (fault address) can't be trusted either.
>> It seems like the only way to get to the bottom of this is to use "isb"
>> instructions to force any pending async abort to occur before execution can
>> continue to an arbitrary point.
>> Unless there is a way to turn off the load/store buffers?
>
> It seems to me like the async abort signal is completely bogus in my
> case: It hits immediately as soon as the mask is dropped in CPSR, no
> matter what the CPU is doing. The idle thread that crashed in the last
> test is simply an endless loop that does not load or store any data. All
> it does is opcode fetches. And if the async abort is kept masked, the
> entire testsuite completes successfully. I doubt this would be the case
> if loads or stores really failed (which -after all- is what the async
> abort is supposed to indicate, as far as I understand).
>
> My current hypothesis is that this is due to some weird misconfiguration
> by U-Boot.
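For reference, "the mask" in CPSR is the A bit (bit 8), which on ARMv7-A holds asynchronous aborts pending while it is set. The sketch below only illustrates that bit layout on a host machine. The helper names are made up for this note, and the example value is a hypothetical initial CPSR, not the constant used in the seL4 sources.

```python
# CPSR bit layout on ARMv7-A (architectural bit positions, not seL4 definitions).
CPSR_F = 1 << 6          # FIQ mask
CPSR_I = 1 << 7          # IRQ mask
CPSR_A = 1 << 8          # asynchronous abort mask
CPSR_MODE_MASK = 0x1F
MODE_USER = 0x10         # User mode encoding
MODE_SYSTEM = 0x1F       # System mode encoding


def async_aborts_masked(cpsr: int) -> bool:
    """Return True if the A bit is set, i.e. async aborts are held pending."""
    return bool(cpsr & CPSR_A)


def with_async_aborts_masked(cpsr: int) -> int:
    """Return a copy of the CPSR value with the A bit set."""
    return cpsr | CPSR_A


if __name__ == "__main__":
    # Hypothetical initial CPSR for a thread: user mode, nothing masked.
    cpsr = MODE_USER
    print(hex(cpsr), "masked:", async_aborts_masked(cpsr))
    cpsr = with_async_aborts_masked(cpsr)
    print(hex(cpsr), "masked:", async_aborts_masked(cpsr))
```

A thread whose saved CPSR has the A bit clear will take any pending asynchronous abort as soon as it starts running, which matches the "hits immediately" behaviour described above.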
>
>>
>> I will be very interested to hear what the underlying cause was when you
>> work this one out :)
> I hope I will ....
>
>
> Cheers
>
> Robert
>
>
>> - Alex
>>
>>
>> ________________________________________
>> From: Robert Kaiser [robert.kai...@hs-rm.de]
>> Sent: Sunday, 22 March 2015 00:40
>> To: Alexander Kroh; devel@sel4.systems
>> Subject: Re: [seL4] Wandboard Port
>>
>> Hi,
>>
>> On 21.03.2015 at 04:52, Alexander Kroh wrote:
>>> You have only disabled async aborts. It sounds like the kernel is following
>>> a bad pointer which leads to a translation failure. The good news is that
>>> the fault address provided can actually be trusted.
>>> Note that when you mask async aborts, writes to the invalid physical
>>> address are ignored and reads are always 0. Is the kernel following a NULL
>>> pointer?
>> Here are the final words of a test run:
>>
>> ---------------------
>> </system-out>
>> </testcase>
>> <testcase classname="sel4test" name="TEST_DOMAINS0004">
>> INFO :sel4utils_elf_load_record_regions:270: * Loading segment 00008000-->0004$
>> INFO :sel4utils_elf_load_record_regions:270: * Loading segment 0004d170-->0016$
>> Running test DOMAINS0004 (Run threads in domains())
>>
>>
>> KERNEL DATA ABORT!
>> Faulting instruction: 0xe00107d8
>> FAR: 0x1f11c2e0 DFSR: 0x1c06
>> halting...
>> ---------------------
>>
>> The fault address (FAR) is never NULL. Even worse, its value varies
>> from test to test. Values I have seen are: 0x1f11c2e0 (most of the
>> time), 0x9f11c2e0 (often), 0x1f11c3e0 (rare). Interestingly, all
>> observed values differ in a single bit only.
>>
>> The address of the faulting instruction, however, is always the same,
>> and it is the entry point of the idle thread: Apparently, this is the
>> first time during the testsuite where the idle thread gets scheduled. It
>> turns out that the idle thread's CPSR value is hard coded here:
>>
>> https://github.com/seL4/seL4/blob/master/src/arch/arm/kernel/thread.c#L29
>>
>> and its value does not disable async faults. So this is the first time
>> code is being executed with async fault exceptions enabled and promptly,
>> the Wandboard crashes here.
>>
>> After disabling async faults for the idle thread as well, the testsuite
>> finally completes all the way to the famous "All is well in the
>> universe" :-)
>>
>> Since the observed fault addresses are varying between tests, I guess I
>> may have some flaky hardware here. Maybe U-Boot (or at least my version
>> thereof) has misconfigured DDR timing in some way. I'll have to look
>> into this and -unless someone wants them- I'll keep my patches to myself
>> until I find a clean solution.
>>
>> But for the meantime, I have a kernel that actually runs on the
>> Wandboard. And I think I have learned quite a bit from this exercise :-).
>>
>> Thanks a lot for your help.
>>
>> Robert
>>
>>
>>> ________________________________________
>>> From: Devel [devel-bounces@sel4.systems] on behalf of Robert Kaiser
>>> [robert.kai...@hs-rm.de]
>>> Sent: Saturday, 21 March 2015 06:20
>>> To: devel@sel4.systems
>>> Subject: Re: [seL4] Wandboard Port
>>>
>>> Hi,
>>>
>>> On 19.03.2015 at 09:34, Robert Kaiser wrote:
>>>> Hi Alex,
>>>>
>>>> On 18.03.2015 at 23:20, Alexander Kroh wrote:
>>>>> Hi Robert,
>>>>>
>>>>> Yes, the async abort is caused by access to a physical address which is
>>>>> not backed by memory or registers, regardless of virtual address
>>>>> translation.
>>>> OK, so: if the page table contains a mapping for user space address
>>>> 0x13294, but (due to a bug in the page table initialization) that page
>>>> is mapped to a page frame which is not backed by RAM (or ROM), then an
>>>> attempt to execute user code at that address would cause an async abort.
>>>> Is that correct?
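The three fault addresses reported for the sel4test runs (0x1f11c2e0, 0x9f11c2e0, 0x1f11c3e0) can be compared bit by bit with an XOR. The snippet below is a small host-side sketch, not part of seL4; `differing_bits` is a name invented for this note. It shows that each of the two rarer values differs from the most frequent one in exactly one bit, while the two rarer values differ from each other in two bits.

```python
# Fault addresses (FAR) reported in the thread, most frequent first.
OBSERVED_FARS = [0x1F11C2E0, 0x9F11C2E0, 0x1F11C3E0]


def differing_bits(a: int, b: int) -> list:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(32) if (diff >> bit) & 1]


if __name__ == "__main__":
    baseline = OBSERVED_FARS[0]
    for far in OBSERVED_FARS[1:]:
        bits = differing_bits(far, baseline)
        print(f"{far:#010x} vs {baseline:#010x}: differing bits {bits}")
```

Against the most frequent value, the differing positions are bit 31 and bit 8 respectively.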
>>>>
>>>> If so, it would be great if someone could point me to the code that sets
>>>> up the page table entries for the
>>>> first user space thread. (I already did an unsuccessful search for this
>>>> in the board specific initialization code but I cannot say that I fully
>>>> understand that code, so I may well have overlooked something.)
>>>>
>>>>> You could try masking IRQs to further isolate the interrupt as the
>>>>> trigger.
>>> I tried this. Result: no interrupt before start of user code, async
>>> fault still occurs in the same way as before -> I guess this shows that
>>> the interrupt has nothing to do with it.
>>>
>>>>> Another option is to mask the async abort. You might find additional
>>>>> symptoms which will help to identify the issue.
>>> Now, that was interesting: After disabling the async abort in user mode
>>> (it is always disabled in kernel mode), the board starts executing the
>>> test suite! It runs a few tests successfully, but then crashes with a
>>> *kernel* data abort when running test "Run threads in domains()". There
>>> goes my theory about a memory mapping issue, I guess. But how can it
>>> have a kernel mode data abort when it is disabled?
>>>
>>> Any ideas?
>>>
>>> Cheers
>>>
>>> Robert
>>>
>>>
>>>>> - Alex
>>>>>
>>>>> ________________________________________
>>>>> From: Robert Kaiser [robert.kai...@hs-rm.de]
>>>>> Sent: Wednesday, 18 March 2015 19:27
>>>>> To: Alexander Kroh
>>>>> Cc: devel@sel4.systems
>>>>> Subject: Re: [seL4] Wandboard Port
>>>>>
>>>>> Hi Alex
>>>>>
>>>>> On 16.03.2015 at 02:52, Alexander Kroh wrote:
>>>>>> On Sun, 2015-03-15 at 15:33 +0100, Robert Kaiser wrote:
>>>>>>> On 15.03.2015 at 11:23, Alexander Kroh wrote:
>>>>>>>> Hi Robert,
>>>>>>>>
>>>>>>>> The FSR value of 0x1c06 represents an asynchronous abort. In this
>>>>>>>> case, the address reported cannot be trusted!
>>>>>>> [...]
>>>>>>>> The abort occurs when a physical address is accessed that has no valid
>>>>>>>> backing RAM or device register.
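To see why 0x1c06 reads as an asynchronous abort, the value can be split into the DFSR fields of the ARMv7 short-descriptor format, where the fault status is FS[4] (bit 10) concatenated with FS[3:0] (bits 3:0). The following is a host-side sketch under that assumption. `decode_dfsr` and the small `FS_NAMES` table are illustrative names for this note and cover only a handful of encodings. Which fields are meaningful for a given fault type should be checked against the architecture manual.

```python
# A few fault status encodings of the ARMv7 short-descriptor format.
# This table is deliberately incomplete.
FS_NAMES = {
    0b00001: "alignment fault",
    0b00101: "translation fault, first level",
    0b00111: "translation fault, second level",
    0b01000: "synchronous external abort",
    0b10110: "asynchronous external abort",
}


def decode_dfsr(dfsr: int) -> dict:
    """Split a short-descriptor DFSR value into its fields."""
    fs = (((dfsr >> 10) & 1) << 4) | (dfsr & 0xF)  # FS[4] lives in bit 10
    return {
        "fs": fs,
        "fs_name": FS_NAMES.get(fs, "not in this table"),
        "domain": (dfsr >> 4) & 0xF,
        "write": bool((dfsr >> 11) & 1),  # WnR: access was a write
        "ext": (dfsr >> 12) & 1,          # ExT: implementation defined
    }


if __name__ == "__main__":
    fields = decode_dfsr(0x1C06)
    print(f"FS = {fields['fs']:#07b} ({fields['fs_name']})")
    print(f"domain = {fields['domain']}, write = {fields['write']}, ext = {fields['ext']}")
```

For 0x1c06 the status field comes out as 0b10110, the asynchronous external abort encoding.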
>>>>>>> So, could it also happen when accessing a virtual address that is mapped
>>>>>>> to an invalid physical address (that might explain what I'm seeing)?
>>>>>> The virtual to physical address translation has been completed
>>>>>> successfully, else you would get a synchronous abort. The key here is
>>>>>> that there was a problem with the underlying physical address.
>>>>> That's what I meant to suggest: If the virtual address is correctly
>>>>> translated to a physical address by the MMU, but that physical address
>>>>> is not backed by memory or registers, could that also generate this kind
>>>>> of exception?
>>>>>
>>>>>>>> We have had lots of fun with this feature on the SabreLite. Common
>>>>>>>> causes are:
>>>>>>>> * Accessing device registers that do not exist (some devices have voids
>>>>>>>> in the middle of their address map).
>>>>>>>> * If you (for some reason) map a device with the cacheable attribute,
>>>>>>>> all addresses which would be used to fill the cache line must be valid
>>>>>>>> (again, watch out for voids).
>>>>>>>> * Some UART registers are unavailable when the appropriate enable bits
>>>>>>>> are not set.
>>>>>>>>
>>>>>>>> My advice to you is to check that you are using the correct physical
>>>>>>>> address for your device mappings (including the kernel IRQ controller
>>>>>>>> and timer).
>>>>>>>>
>>>>>>>> Also, the first printf at userspace may trigger the initialisation of
>>>>>>>> the default UART (which will be incorrect in your case).
>>>>>>>> https://github.com/seL4/libplatsupport/blob/master/plat_include/imx6/platsupport/plat/serial.h#L40
>>>>>>> Thanks for this hint! That would have been the next thing for me to
>>>>>>> stumble over. However, quickly fixing it had no effect on my current
>>>>>>> problem.
>>>>>>>
>>>>>>>> There may also be slight differences in the availability of device
>>>>>>>> registers between the 2 SoCs.
>>>>>>> Is that really a possibility, given that U-Boot reports the same chip
>>>>>>> revision on both boards?
>>>>>> It is unlikely, but it is still a possibility. Is it only the ARM chip
>>>>>> revisions that match or also the i.MX6 chip revisions?
>>>>> Hmm, I'm sure I saw exactly the same outputs from both boards at some
>>>>> point, however, in the meantime I have re-flashed U-Boot on both of
>>>>> them. The situation now is that on the Sabre, U-Boot reports
>>>>>
>>>>> "CPU: Freescale i.MX6 family TO1.2 at 792 MHz"
>>>>>
>>>>> while on the Wand it says:
>>>>>
>>>>> "CPU: Freescale i.MX6Q rev1.2 at 792 MHz"
>>>>>
>>>>> No idea whether that "1.2" refers to the core or the SoC.
>>>>>
>>>>>
>>>>>>> [...]
>>>>>>> Wish I had a JTAG debugger....
>>>>>>>
>>>>>>> What I am still uncertain about is whether a fault upon entering user
>>>>>>> code is to be expected, i.e. do those pages get mapped in by a page
>>>>>>> fault handler or are they pre-mapped before the code is invoked?
>>>>>> The fault is unexpected. The pages are pre-mapped by the kernel, but
>>>>>> again, this is not a virtual memory mapping issue.
>>>>>> However, one thing that is typical is the occurrence of an IRQ exception
>>>>>> as soon as the mode switch to user space occurs.
>>>>> Indeed, that happens! I'm consistently seeing a timer interrupt at this
>>>>> point. Probably it has been pending for a while and fires as soon as the
>>>>> interrupt mask is dropped. Apart from its housekeeping work, this timer
>>>>> ISR does a few hardware accesses to the "private timer" and the
>>>>> interrupt controller (both components, as I understand, are part of the
>>>>> A9 core).
>>>>>
>>>>> I tried putting isb/dmb and dsb instructions right after these hardware
>>>>> accesses, hoping this might change the behaviour in some way, thus
>>>>> indicating which of them triggered the async fault. Alas, no effect at
>>>>> all :-(.
>>>>>
>>>>>> One thing to try is to insert an "isb" instruction just before switching
>>>>>> to user space. This will ensure that all memory accesses are completed
>>>>>> before continuing and it will force the asynchronous abort to occur at
>>>>>> this instruction rather than some future instruction, when the
>>>>>> load/store buffer finally drains.
>>>>>> You should also add an isb here in case you are returning from an IRQ:
>>>>>> https://github.com/seL4/seL4/blob/master/src/arch/arm/traps.S#L49
>>>>> I also tried this. And I tried sequences of dmb, dsb and isb
>>>>> instructions. All of this had no visible effect. The behaviour stays the
>>>>> same all the time: upon leaving privileged mode, the interrupt fires,
>>>>> gets serviced, then the async fault happens. I know the fault address
>>>>> cannot be trusted, but it never changed during these experiments. No
>>>>> matter where in the ISR or elsewhere I placed those isb instructions, it
>>>>> always pointed to the entry point of the user code.
>>>>>
>>>>> Any suggestions how to further systematically pinpoint this problem?
>>>>>
>>>>> Thanks in advance for any help.
>>>>>
>>>>> Robert
>>>>>
>>>>>> - Alex
>>>>>>
>>>>>>
>>>>>>> Again, thanks for any help
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> Robert
>>>>>>>
>>>>>>>
>>>>>>>> - Alex
>>>>>>>>
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Devel [devel-bounces@sel4.systems] on behalf of Robert Kaiser
>>>>>>>> [robert.kai...@hs-rm.de]
>>>>>>>> Sent: Sunday, 15 March 2015 19:03
>>>>>>>> To: devel@sel4.systems
>>>>>>>> Subject: [seL4] Wandboard Port
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> in an attempt to familiarize myself with the seL4 code, I am trying to
>>>>>>>> "port" it to the Wandboard (see www.wandboard.org). This should be an
>>>>>>>> easy task for a beginner (thought I) since the board is very similar to
>>>>>>>> the SabreLite, and seL4 is already running well on that board. I have
>>>>>>>> access to a SabreLite and a Wandboard Quad, both (according to U-Boot)
>>>>>>>> have the same revision of the i.MX6 SoC installed.
>>>>>>>>
>>>>>>>> Differences between the Sabre and the Wand I have noticed so far are:
>>>>>>>>
>>>>>>>> - 2GB of RAM (from 0x10000000 to 0x90000000) on the Wand (SabreLite
>>>>>>>> has 1GB)
>>>>>>>> - Wand uses UART1 for debug output, SabreLite: UART2
>>>>>>>>
>>>>>>>> I compiled an sel4test project where I adapted the UART port in
>>>>>>>> kernel/include/plat/imx6/plat/machine/devices.h and
>>>>>>>> elfloader/src/arch-arm/plat-imx6/platform.h and the RAM size in kernel
>>>>>>>> src/plat/imx6/machine/hardware.c. When I boot this system, I get:
>>>>>>>>
>>>>>>>> Jumping to kernel-image entry point...
>>>>>>>> Bootstrapping kernel
>>>>>>>> Caught cap fault in send phase at address 0x0
>>>>>>>> while trying to handle:
>>>>>>>> vm fault on data at address 0x9f11c2e0 with status 0x1c06
>>>>>>>> in thread 0xffdfad00 at address 0x13294
>>>>>>>>
>>>>>>>> (Needless to say, "all is well in the universe" on the SabreLite...)
>>>>>>>> What is not shown here are a ton of other debug messages which I have
>>>>>>>> added to convince myself that kernel initialization completes as
>>>>>>>> expected. The crash seems to happen upon entry into user code. The
>>>>>>>> address 0x13294 is the virtual address of the entry point:
>>>>>>>>
>>>>>>>> $ nm build/arm/imx6/sel4test-driver/sel4test-driver.bin | grep 13294
>>>>>>>> 00013294 T _sel4_start
>>>>>>>>
>>>>>>>> I suspect that this fault happens on opcode fetch, because the user
>>>>>>>> code is not properly mapped when invoked. Does "status 0x1c06" confirm
>>>>>>>> this?
>>>>>>>>
>>>>>>>> If so, *should* the code be mapped at this point or are these mappings
>>>>>>>> expected to be installed "on demand", i.e. through page fault handling?
>>>>>>>>
>>>>>>>> Thanks for any help...
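As a quick sanity check on the numbers in this message, the RAM window and the address from the boot log can be compared on a host machine. This is only a sketch: it assumes the stated range 0x10000000 to 0x90000000 is half-open (end address exclusive), the function names are made up for this note, and the address reported alongside an asynchronous abort may not be the one that actually caused it.

```python
# Wandboard RAM window as given in the message; end assumed to be exclusive.
RAM_START = 0x10000000
RAM_END = 0x90000000

GIB = 1024 ** 3


def ram_size(start: int = RAM_START, end: int = RAM_END) -> int:
    """Size of the RAM window in bytes."""
    return end - start


def in_ram(paddr: int, start: int = RAM_START, end: int = RAM_END) -> bool:
    """Return True if paddr lies inside the half-open window [start, end)."""
    return start <= paddr < end


if __name__ == "__main__":
    print(f"RAM size: {ram_size() // GIB} GiB")
    reported = 0x9F11C2E0  # address from the "vm fault on data" line
    print(f"{reported:#010x} inside RAM window: {in_ram(reported)}")
```

Under these assumptions the window is exactly 2 GiB and the reported address 0x9f11c2e0 lies above its end.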
>>>>>>>>
>>>>>>>> Robert
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Kaiser
>>>>>>>> Computer Engineering
>>>>>>>> RheinMain University of Applied Sciences
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Devel mailing list
>>>>>>>> Devel@sel4.systems
>>>>>>>> https://sel4.systems/lists/listinfo/devel

--
Prof. Dr. Robert Kaiser

Technische Informatik
Hochschule RheinMain
Wiesbaden Rüsselsheim

Computer Engineering
RheinMain University of Applied Sciences

robert.kai...@hs-rm.de
http://www.cs.hs-rm.de/~kaiser

tel:(+49)611-9495-1292
fax:(+49)611-9495-1210

Postanschrift/Postal Address:
Robert Kaiser, Hochschule RheinMain, FB DCSM/Informatik
Unter den Eichen 5, 65195 Wiesbaden, Germany