Re: [seL4] Wandboard Port

Robert Kaiser Wed, 29 Jul 2015 09:42:08 -0700

Hi Alex

On 29.07.2015 at 03:19, Alexander Kroh wrote:
> Hi Robert,
> 
> Did you end up getting to the bottom of this issue?
>


No, unfortunately I got somewhat overwhelmed with other stuff :-(

> We still have not fixed it... But the good news is that we have finally been 
> able to reproduce it!
> We find the same symptoms on the SabreLite board with a different version 
> of U-Boot.
> 

That would support my theory that U-Boot somehow causes the error.

Do you know if there is a way to just clear any stale pending data abort
conditions during startup prior to dropping the mask?

Robert

>   - Alex
> 
> ________________________________________
> From: Robert Kaiser [robert.kai...@hs-rm.de]
> Sent: Monday, 23 March 2015 08:38
> To: Alexander Kroh; devel@sel4.systems
> Subject: Re: [seL4] Wandboard Port
> 
> Hi Alex,
> 
> On 22.03.2015 at 07:08, Alexander Kroh wrote:
>> Hi Robert,
>>
>> Thanks for pointing that out, we will fix those hard coded values ASAP.
> Attached is a patch that should do this. Hope it's OK.
> 
> 
>>
>> From section B3.13.4 of the ARMv7 manual, in the case of an async abort, the 
>> DFAR (fault address) can't be trusted either.
>> It seems like the only way to get to the bottom of this is to use "isb" 
>> instructions to force any pending async abort to occur before execution can 
>> continue to an arbitrary point.
>> Unless there is a way to turn off the load/store buffers?
> 
> It seems to me like the async abort signal is completely bogus in my
> case: It hits immediately as soon as the mask is dropped in CPSR, no
> matter what the CPU is doing. The idle thread that crashed in the last
> test is simply an endless loop that does not load or store any data. All
> it does is opcode fetches. And if the async abort is kept masked, the
> entire testsuite completes successfully. I doubt this would be the case
> if loads or stores really failed (which -after all- is what the async
> abort is supposed to indicate, as far as I understand).
> 
> My current hypothesis is that this is due to some weird misconfiguration
> by U-boot.
> 
> 
>>
>> I will be very interested to hear what the underlying cause was when you 
>> work this one out :)
> I hope I will ....
> 
> 
> Cheers
> 
> Robert
> 
> 
>>   - Alex
>>
>>
>> ________________________________________
>> From: Robert Kaiser [robert.kai...@hs-rm.de]
>> Sent: Sunday, 22 March 2015 00:40
>> To: Alexander Kroh; devel@sel4.systems
>> Subject: Re: [seL4] Wandboard Port
>>
>> Hi,
>>
>> On 21.03.2015 at 04:52, Alexander Kroh wrote:
>>> You have only disabled async aborts. It sounds like the kernel is following 
>>> a bad pointer which leads to a translation failure. The good news is that 
>>> the fault address provided can actually be trusted.
>>> Note that when you mask async aborts, writes to the invalid physical 
>>> address are ignored and reads are always 0. Is the kernel following a NULL 
>>> pointer?
>> Here are the final words of a test run:
>>
>> ---------------------
>> </system-out>
>>         </testcase>
>>         <testcase classname="sel4test" name="TEST_DOMAINS0004">
>> INFO :sel4utils_elf_load_record_regions:270:  * Loading segment
>> 00008000-->0004$
>> INFO :sel4utils_elf_load_record_regions:270:  * Loading segment
>> 0004d170-->0016$
>> Running test DOMAINS0004 (Run threads in domains())
>>
>>
>> KERNEL DATA ABORT!
>> Faulting instruction: 0xe00107d8
>> FAR: 0x1f11c2e0 DFSR: 0x1c06
>> halting...
>> ---------------------
>>
>> The fault address (FAR) is never NULL. Even worse, its value varies
>> from test to test. Values I have seen are: 0x1f11c2e0 (most of the
>> time), 0x9f11c2e0 (often), 0x1f11c3e0 (rare). Interestingly, each of the
>> other two values differs from the most frequent one in a single bit only.
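The single-bit observation above can be checked with a short XOR calculation. This is only an illustrative sketch using the three FAR values quoted in the message; the helper name `differing_bits` is made up for this example and is not part of seL4.

```python
# Fault addresses (FAR values) quoted in the message above.
MOST_COMMON = 0x1F11C2E0
OTHERS = [0x9F11C2E0, 0x1F11C3E0]


def differing_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b  # XOR leaves a 1 exactly where the two values disagree
    return [bit for bit in range(32) if (diff >> bit) & 1]


for value in OTHERS:
    bits = differing_bits(value, MOST_COMMON)
    print(f"{value:#010x} vs {MOST_COMMON:#010x}: differing bits {bits}")
```

Each of the two rarer values differs from the most frequent one in exactly one bit (bit 31 and bit 8 respectively), while the two rarer values differ from each other in both of those bits.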
>>
>> The address of the faulting instruction, however, is always the same,
>> and it is the entry point of the idle thread: Apparently, this is the
>> first time during the testsuite where the idle thread gets scheduled. It
>> turns out that the idle thread's CPSR value is hard coded here:
>>
>> https://github.com/seL4/seL4/blob/master/src/arch/arm/kernel/thread.c#L29
>>
>> and its value does not disable async faults. So this is the first time
>> code is being executed with async fault exceptions enabled and promptly,
>> the Wandboard crashes here.
>>
>> After disabling async faults for the idle thread as well, the testsuite
>> finally completes all the way to the famous "All is well in the
>> universe" :-)
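To make the CPSR change described above concrete, here is a minimal sketch of the bit arithmetic involved. The bit positions (A = bit 8, I = bit 7, F = bit 6) and mode encodings are those of the ARMv7-A architecture; the initial value used for the idle thread is a hypothetical one chosen for illustration and is not copied from the seL4 sources.

```python
# CPSR mask bits and mode field as defined by the ARMv7-A architecture.
CPSR_A = 1 << 8      # asynchronous abort mask
CPSR_I = 1 << 7      # IRQ mask
CPSR_F = 1 << 6      # FIQ mask
MODE_USER = 0x10     # mode field M[4:0], User mode
MODE_SYSTEM = 0x1F   # mode field M[4:0], System mode


def mask_async_aborts(cpsr):
    """Return cpsr with the A bit set, i.e. asynchronous aborts masked."""
    return cpsr | CPSR_A


def async_aborts_masked(cpsr):
    """True if the A bit is set in cpsr."""
    return bool(cpsr & CPSR_A)


# Hypothetical idle-thread CPSR: System mode with nothing masked (assumption).
idle_cpsr = MODE_SYSTEM
patched = mask_async_aborts(idle_cpsr)
print(f"before: {idle_cpsr:#05x} masked={async_aborts_masked(idle_cpsr)}")
print(f"after:  {patched:#05x} masked={async_aborts_masked(patched)}")
```

Under that assumption, the patched value is simply the original value with bit 8 set; the mode field and the IRQ and FIQ masks are left untouched.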
>>
>> Since the observed fault addresses are varying between tests, I guess I
>> may have some flaky hardware here. Maybe U-Boot (or at least my version
>> thereof) has misconfigured DDR timing in some way. I'll have to look
>> into this and -unless someone wants them- I'll keep my patches to myself
>> until I find a clean solution.
>>
>> But for the meantime, I have a kernel that actually runs on the
>> Wandboard. And I think I have learned quite a bit from this exercise :-).
>>
>> Thanks a lot for your help.
>>
>> Robert
>>
>>
>>
>>> ________________________________________
>>> From: Devel [devel-bounces@sel4.systems] on behalf of Robert Kaiser 
>>> [robert.kai...@hs-rm.de]
>>> Sent: Saturday, 21 March 2015 06:20
>>> To: devel@sel4.systems
>>> Subject: Re: [seL4] Wandboard Port
>>>
>>> Hi,
>>>
>>> On 19.03.2015 at 09:34, Robert Kaiser wrote:
>>>> Hi Alex,
>>>>
>>>> On 18.03.2015 at 23:20, Alexander Kroh wrote:
>>>>> Hi Robert,
>>>>>
>>>>> Yes, the async abort is caused by access to a physical address which is 
>>>>> not backed by memory or registers, regardless of virtual address 
>>>>> translation.
>>>> OK, so: iff the page table contains a mapping for user space address
>>>> 0x13294, but (due to a bug in the page table initialization) that page
>>>> is mapped to a page frame which is not backed by RAM (or ROM), then, an
>>>> attempt to execute user code at that address would cause an async abort.
>>>> Is that correct?
>>>>
>>>> If so, it would be great if someone could point me to the code that sets
>>>> up the page table entries for the
>>>> first user space thread. (I already did an unsuccessful search for this
>>>> in the board specific initialization code but I can not say that I fully
>>>> understand that code, so I may well have overlooked something.. )
>>>>
>>>>> You could try masking IRQs to further isolate the interrupt as the 
>>>>> trigger.
>>> I tried this. Result: no interrupt occurs before the start of user code,
>>> but the async fault still occurs in the same way as before. I guess this
>>> shows that the interrupt has nothing to do with it.
>>>
>>>>> Another option is to mask the async abort. You might find additional 
>>>>> symptoms which will help to identify the issue.
>>> Now, that was interesting: After disabling the async abort in user mode
>>> (it is always disabled in kernel mode), the board starts executing the
>>> test suite! It runs a few tests successfully, but then crashes with a
>>> *kernel* data abort when running test "Run threads in domains()". There
>>> goes my theory about a memory mapping issue, I guess. But how can it
>>> have a kernel mode data abort when it is disabled?
>>>
>>> Any ideas?
>>>
>>> Cheers
>>>
>>> Robert
>>>
>>>
>>>>>   - Alex
>>>>>
>>>>> ________________________________________
>>>>> From: Robert Kaiser [robert.kai...@hs-rm.de]
>>>>> Sent: Wednesday, 18 March 2015 19:27
>>>>> To: Alexander Kroh
>>>>> Cc: devel@sel4.systems
>>>>> Subject: Re: [seL4] Wandboard Port
>>>>>
>>>>> Hi Alex
>>>>>
>>>>> On 16.03.2015 at 02:52, Alexander Kroh wrote:
>>>>>> On Sun, 2015-03-15 at 15:33 +0100, Robert Kaiser wrote:
>>>>>>> On 15.03.2015 at 11:23, Alexander Kroh wrote:
>>>>>>>> Hi Robert,
>>>>>>>>
>>>>>>>> The FSR value of 0x1c06 represents an asynchronous abort. In this 
>>>>>>>> case, the address reported cannot be trusted!
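The reading of 0x1c06 given above can be reproduced by splitting the value into the fields of the ARMv7 DFSR in its short-descriptor format. The following is a hedged sketch, assuming that format is in use (bit 9 of the value is clear); the function name and the deliberately partial table of fault-status names are illustrative only.

```python
# Partial table of short-descriptor fault status (FS) encodings, for illustration.
FS_NAMES = {
    0b00101: "translation fault (section)",
    0b00111: "translation fault (page)",
    0b01000: "synchronous external abort",
    0b10110: "asynchronous external abort",
}


def decode_dfsr(dfsr):
    """Split a short-descriptor DFSR value into its raw fields."""
    # FS is split across the register: FS[4] is bit 10, FS[3:0] are bits 3:0.
    fs = (((dfsr >> 10) & 1) << 4) | (dfsr & 0xF)
    return {
        "fs": fs,
        "fs_name": FS_NAMES.get(fs, "not in this partial table"),
        "domain": (dfsr >> 4) & 0xF,   # bits 7:4
        "wnr": (dfsr >> 11) & 1,       # raw bit 11 (write-not-read)
        "ext": (dfsr >> 12) & 1,       # raw bit 12 (external abort type)
    }


fields = decode_dfsr(0x1C06)
print(f"FS = {fields['fs']:#07b} ({fields['fs_name']})")
```

For 0x1c06 the combined FS field is 0b10110, which is the encoding for an asynchronous external abort.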
>>>>>>> [...]
>>>>>>>> The abort occurs when a physical address is accessed that has no valid 
>>>>>>>> backing RAM or device register.
>>>>>>> So, could it also happen when accessing a virtual address that is mapped
>>>>>>> to an invalid physical address (that might explain what I'm seeing)?
>>>>>> The virtual to physical address translation has been completed
>>>>>> successfully, else you would get a synchronous abort. The key here is
>>>>>> that there was a problem with the underlying physical address.
>>>>> That's what I meant to suggest: If the virtual address is correctly
>>>>> translated to a physical address by the MMU, but that physical address
>>>>> is not backed by memory or registers, could that also generate this kind
>>>>> of exception?
>>>>>
>>>>>>>>  We have had lots of fun with this feature on the SabreLite. Common 
>>>>>>>> causes are:
>>>>>>>> * Accessing device registers that do not exist (some devices have voids in 
>>>>>>>> the middle of their address map).
>>>>>>>> * If you (for some reason) map a device with the cacheable attribute, 
>>>>>>>> all addresses which would be used to fill the cache line must be valid 
>>>>>>>> (again, watch out for voids).
>>>>>>>> * Some UART registers are unavailable when the appropriate enable bits 
>>>>>>>> are not set.
>>>>>>>>
>>>>>>>> My advice to you is to check that you are using the correct physical 
>>>>>>>> address for your device mappings (Including the kernel IRQ controller 
>>>>>>>> and timer).
>>>>>>>>
>>>>>>>> Also, the first printf at userspace may trigger the initialisation of 
>>>>>>>> the default UART (which will be incorrect in your case).
>>>>>>>> https://github.com/seL4/libplatsupport/blob/master/plat_include/imx6/platsupport/plat/serial.h#L40
>>>>>>> Thanks for this hint! That would have been the next thing for me to
>>>>>>> stumble over. However, quickly fixing it had no effect on my current
>>>>>>> problem.
>>>>>>>
>>>>>>>> There may also be slight differences in the availability of device 
>>>>>>>> registers between the 2 SoCs.
>>>>>>> Is that really a possibility, given that U-boot reports the same chip
>>>>>>> revision on both boards?
>>>>>> It is unlikely, but it is still a possibility. Is it only the ARM chip
>>>>>> revisions that match or also the i.MX6 chip revisions?
>>>>> Hmm, I'm sure I saw exactly the same outputs from both boards at some
>>>>> point, however, in the meantime I have re-flashed U-Boot on both of
>>>>> them. The situation now is that on the Sabre, U-Boot reports
>>>>>
>>>>> "CPU: Freescale i.MX6 family TO1.2 at 792 MHz"
>>>>>
>>>>> while on the wand it says:
>>>>>
>>>>> "CPU:   Freescale i.MX6Q rev1.2 at 792 MHz"
>>>>>
>>>>> No idea whether that "1.2" refers to the core or the SoC.
>>>>>
>>>>>
>>>>>
>>>>>>> [...]
>>>>>>> Wish I had a JTAG-debugger....
>>>>>>>
>>>>>>> What I am still uncertain about is whether a fault upon entering user
>>>>>>> code is to be expected, i.e. do those pages get mapped in by a page
>>>>>>> fault handler or are they pre-mapped before the code is invoked?
>>>>>> The fault is unexpected. The pages are pre-mapped by the kernel, but
>>>>>> again, this is not a virtual memory mapping issue.
>>>>>> However, one thing that is typical is the occurrence of an IRQ exception
>>>>>> as soon as the mode switch to user space occurs.
>>>>> Indeed, that happens! I'm consistently seeing a timer interrupt at this
>>>>> point. Probably it has been pending for a while and fires as soon as the
>>>>> interrupt mask is dropped. Apart from its housekeeping work, this timer
>>>>> ISR does a few hardware accesses to the "private timer"  and the
>>>>> interrupt controller (both components, as I understand, are part of the
>>>>> A9 core).
>>>>>
>>>>> I tried putting isb/dmb and dsb instructions right after these hardware
>>>>> accesses, hoping this might change the behaviour  in some way, thus
>>>>> indicating which of them  triggered the async fault. Alas, no effect at
>>>>> all :-(.
>>>>>
>>>>>> One thing to try is to insert an "isb" instruction just before switching
>>>>>> to user space. This will ensure that all memory accesses are completed
>>>>>> before continuing and it will force the asynchronous abort to occur at
>>>>>> this instruction rather than some future instruction, when the
>>>>>> load/store buffer finally drains.
>>>>>> You should also add an isb here in case you are returning from an IRQ:
>>>>>> https://github.com/seL4/seL4/blob/master/src/arch/arm/traps.S#L49
>>>>> I also tried this. And I tried sequences of dmb, dsb and isb
>>>>> instructions. All of this had no visible effect. The behaviour stays the
>>>>> same all the time: upon leaving privileged mode, the interrupt fires,
>>>>> gets serviced, then the async fault happens. I know the fault address
>>>>> cannot be trusted, but it never changed during these experiments. No
>>>>> matter where in the ISR or elsewhere I placed those isb instructions, it
>>>>> always pointed to the entry point of the user code.
>>>>>
>>>>> Any suggestions how to further systematically pinpoint this problem?
>>>>>
>>>>> Thanks in advance for any help.
>>>>>
>>>>> Robert
>>>>>
>>>>>>  - Alex
>>>>>>
>>>>>>
>>>>>>> Again, thanks for any help
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> Robert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>  - Alex
>>>>>>>>
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Devel [devel-bounces@sel4.systems] on behalf of Robert Kaiser 
>>>>>>>> [robert.kai...@hs-rm.de]
>>>>>>>> Sent: Sunday, 15 March 2015 19:03
>>>>>>>> To: devel@sel4.systems
>>>>>>>> Subject: [seL4] Wandboard Port
>>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> in an attempt to familiarize myself with the seL4 code, I am trying to
>>>>>>>> "port" it to the Wandboard (see www.wandboard.org). This should be an
>>>>>>>> easy task for a beginner (thought I) since the board is very similar to
>>>>>>>> the SabreLite, and seL4 is already running well on that board. I have
>>>>>>>> access to a SabreLite and a Wandboard Quad, both (according to U-boot)
>>>>>>>> have the same revision of the iMX6 SoC installed.
>>>>>>>>
>>>>>>>> Differences between the Sabre and the Wand I have noticed so far are:
>>>>>>>>
>>>>>>>> - 2GB of RAM (from 0x10000000 to 0x90000000) on the Wand (SabreLite 
>>>>>>>> has 1GB)
>>>>>>>> - Wand uses UART1 for debug output, SabreLite: UART2
>>>>>>>>
>>>>>>>> I compiled an sel4test project where I adapted the UART port in
>>>>>>>> kernel/include/plat/imx6/plat/machine/devices.h and
>>>>>>>> elfloader/src/arch-arm/plat-imx6/platform.h and the RAM size in kernel
>>>>>>>> src/plat/imx6/machine/hardware.c. When I boot this system, I get:
>>>>>>>>
>>>>>>>> Jumping to kernel-image entry point...
>>>>>>>> Bootstrapping kernel
>>>>>>>> Caught cap fault in send phase at address 0x0
>>>>>>>> while trying to handle:
>>>>>>>> vm fault on data at address 0x9f11c2e0 with status 0x1c06
>>>>>>>> in thread 0xffdfad00 at address 0x13294
>>>>>>>>
>>>>>>>> (Needless to say, "all is well in the universe" on the SabreLite... )
>>>>>>>> What is not shown here are a ton of other debug messages which I have
>>>>>>>> added to convince myself that kernel initialization completes as
>>>>>>>> expected. The crash seems to happen upon entry into user code. The
>>>>>>>> address 0x13294 is the virtual address of the entry point:
>>>>>>>>
>>>>>>>> $ nm build/arm/imx6/sel4test-driver/sel4test-driver.bin | grep 13294
>>>>>>>> 00013294 T _sel4_start
>>>>>>>>
>>>>>>>> I suspect that this fault happens on opcode fetch, because the user 
>>>>>>>> code
>>>>>>>> is not properly mapped when invoked. Does "status 0x1c06" confirm this?
>>>>>>>>
>>>>>>>> If so, *should* the code be mapped at this point or are these mappings
>>>>>>>> expected to be installed "on demand", i.e. through page fault handling?
>>>>>>>>
>>>>>>>> Thanks for any help...
>>>>>>>>
>>>>>>>> Robert
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Kaiser
>>>>>>>> Computer Engineering
>>>>>>>> RheinMain University of Applied Sciences
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Devel mailing list
>>>>>>>> Devel@sel4.systems
>>>>>>>> https://sel4.systems/lists/listinfo/devel
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>>
>>>>>>>> The information in this e-mail may be confidential and subject to 
>>>>>>>> legal professional privilege and/or copyright. National ICT Australia 
>>>>>>>> Limited accepts no liability for any damage caused by this email or 
>>>>>>>> its attachments.
>>>>> --
>>>>> Prof. Dr. Robert Kaiser
>>>>>
>>>>> Technische Informatik
>>>>> Hochschule RheinMain
>>>>> Wiesbaden Rüsselsheim
>>>>>
>>>>> Computer Engineering
>>>>> RheinMain University of Applied Sciences
>>>>>
>>>>> robert.kai...@hs-rm.de
>>>>> http://www.cs.hs-rm.de/~kaiser
>>>>>
>>>>> tel:(+49)611-9495-1292
>>>>> fax:(+49)611-9495-1210
>>>>>
>>>>> Postanschrift/Postal Address:
>>>>> Robert Kaiser, Hochschule RheinMain, FB DCSM/Informatik
>>>>> Unter den Eichen 5, 65195 Wiesbaden, Germany
>>>>>
>>>>>

-- 
Prof. Dr. Robert Kaiser

Technische Informatik
Hochschule RheinMain
Wiesbaden Rüsselsheim

Computer Engineering
RheinMain University of Applied Sciences

robert.kai...@hs-rm.de
http://www.cs.hs-rm.de/~kaiser

tel:(+49)611-9495-1292
fax:(+49)611-9495-1210

Postanschrift/Postal Address:
Robert Kaiser, Hochschule RheinMain, FB DCSM/Informatik
Unter den Eichen 5, 65195 Wiesbaden, Germany



_______________________________________________
Devel mailing list
Devel@sel4.systems
https://sel4.systems/lists/listinfo/devel
