Hi Nathan,

I just ran into this this week on all the H7 boards in PX4.
This was broken by https://github.com/apache/incubator-nuttx/pull/459/files#diff-8e09d7a85a1cc6cc65e0bb43d7aa751e3caaec9f9d5e824f9b741a1487ec9199L117-L138 -- I did not catch the implication of that change at the time. The original idea was to leave SRAM4 out of the heap when CONFIG_MM_REGIONS was 3. A side-effect benefit of the ordering was that tasks allocated first used DTCM :). But that is not very deterministic. Adding the DTCM heap is a move in a better direction; then set CONFIG_MM_REGIONS to 1. However, not having a "SPEED" parameter on allocation makes using the DTCM very arch/chip specific. As you have noted, checking at build time only works for in-tree configurations.

The quick fix of adding a CONFIG_..._EXCLUDE_SRAM4... (naming TBD) and lighting it in things that select BDMA to SRAM4 would be a good addition and would help in some cases, but selecting it just on BDMA could be wasteful. Maybe it should just be a warning if SRAM4 is added to the heap while BDMA is enabled.

These more complex multi-power-domain SoCs require more floor planning of resource allocation on the board designer's and user's part. I think a more flexible approach would be to rework the heap-addition logic to be selectable and orderable. Then a board designer can choose which RAM blocks to add to the heap and in what order:

  CONFIG_..._ADD_ORDER_SRAM..
  CONFIG_..._ADD_ORDER_SRAM12..
  CONFIG_..._ADD_ORDER_SRAM3..
  CONFIG_..._ADD_ORDER_SRAM4..
  CONFIG_..._ADD_ORDER_DTCM..
  (naming TBD)

0 means the block is not used; 1 is added first, N is added last. These get defined into an array along with the start addresses and lengths, and the code then processes them in a loop until all are added. I think this would give us the all-in-one solution and not force the solution back to calling addregion in late init per board.

In the case of SRAM4 it is small, and I do not think there is value in carving it up into static buffers and a partial heap.
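For concreteness, a sketch of what that selectable/orderable heap addition could look like. All the CONFIG_ names here are hypothetical (not an existing NuttX Kconfig), the region table uses STM32H743-style addresses and sizes purely as an illustration, and addregion() is stubbed out so the sketch runs standalone:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical Kconfig order values: 0 = not in heap, 1 = added first, ... */

#define CONFIG_ADD_ORDER_SRAM    1   /* AXI SRAM added first */
#define CONFIG_ADD_ORDER_SRAM123 2
#define CONFIG_ADD_ORDER_SRAM4   0   /* excluded: reserved for BDMA buffers */
#define CONFIG_ADD_ORDER_DTCM    3

struct heap_region_s
{
  uintptr_t   start;
  size_t      size;
  const char *name;
  int         order;                 /* 0 = skip this block */
};

/* Illustrative STM32H743-style region table */

static const struct heap_region_s g_regions[] =
{
  { 0x24000000, 512 * 1024, "AXI SRAM", CONFIG_ADD_ORDER_SRAM },
  { 0x30000000, 288 * 1024, "SRAM123",  CONFIG_ADD_ORDER_SRAM123 },
  { 0x38000000,  64 * 1024, "SRAM4",    CONFIG_ADD_ORDER_SRAM4 },
  { 0x20000000, 128 * 1024, "DTCM",     CONFIG_ADD_ORDER_DTCM },
};

#define NREGIONS (sizeof(g_regions) / sizeof(g_regions[0]))

/* Stand-in for the real addregion(); here it just reports and counts. */

static int g_added;

static void addregion(uintptr_t start, size_t size, const char *name)
{
  printf("heap += %s @ %#lx (%zu bytes)\n",
         name, (unsigned long)start, size);
  g_added++;
}

/* Walk order values 1..N, adding each enabled block in the chosen order. */

void up_addregions(void)
{
  for (int order = 1; order <= (int)NREGIONS; order++)
    {
      for (size_t i = 0; i < NREGIONS; i++)
        {
          if (g_regions[i].order == order)
            {
              addregion(g_regions[i].start, g_regions[i].size,
                        g_regions[i].name);
            }
        }
    }
}
```

With SRAM4's order set to 0 it simply never reaches the heap, which is the "hole" Nathan asks about below, and the board designer gets the DTCM-first (or last) ordering as a free choice.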
(But when you are out of memory, times get desperate.) I spent quite a bit of time mulling over using the granule allocator and placing the buffers in the drivers. My conclusion was this: for the granule allocator you have to give it static memory large enough to meet ALL demand (this creates a runtime size dependency the user has to manage). After init, the memory will be given out to the drivers, and that is dependent on device message sizes (i.e., max SPI payload, which is for the most part static, unless you have PNP SPI). So once the memory is allocated out, it is the same as the static allocation of buffers in the drivers. It is more code, more complexity, and can fail at run time for out-of-memory or DMA/Dcache alignment reasons. The compiler telling you SRAM4 is full is a lot easier to debug :)

The other issue is that there is no way to pass a 2-byte buffer to a driver, DMA/Dcache validate it, and fall back to non-DMA. The in-driver buffers do add a mem copy, but given the DMA/Dcache alignment issues, and after doing the timing analysis, it is a very small addition compared to the actual I/O time.

David

-----Original Message-----
From: Nathan Hartman [mailto:hartman.nat...@gmail.com]
Sent: Thursday, March 25, 2021 4:28 PM
To: dev@nuttx.apache.org
Subject: How to ensure HEAP will not overlap static DMA buffer?

tl;dr: How do I statically allocate DMA buffers and prevent the heap from overlapping them?

Details: On STM32H7, BDMA can only reach SRAM4 (64 KB starting at 0x3800:0000). The ARMv7-M DCache line size means the buffers must be allocated on a cache line boundary. So a DMA buffer must meet two requirements: (1) be in SRAM4, and (2) be aligned and padded to 32-byte increments. OK, I can do this easily with:

  static volatile uint8_t my_bdma_buf[128]
    __attribute__ ((section (".sram4")))
    __attribute__ ((aligned (32)));

or equivalent.
But then the heap will overlap this memory and my buffer will be clobbered, because arm_addregion() in arch/arm/src/stm32h7/stm32_allocateheap.c does this:

  addregion (SRAM4_START, SRAM4_END - SRAM4_START, "SRAM4");

Looking into it, I see that in arch/arm/src/stm32h7/stm32_spi.c, g_spi6_txbuf[] and g_spi6_rxbuf[] are declared like this:

  static uint8_t g_spi6_txbuf[SPI6_DMABUFSIZE_ADJUSTED] SPI6_DMABUFSIZE_ALGN
    locate_data(".sram4");
  static uint8_t g_spi6_rxbuf[SPI6_DMABUFSIZE_ADJUSTED] SPI6_DMABUFSIZE_ALGN
    locate_data(".sram4");

So they will suffer the same problem. (I'm pretty sure that's a bug, unless I missed something.)

I discovered this the hard way: on my board, NSH suddenly wouldn't start. In reality, it does start, but it turns out that when it tries to print the greeting message, it fails on the very first character ('\n') and exits. Why? I single-stepped all the way into the guts of the serial logic and found that in fs/vfs/fs_write.c, file_write() fetches the inode and tests whether it's valid (!inode || !inode->u.i_ops || !inode->u.i_ops->write) or returns -EBADF. That's where it fails, because it seems that the inode is allocated at 0x3800:0070, right on top of my BDMA buffers!

I don't know which question to ask, but I think it's one of these:

(1) Is there a standard(-ish) way to allocate a buffer that is both aligned to a specified boundary AND in a specified region of memory?

(2) Is there a standard way to make a hole in the heap so that nothing else in NuttX will overwrite the DMA buffer? (Yes, I see that I can fiddle with CONFIG_MM_REGIONS, but then only the 512 KB of AXI SRAM will be in the heap, and none of the other regions, SRAM1,2,3, etc., and I don't think I want that.)

Thanks,
Nathan
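The granule-allocator tradeoff David describes above can be made concrete with a toy sketch. This is NOT NuttX's real granule allocator (which, if I recall correctly, lives in mm/mm_gran behind CONFIG_GRAN with gran_initialize()/gran_alloc() declared in include/nuttx/mm/gran.h); the function names, pool size, and bitmap scheme here are made up for illustration. It shows the two properties he calls out: the pool must be statically sized for worst-case demand up front, and exhaustion is a run-time NULL rather than a link-time error:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy granule allocator sketch: hand out cache-line-sized granules
 * from one static, cache-line-aligned pool, first fit. */

#define GRAN_SHIFT 5                        /* 32-byte granules = DCache line */
#define GRAN_SIZE  (1u << GRAN_SHIFT)
#define POOL_GRANS 16                       /* 16 * 32 = 512-byte static pool */

static uint8_t  g_pool[POOL_GRANS * GRAN_SIZE]
                __attribute__((aligned(GRAN_SIZE)));
static uint32_t g_used;                     /* one bit per granule */

static void *gran_alloc_sketch(size_t size)
{
  unsigned ngran = (unsigned)((size + GRAN_SIZE - 1) >> GRAN_SHIFT);

  for (unsigned i = 0; i + ngran <= POOL_GRANS; i++)
    {
      uint32_t mask = ((1u << ngran) - 1) << i;
      if ((g_used & mask) == 0)
        {
          g_used |= mask;
          return &g_pool[i << GRAN_SHIFT];  /* always cache-line aligned */
        }
    }

  return NULL;                              /* pool exhausted: run-time failure */
}

static void gran_free_sketch(void *mem, size_t size)
{
  unsigned i     = (unsigned)(((uint8_t *)mem - g_pool) >> GRAN_SHIFT);
  unsigned ngran = (unsigned)((size + GRAN_SIZE - 1) >> GRAN_SHIFT);

  g_used &= ~(((1u << ngran) - 1) << i);
}
```

Note how even a 2-byte request burns a whole 32-byte granule, which is David's point: once the pool is handed out to drivers at init, it is no more flexible than static per-driver buffers, but with extra code and a run-time failure mode instead of the linker telling you SRAM4 is full.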