tmedicci commented on PR #16673:
URL: https://github.com/apache/nuttx/pull/16673#issuecomment-3188301312

   @jlaitine Just tested it and we still have the same issue. Let me try to 
explain it further.
   
   Currently, we are testing 
[this](https://github.com/tmedicci/nuttx/tree/fix_smp_scheduling) branch:  It's 
basically your branch + the suggested change about the spiflash driver tasks 
(https://github.com/tmedicci/nuttx/commit/95150eae198742196a63b04c18a5ffd3c204668f).
 It still fails to run the same example I reported yesterday (here: 
https://github.com/apache/nuttx/pull/16673#pullrequestreview-3117567657). The 
device initializes, but the `nsh>` freezes.
   
   By checking at the debugger, I could see that only one of the 
`spi_flash_op_block_task` ran in the first core. The same branch didn't run at 
all in the second core. Interestingly, by halting the device after typing 
anything, I can check the following values for `g_readytorun` and 
`g_assignedtasks`:
   ```
   +p g_readytorun
   $1 = {head = 0x3fc92578, tail = 0x3fc911a8}
   +p g_assignedtasks
   $2 = {0x3fc8ae00 <g_idletcb>, 0x3fc8aed4 <g_idletcb+212>}
   ```
   
   Please note that both cores are running the idle thread (`g_assignedtasks`), 
but we have two pending tasks at `g_readytorun` and, curiously, the head TCB's 
(`0x3fc92578`) refers to the `spi_flash_op_block_task` task that should have 
run in the second core. The other task at tail (`0x3fc911a8`) refers to the 
UART handler that should process the typed character and echoes it back to nsh.
   
   > I didn't quite understand what you mean with inner/outer loops etc; so 
maybe share the modified spiflash driver code to help me understand the issue 
better?
   
   I referred to the loop that creates both `spi_flash_op_block_task` 
[here](https://github.com/tmedicci/incubator-nuttx/blob/95150eae198742196a63b04c18a5ffd3c204668f/arch/xtensa/src/esp32s3/esp32s3_spiflash.c#L1465).
 Please check that the scheduler is locked before the loop starts (and unlocked 
when it finishes). Also, please note that - as suggested - we protected the 
thread's creation/affinity within the critical section at 
[`spiflash_init_spi_flash_op_block_task`](https://github.com/tmedicci/incubator-nuttx/blob/95150eae198742196a63b04c18a5ffd3c204668f/arch/xtensa/src/esp32s3/esp32s3_spiflash.c#L1013).
   
   In order to test it, I applied the following patch:
   ```
   diff --git a/arch/xtensa/src/esp32s3/esp32s3_spiflash.c 
b/arch/xtensa/src/esp32s3/esp32s3_spiflash.c
   index b34cead684..f83066674a 100644
   --- a/arch/xtensa/src/esp32s3/esp32s3_spiflash.c
   +++ b/arch/xtensa/src/esp32s3/esp32s3_spiflash.c
   @@ -1004,14 +1004,11 @@ static int spiflash_init_spi_flash_op_block_task(int 
cpu)
      char *argv[2];
      char arg1[32];
      cpu_set_t cpuset;
   -  irqstate_t flags;
    
      snprintf(arg1, sizeof(arg1), "%p", &cpu);
      argv[0] = arg1;
      argv[1] = NULL;
    
   -  flags = enter_critical_section();
   -
      pid = kthread_create("spiflash_op",
                           SCHED_PRIORITY_MAX,
                           CONFIG_ESP32S3_SPIFLASH_OP_TASK_STACKSIZE,
   @@ -1031,8 +1028,6 @@ static int spiflash_init_spi_flash_op_block_task(int 
cpu)
          ret = -EPERM;
        }
    
   -  leave_critical_section(flags);
   -
      return ret;
    }
    #endif /* CONFIG_SMP */
   @@ -1460,7 +1455,7 @@ int esp32s3_spiflash_init(void)
      nxrmutex_init(&g_flash_op_mutex);
    
    #ifdef CONFIG_SMP
   -  sched_lock();
   +  irqstate_t flags = enter_critical_section();
    
      for (cpu = 0; cpu < CONFIG_SMP_NCPUS; cpu++)
        {
   @@ -1473,7 +1468,7 @@ int esp32s3_spiflash_init(void)
            }
        }
    
   -  sched_unlock();
   +  leave_critical_section(flags);
    #else
      UNUSED(cpu);
    #endif
   ```
   
   Just removed the critical section from 
`spiflash_init_spi_flash_op_block_task` and protected the whole 
`spiflash_init_spi_flash_op_block_task` block outside the loop, by replacing 
the `sched_lock/unlock` for the critical section. **That worked!**
   
   So, in summary, although this last patch made it work, it still seems to 
have something wrong with `sched_lock/unlock` because - without this last patch 
- the `spi_flash_op_block_task` for the second core remains forever on 
`g_readytorun` and prevents other tasks on that list from running too... 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@nuttx.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to