On 2018-08-08 15:12, Ralf Ramsauer wrote:
Hi Jan,

On 08/08/2018 08:21 AM, Jan Kiszka wrote:
On 2018-08-07 12:42, Ralf Ramsauer wrote:
Hi,

On 08/06/2018 04:33 PM, Ralf Ramsauer wrote:
[snip]

On 07/24/2018 05:58 PM, Jan Kiszka wrote:

Don't know if I understand that race correctly: The problem is that a
CPU which is already suspended might be suspended again for some other
reason, right? To resolve this, we would have to synchronously wait in
the second caller of arch_suspend_cpu(), until the first suspension is
handled. Hmm, sounds more like a problem to be solved with a
semaphore,
rather than a lock: can there be a third caller of suspend_cpu?

I didn't think all cases through, in fact, and there might also be no
issue because of the control_lock protection. However, understanding the
logic will definitely be easier if you know that, after arch_resume_cpu
returned, the target CPU has left the event loop for sure.

I've got a question on the control_lock/cpu_suspend/suspend_cpu logic:

Let's say we have a scenario with three CPUs, one target, two callers.
Both callers want to suspend-resume the target.

Note that the existing suspend/resume logic is not designed for
concurrent usage. That's one reason why we suspend the root cell when
doing operations on other cells.

I think it should be designed for concurrent use, because there could be
concurrent misuse. Example: think of the (silly) case where two CPUs of
the root cell try to concurrently destroy a cell, or, in general, invoke
concurrent management actions, e.g., via two parallel jailhouse loads
(the HV cell_set_loadable path). That's stupid but possible.

Try this snippet on QEMU x86. It simply parallelises cell destroy calls
in a way that gives a higher chance of hitting the window. It triggers
the race in ~3 out of 5 runs on my system:

diff --git a/driver/cell.c b/driver/cell.c
index 67edb732..8c6e6c8b 100644
--- a/driver/cell.c
+++ b/driver/cell.c
@@ -399,12 +399,25 @@ int jailhouse_cmd_cell_start(const char __user *arg)
         return err;
  }

+static int tmp_err;
+
+static void trigger_race_via_cell_destroy(void *info)
+{
+       unsigned int cell_id = *(unsigned int*)info;
+       int err;
+       err = jailhouse_call_arg1(JAILHOUSE_HC_CELL_DESTROY, cell_id);
+       if (!err)
+               tmp_err = 0;
+}
+
  static int cell_destroy(struct cell *cell)
  {
         unsigned int cpu;
         int err;

-       err = jailhouse_call_arg1(JAILHOUSE_HC_CELL_DESTROY, cell->id);
+       tmp_err = -1;
+       on_each_cpu(trigger_race_via_cell_destroy, &cell->id, true);
+       err = tmp_err;
         if (err)
                 return err;


This test first of all causes a deadlock pretty early:

CPU 0                   CPU 1
-----                   -----
hypercall()             hypercall()
 cell_destroy()          cell_destroy()
  ...cell_suspend(root)   ...cell_suspend(root)
   arch_cpu_suspend(1)     arch_cpu_suspend(0)

=> DEADLOCK

That's kind of expected because we do not account for this concurrency
in the management API. I thought that was documented in the API
description, but it seems there is some room for improvement.

I'm still undecided if we should address such races beyond that.

Jan
