Sorry, I fat-finger-sent an incomplete reply, disregard it. This e-mail contains all the comments.
On 18 September 2017 at 18:52, Jonathan Rajotte <[email protected]> wrote: > RFC: > This is necessary since we use the ust_app_ht_by_notify_sock to > perform the cleanup operation. Since both ust_app_unregister and > ust_app_ust_app_notify_unregister perform a remove of the app on the > ust_ust_app_by_notify_sock but only ust_app_notify_unregister actually There are some extra ust_ and ust_app_ on the last two lines ;) > end up performing a close(call_rcu) on the socket. a close(call_rcu) ? > > Other way to do fix this problem? > Let me walk through the problem as I understand it and we'll see if/where I am going off-course. An ust_app has two sockets: - command (through which the sessiond sends commands to the app) - notify (through which the application notifies the sessiond of various conditions such as new events being available, etc.) These two sockets are received from applications by the reg_apps_thread [1] which posts them to the dispatch_thread [2]. Once the dispatch thread is aware of both sockets for a given app, the two sockets are "dispatched" to their respective management threads. * The "command" socket is sent to the "apps_thread" [3] * The "notify" socket is sent to the "apps_notify_thread" [4] Let's look at what happens when an application dies. The "apps_thread" (handles application command sockets): - wakes up from its poll() and notices an error/hang-up has occurred on an application's command socket - calls ust_app_unregister() on with that socket's FD - retrieves the ust_app from the socket's FD through ust_app_ht_by_sock - flushes that application's buffers and metadata - removes the app from ust_app_ht_by_sock - removes the app from ust_app_ht_by_notify_sock (it's okay for this to fail) - removes the app from ust_app_ht (pid <-> ust_app) - enqueues a call_rcu() job to close the command socket The "apps_notify_thread" (handles application notification sockets): - wakes up from its poll() and notices an error/hand-up has occurred on an application's notification socket - calls ust_app_notify_sock_unregister() - removes the app from ust_app_ht_by_notify_sock (it's okay for this to fail since both threads are trying to clean this up) - call_rcu() to close the notify socket Now, provided that I understand the problem correctly (as stated in my reply to patch #06), you want to clean-up the command and notify sockets in the same way that they would be if their associated apps had died. That's fair. However, as seen above, cleaning up the "apps_thread" will cause it to empty all the ust_app data structures: * ust_app_ht_by_sock * ust_app_ht_by_notify_sock * ust_app_ht However, the "apps_notify_thread" needs at least one of those data structures to be present to iterate on all of the applications and perform its clean-up. Hence, you want to make sure that the "apps_notify_thread" can complete its shutdown before the "apps_thread" starts its own clean-up. Correct? I would be okay with reordering the threads' teardown to provide that guarantee. My only gripe is that it needs to be documented _extensively_ as it is not obvious at all that such a dependency exists between those threads. On the other hand, it seems to me that it would be simpler to _not_ perform the clean-up you added in the "apps_thread" and leave that to the sessiond_cleanup(), once all threads have been torn down. Not performing the clean-up in the "apps_thread" allows the clean-up (that you added) to occur in the "apps_notify_thread" as the data structures (such as ust_app_ht) are still in place. As a result of the "apps_notify_thread" clean-up, the notify sockets would eventually be closed by through the call_rcu()'s during the next grace period. This will then unblock the applications that are stuck waiting on their notify socket. Would that solve the problem or am I missing something? Jérémie [1] https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-sessiond/main.c#L2005 [2] https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-sessiond/main.c#L1749 [3] https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-sessiond/main.c#L1450 [4] https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-sessiond/ust-thread.c#L33 > Could we simply not remove it on a ust_app_unregister? And always defer > to the apps_notify_thread for cleanup? > > Update the value in the hash table to -1 and emit a close and remove > from the hash table if the value is -1? > > We could also keep a local list of fd in apps_notify_thread and use it for > cleanup instead of relying on ust_ust_app_by_notify_sock. > > I'm not sure what is the best/elegant solution here. I am not a fan of > the current solution but it working. > > Obviously this commit will be reworded and modified accordingly. > > Signed-off-by: Jonathan Rajotte <[email protected]> > --- > src/bin/lttng-sessiond/main.c | 55 > ++++++++++++++++++++++++------------------- > 1 file changed, 31 insertions(+), 24 deletions(-) > > diff --git a/src/bin/lttng-sessiond/main.c b/src/bin/lttng-sessiond/main.c > index 4a2a661f..216d0da6 100644 > --- a/src/bin/lttng-sessiond/main.c > +++ b/src/bin/lttng-sessiond/main.c > @@ -6209,16 +6209,6 @@ int main(int argc, char **argv) > } > notification_thread_running = true; > > - /* Create thread to manage application notify socket */ > - ret = pthread_create(&apps_notify_thread, default_pthread_attr(), > - ust_thread_manage_notify, (void *) NULL); > - if (ret) { > - errno = ret; > - PERROR("pthread_create notify"); > - retval = -1; > - stop_threads(); > - goto exit_apps_notify; > - } > > /* Create thread to manage application socket */ > ret = pthread_create(&apps_thread, default_pthread_attr(), > @@ -6231,6 +6221,17 @@ int main(int argc, char **argv) > goto exit_apps; > } > > + /* Create thread to manage application notify socket */ > + ret = pthread_create(&apps_notify_thread, default_pthread_attr(), > + ust_thread_manage_notify, (void *) NULL); > + if (ret) { > + errno = ret; > + PERROR("pthread_create notify"); > + retval = -1; > + stop_threads(); > + goto exit_apps_notify; > + } > + > /* Create thread to dispatch registration */ > ret = pthread_create(&dispatch_thread, default_pthread_attr(), > thread_dispatch_ust_registration, (void *) NULL); > @@ -6358,20 +6359,6 @@ exit_reg_apps: > } > > exit_dispatch: > - /* Instruct the apps thread to quit */ > - ret = notify_thread_pipe(thread_apps_teardown_trigger_pipe[1]); > - if (ret < 0) { > - ERR("write error on thread quit pipe"); > - } > - > - ret = pthread_join(apps_thread, &status); > - if (ret) { > - errno = ret; > - PERROR("pthread_join apps"); > - retval = -1; > - } > - > -exit_apps: > /* Instruct the apps_notify thread to quit */ > ret = notify_thread_pipe(thread_apps_notify_teardown_trigger_pipe[1]); > if (ret < 0) { > @@ -6386,6 +6373,26 @@ exit_apps: > } > > exit_apps_notify: > + /* > + * The barrier ensure that all previous resources, notify sockets in > + * particular, are freed/closed. > + */ > + rcu_barrier(); > + > + /* Instruct the apps thread to quit */ > + ret = notify_thread_pipe(thread_apps_teardown_trigger_pipe[1]); > + if (ret < 0) { > + ERR("write error on thread quit pipe"); > + } > + > + ret = pthread_join(apps_thread, &status); > + if (ret) { > + errno = ret; > + PERROR("pthread_join apps"); > + retval = -1; > + } > + > +exit_apps: > exit_notification: > exit_health: > exit_init_data: > -- > 2.11.0 > -- Jérémie Galarneau EfficiOS Inc. http://www.efficios.com _______________________________________________ lttng-dev mailing list [email protected] https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
