Martin,

> > it's exactly the corner cases that are difficult to get right and for
> > which the completion_mutex was meant to exist. For example, what happens
> > if more than one thread calls Task::join? In that case, you call
> > wait_for_all more than once. The mutex avoids this.
>
> Currently we don't have this kind of corner case anywhere in the
> library: One needs to pass a Task to a subtask which then calls join().
> See attached test based on tests/base/task_04.cc. If I get you right,
> such a test should work, or what do you think? But it does not for my
> system with the unmodified thread_management.h (and not our modification
> from yesterday either). Wolfgang, how is it on your system?

It works on my machine. But if I understand you correctly it is not
supported because now a different thread may call wait_for_all than the
one that started the thread. So I'm ok if we simply forbid this case (if
you agree that this shouldn't work, please put a note into the Task::join
documentation to this effect). That would also obviate the need for the
mutex I believe.

> To be honest, I don't really have a clue what exactly happens. When I
> run step-12, it gets stuck before any output is produced. gdb reports
> that all the threads try to acquire a completion_mutex: thread 1 gets
> stuck in line 810 in fe_tools.cc, and all the other threads in line 768
> (where the subtasks are joined). So it seems that the subtasks that
> should release the mutex either do not release correctly or the waiting
> thread does not notice when it is released.

But shouldn't these be different completion_mutex objects? Can you verify
that they indeed live at different addresses? Can you put some output
into compute_embedding_for_shape_function to verify that these functions
indeed run to completion?

> ==30865==    by 0x4D139C: dealii::Threads::internal::TaskDescriptor<>::join() (thread_management.h:3981)
> ==30865==    by 0x4CD21F: dealii::Threads::Task<>::join() const (thread_management.h:4089)
> ==30865==    by 0x4C8531: dealii::Threads::TaskGroup<>::join_all() const (thread_management.h:5116)
> ==30865==    by 0x832318E: void dealii::FETools::(anonymous namespace)::compute_embedding_matrices_for_refinement_case<3, double, 3>(dealii::FiniteElement<3, 3> const&, std::vector<dealii::FullMatrix<double>, std::allocator<dealii::FullMatrix<double> > >&, unsigned int) (fe_tools.cc:768)
> ...
> ==30865== mutex 0x16b677a0 was first observed at:
> ==30865==    at 0x4C261AF: pthread_mutex_lock (drd_pthread_intercepts.c:584)
> ==30865==    by 0x4A614F: __gthread_mutex_lock(pthread_mutex_t*) (gthr-default.h:758)
> ==30865==    by 0x4AC59F: std::mutex::lock() (mutex:88)
> ==30865==    by 0x4AC601: dealii::Threads::Mutex::acquire() (thread_management.h:390)
> ==30865==    by 0x4D60D8: dealii::Threads::internal::TaskDescriptor<>::queue_task() (thread_management.h:3900)
> ==30865==    by 0x4D11A0: dealii::Threads::Task<>::Task(std::function<> const&) (thread_management.h:4060)
> ...
> ==30865==    by 0x8323142: void dealii::FETools::(anonymous namespace)::compute_embedding_matrices_for_refinement_case<3, double, 3>(dealii::FiniteElement<3, 3> const&, std::vector<dealii::FullMatrix<double>, std::allocator<dealii::FullMatrix<double> > >&, unsigned int) (fe_tools.cc:764)
> ...
>
> This error message says that we try to recursively acquire the same
> mutex (on the _same_ thread): We first acquire it in
> TaskDescriptor::queue_task (line 3900 in thread_management.h), and then
> try to acquire it again when we wait for the child to finish in
> TaskDescriptor::join (line 3981) in order to know whether the child has
> released it. As far as I understand things, it is system-dependent
> whether the same process can acquire a lock that it already holds and
> might even give unpredictable results (as I see them here). This is
> opposed to _other_ threads trying to acquire the mutex, which is the
> usual case and why one wants to use mutexes in the first place.
> Wolfgang, do you see a solution for this?

This seems wrong indeed. I suppose the idea here was to avoid calling
join() before the thread has even started.
I imagined that the call to pthread_mutex_lock in join() would simply
block when called before the task has released the lock (and that may
well be what happens on my system) but what you're saying is that it
simply acquires the lock again, defeating the purpose. Is there a way to
achieve the intended effect, e.g. by not calling pthread_mutex_lock but
some other pthread_mutex_* function?

Best
 W.

-------------------------------------------------------------------------
Wolfgang Bangerth                email: [email protected]
                                 www: http://www.math.tamu.edu/~bangerth/

_______________________________________________
dealii mailing list http://poisson.dealii.org/mailman/listinfo/dealii
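
On the question of which primitive could give the intended effect, here is
a minimal sketch, assuming C++11 and POSIX threads are available. It is not
what thread_management.h does; the names relock_error_code, CompletionFlag
and run_with_completion_flag are hypothetical and exist only for this
illustration. The first function shows that a pthread mutex of type
PTHREAD_MUTEX_ERRORCHECK reports EDEADLK when the owning thread locks it
again, whereas for a default-type mutex (and for std::mutex) relocking on
the same thread is undefined, which would be consistent with the differing
behavior seen on the two systems. The second part sketches one possible
alternative: a completion flag built on a condition variable, so that the
waiting thread never needs to lock something it already holds, and so that
waiting more than once or from another thread is harmless.

```cpp
#include <cerrno>
#include <condition_variable>
#include <mutex>
#include <thread>

#include <pthread.h>

// Returns the error code obtained when the SAME thread locks an
// error-checking pthread mutex twice. POSIX specifies EDEADLK for this
// mutex type; for the default type the behavior is undefined.
int relock_error_code()
{
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);

  pthread_mutex_t mutex;
  pthread_mutex_init(&mutex, &attr);
  pthread_mutexattr_destroy(&attr);

  pthread_mutex_lock(&mutex);                    // first lock succeeds
  const int result = pthread_mutex_lock(&mutex); // same thread locks again

  pthread_mutex_unlock(&mutex);
  pthread_mutex_destroy(&mutex);
  return result;
}

// Hypothetical helper (not part of deal.II): a one-shot completion flag.
class CompletionFlag
{
public:
  // Called by the task once its work is finished.
  void signal()
  {
    {
      std::lock_guard<std::mutex> lock(mutex);
      done = true;
    }
    condition.notify_all();
  }

  // Blocks until signal() has been called; returns immediately if it
  // already has been, so it may be called repeatedly and from any thread.
  void wait()
  {
    std::unique_lock<std::mutex> lock(mutex);
    condition.wait(lock, [this] { return done; });
  }

private:
  std::mutex              mutex;
  std::condition_variable condition;
  bool                    done = false;
};

// Starts a worker that stores a value and then signals completion. The
// calling thread waits on the flag twice to show that repeated waits are
// safe, then reads the value written by the worker.
int run_with_completion_flag()
{
  CompletionFlag flag;
  int            value = 0;

  std::thread worker([&] {
    value = 42;
    flag.signal();
  });

  flag.wait();
  flag.wait();
  const int observed = value;

  worker.join();
  return observed;
}
```

In this sketch the mutex only protects the `done` flag for a short moment
and is always released by the thread that acquired it, so the question of
same-thread relocking does not arise. Whether this fits the existing
TaskDescriptor design is an open question that the sketch does not settle.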
