[
https://issues.apache.org/jira/browse/THRIFT-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537938#comment-15537938
]
James E. King, III edited comment on THRIFT-3932 at 10/1/16 6:15 PM:
---------------------------------------------------------------------
I found a number of issues that needed fixing:
# There are three monitors used for signaling; two of them use the same mutex
and then the workerMonitor is using its own. I made all three use the same
mutex.
# I replaced all calls to Synchronize with Guard on the mutex_ so it is a bit
clearer that there is one lock protecting everything.
# The worker run loop expires tasks as it encounters them. Now we only call
removeExpiredTasks if there's no room during add.
# I found that when worker threads are removed they are not joined; if the
thread manager is not detached we now join those threads.
# I added better documentation and simplified the Worker::run() a little so it
is easier to understand, but the logic is the same. idle_ was not used and
removed.
# I found that remove(Task) simply wasn't implemented at all! so I added code
for that and tests.
# Release mode tests of thread manager would not fail on error because they use
assert() to verify results! If we only do release builds in CI, these tests
may have been silently failing for a long time.
# ThreadManager::join() is unnecessary, as stop() will join worker threads (as
will removeWorker) depending on the ThreadFactory's detached disposition.
# Cleaned up some technical debt from the TServerFramework refactoring:
## removed ThreadFactory::setDetached() - changing detached setting while
running is dangerous!
## enforce detached disposition in constructor of TThreadPool (it only works
with joinable threads)
## fix default arguments for TThreadPool to use joinable ThreadFactory
# Cleaned up some things in the different platform thread factories. Some
abstractions were unnecessary.
# Reduced overall time on concurrency test to about 30 seconds on my dev system
while increasing the reliability and improving test coverage significantly. I
ran it 100 times in a loop.
# Added tests for most of the ThreadManager APIs.
# Tested with posix, std, and boost threads on linux.
# Tested on Windows.
# Fixed a math error in time calculation that was present on Windows (with
VS2010) which made expiring tasks impossible (unit test proved the issue and
the fix).
# [in progress] Re-enable unit tests in the appveyor build that are no longer
unstable.
was (Author: jking3):
I found a number of issues that needed fixing:
# There are three monitors used for signaling; two of them use the same mutex
and then the workerMonitor is using its own. I made all three use the same
mutex.
# I replaced all calls to Synchronize with Guard on the mutex_ so it is a bit
clearer that there is one lock protecting everything.
# I found that the task expiration mechanism that is called when adding a task
did not expire everything it could before attempting to add.
# I made the worker run loop aware of expired tasks since removeExpiredTasks
changed to a O(n) method (above). Only call removeExpiredTasks if there's no
room.
# I found that when worker threads are removed they are not joined; if the
thread manager is not detached we now join those threads.
# I added better documentation and simplified the Worker::run() a little so it
is easier to understand, but the logic is the same. idle_ was not used and
removed.
# I found that remove(Task) simply wasn't implemented at all! so I added code
for that and tests.
# Release mode tests of thread manager would not fail on error because they use
assert() to verify results! If we only do release builds in CI, these tests
may have been silently failing for a long time.
# ThreadManager::join() is unnecessary, as stop() will join worker threads (as
will removeWorker) depending on the ThreadFactory's detached disposition.
# Cleaned up some technical debt from the TServerFramework refactoring:
## removed ThreadFactory::setDetached() - changing detached setting while
running is dangerous!
## enforce detached disposition in constructor of TThreadPool (it only works
with joinable threads)
## fix default arguments for TThreadPool to use joinable ThreadFactory
# Cleaned up some things in the different platform thread factories. Some
abstractions were unnecessary.
# Reduced overall time on concurrency test to about 30 seconds on my dev system
while increasing the reliability and improving test coverage significantly. I
ran it 100 times in a loop.
# Added tests for most of the ThreadManager APIs.
# Tested with posix, std, and boost threads on linux.
# Tested on Windows.
# Fixed a math error in time calculation that was present on Windows (with
VS2010) which made expiring tasks impossible (unit test proved the issue and
the fix).
> C++ ThreadManager has a rare termination race
> ---------------------------------------------
>
> Key: THRIFT-3932
> URL: https://issues.apache.org/jira/browse/THRIFT-3932
> Project: Thrift
> Issue Type: Bug
> Components: C++ - Library
> Reporter: Buğra Gedik
> Assignee: James E. King, III
> Attachments: thrift-patch
>
> Time Spent: 8h
> Remaining Estimate: 0h
>
> {{ThreadManger::join}} calls {{stopImpl(true)}}, which in turn calls
> {{removeWorker(workerCount_);}}. The latter waits until {{while (workerCount_
> != workerMaxCount_)}}. Within the {{run}} method of the workers, the last
> thread that detects {{workerCount_ == workerMaxCount_}} notifies
> {{removeWorker}}. The {{run}} method has the following additional code that
> is executed at the very end:
> {code}
> {
> Synchronized s(manager_->workerMonitor_);
> manager_->deadWorkers_.insert(this->thread());
> if (notifyManager) {
> manager_->workerMonitor_.notify();
> }
> }
> {code}
> This is an independent synchronized block. Now assume 2 threads. One of them
> has {{notifyManager=true}} as it detected the {{workerCount_ ==
> workerMaxCount_}} condition earlier. It is possible that this thread gets to
> execute the above code block first, {{ThreadManager}}'s {{removeWorker}}
> method unblocks, and eventually {{ThreadManager}}'s {{join}} returns and the
> object is destructed. When the other thread reaches the synchronized block
> above, it will crash, as the manager is not around anymore.
> Besides, {{ThreadManager}} never joins its threads.
> Attached is a patch.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)