Venkat Ranganathan created THRIFT-1795:
------------------------------------------
Summary: Race condition in TThreadedServerPool java implementation
Key: THRIFT-1795
URL: https://issues.apache.org/jira/browse/THRIFT-1795
Project: Thrift
Issue Type: Bug
Affects Versions: 0.9, 0.8, 0.7, 0.6.1
Environment: RHEL 5.x derivatives, SUSE linux are the systems where
this is intermittently reproducible
Reporter: Venkat Ranganathan
While running unit tests on flume[1], it was discovered that there have been
intermittent failures on Redhat Linux 5.x derivatives and SUSE Linux
environments (or the failures where more pronounced in those environments).
While trying to debug this issue and removing Flume specific issues, I think
there is a race condition in Thrift TThreadPool server implementation. We can
reduce the problem to the basic issue
The server is written in as below in its serve() method
stopped_ = false
while (!stopped_)
try
serve request
except
log message
The stop method (executed typically from another thread) is implemented thus
stopped_ = false
interrupt and cleanup server transport
The basic thought is that stopped_ being volatile, the writes to stopped_ will
always be seen by the reading thread and there is no other synchronization
required, essentially relying on the JSR 133 volatile description.
The problem is that there is no check in the server run method whether it was
asked to stop or not
For example, the Flume unit test, once again reduced to what is needed for this
discussion, does the following for testing the flume source lifecycle
server.start -- which means starting a thread, whose run method will call
TThreadPoolServer.serve described above
server.stop
It so happens that on Redhat Linux 5.x systems or more specifically with Linux
kernels before the CFQ scheduler, there is a higher chance that server.stop can
run before server thread’s serve method is invoked. This means that the
setting of stopped_ flag in the stop method is overwritten in the serve method.
This causes the server to enter an infinite loop and eventually, the test
timeout kicks in
There is an easy fix – just set stopped_ to false in the constructor. But
then it probably violates the contract on the state of a constructed server
object. I see that in couple of other TServer implementations, stopped_ is
initialize to true and only during the listen processing, stopped_ is set to
false.
If preserving this semantics is important, we should introduce another volatile
variable (say started_), and carefully set started_ and stopped_ in places
where we are not sure stopped_ might be overwritten, and check for both flags
also in key locations (right after a thread starts with the server() method.
This same issue can be evidenced in other variations of TServer implementations
also.
I can provide a patch if it is agreed that this is a bug that needs to be
fixed. I can create a patch introducing a new started_ flag (which respects
the current behavior in some of the classes where stopped_ is initialized to
true), or initialize the Servers to initialize stopped_ to false and then not
reset the stopped_ flag during the server execution (which is simplere and easy
to fix, but makes a server just constructed to be not in stopped state). I
don't know how important it is to maintain that behavior or if there is any
reliance on that behavior
Thanks
Venkat
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira