[
https://issues.apache.org/jira/browse/THRIFT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jake Farrell reopened THRIFT-1795:
----------------------------------
Assignee: Jake Farrell (was: Roger Meier)
> Race condition in TThreadedServerPool java implementation
> ---------------------------------------------------------
>
> Key: THRIFT-1795
> URL: https://issues.apache.org/jira/browse/THRIFT-1795
> Project: Thrift
> Issue Type: Bug
> Components: Java - Library
> Affects Versions: 0.6.1, 0.7, 0.8, 0.9
> Environment: RHEL 5.x derivatives, SUSE linux are the systems where
> this is intermittently reproducible
> Reporter: Venkat Ranganathan
> Assignee: Jake Farrell
> Attachments: Thrift-1795.diff
>
>
> While running unit tests on flume[1], it was discovered that there have been
> intermittent failures on Redhat Linux 5.x derivatives and SUSE Linux
> environments (or the failures where more pronounced in those environments).
> While trying to debug this issue and removing Flume specific issues, I think
> there is a race condition in Thrift TThreadPool server implementation. We can
> reduce the problem to the basic issue
> The server is written in as below in its serve() method
> stopped_ = false
> while (!stopped_)
> try
> serve request
> except
> log message
> The stop method (executed typically from another thread) is implemented thus
> stopped_ = false
> interrupt and cleanup server transport
> The basic thought is that stopped_ being volatile, the writes to stopped_
> will always be seen by the reading thread and there is no other
> synchronization required, essentially relying on the JSR 133 volatile
> description.
> The problem is that there is no check in the server run method whether it was
> asked to stop or not
> For example, the Flume unit test, once again reduced to what is needed for
> this discussion, does the following for testing the flume source lifecycle
> server.start -- which means starting a thread, whose run method will call
> TThreadPoolServer.serve described above
> server.stop
> It so happens that on Redhat Linux 5.x systems or more specifically with
> Linux kernels before the CFQ scheduler, there is a higher chance that
> server.stop can run before server thread's serve method is invoked. This
> means that the setting of stopped_ flag in the stop method is overwritten in
> the serve method. This causes the server to enter an infinite loop and
> eventually, the test timeout kicks in
> There is an easy fix - just set stopped_ to false in the constructor. But
> then it probably violates the contract on the state of a constructed server
> object. I see that in couple of other TServer implementations, stopped_ is
> initialize to true and only during the listen processing, stopped_ is set to
> false.
> If preserving this semantics is important, we should introduce another
> volatile variable (say started_), and carefully set started_ and stopped_ in
> places where we are not sure stopped_ might be overwritten, and check for
> both flags also in key locations (right after a thread starts with the
> server() method.
> This same issue can be evidenced in other variations of TServer
> implementations also.
> I can provide a patch if it is agreed that this is a bug that needs to be
> fixed. I can create a patch introducing a new started_ flag (which respects
> the current behavior in some of the classes where stopped_ is initialized to
> true), or initialize the Servers to initialize stopped_ to false and then not
> reset the stopped_ flag during the server execution (which is simplere and
> easy to fix, but makes a server just constructed to be not in stopped state).
> I don't know how important it is to maintain that behavior or if there is
> any reliance on that behavior
> Thanks
> Venkat
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira