[jira] [Updated] (THRIFT-1795) Race condition in TThreadedServerPool java implementation

Venkat Ranganathan (JIRA) Mon, 17 Dec 2012 22:02:20 -0800

     [ 
https://issues.apache.org/jira/browse/THRIFT-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Venkat Ranganathan updated THRIFT-1795:
---------------------------------------

    Component/s: Java - Library
    Description: 
While running unit tests on flume[1], it was discovered that there have been 
intermittent failures on Redhat Linux 5.x derivatives and SUSE Linux 
environments (or the failures where more pronounced in those environments).   
While trying to debug this issue and removing Flume specific issues,  I think 
there is a race condition in Thrift TThreadPool server implementation. We can 
reduce the problem to the basic issue

The server is written in as below in its serve() method

stopped_ = false

while (!stopped_) 
   try
        serve request
   except
       log message

The stop method (executed typically from another thread) is implemented thus

stopped_ = false
interrupt and cleanup server transport



The basic thought is that stopped_ being volatile, the writes to stopped_ will 
always be seen by the reading thread and there is no other synchronization 
required, essentially relying on the JSR 133 volatile description.

The problem is that there is no check in the server run method whether it was 
asked to stop or not

For example, the Flume unit test, once again reduced to what is needed for this 
discussion, does the following for testing the flume source lifecycle

server.start -- which means starting a thread, whose run method will call  
TThreadPoolServer.serve described above
server.stop

It so happens that on Redhat Linux 5.x systems or more specifically with Linux 
kernels before the CFQ scheduler, there is a higher chance that server.stop can 
run before server thread's serve method is invoked.   This means that the 
setting of stopped_ flag in the stop method is overwritten in the serve method. 
   This causes the server to enter an infinite loop and eventually, the test 
timeout kicks in

There is an easy fix - just set stopped_ to false in the constructor.   But 
then it probably violates the contract on the state of a constructed server 
object.   I see that in couple of other TServer implementations, stopped_ is 
initialize to true and only during the listen processing, stopped_ is set to 
false.

If preserving this semantics is important, we should introduce another volatile 
variable (say started_),  and carefully set started_ and stopped_ in places 
where we are not sure stopped_ might be overwritten, and check for both flags 
also in key locations (right after a thread starts with the server() method.

This same issue can be evidenced in other variations of TServer implementations 
also.

I can provide a patch if it is agreed that this is a bug that needs to be 
fixed.  I can create a patch introducing a new started_ flag (which respects 
the current behavior in some of the classes where stopped_ is initialized to 
true), or initialize the Servers to initialize stopped_ to false and then not 
reset the stopped_ flag during the server execution (which is simplere and easy 
to fix, but makes a server just constructed to be not in stopped state).   I 
don't know how important it is to maintain that behavior or if there is any 
reliance on that behavior

Thanks

Venkat


  was:
While running unit tests on flume[1], it was discovered that there have been 
intermittent failures on Redhat Linux 5.x derivatives and SUSE Linux 
environments (or the failures where more pronounced in those environments).   
While trying to debug this issue and removing Flume specific issues,  I think 
there is a race condition in Thrift TThreadPool server implementation. We can 
reduce the problem to the basic issue

The server is written in as below in its serve() method

stopped_ = false

while (!stopped_) 
   try
        serve request
   except
       log message

The stop method (executed typically from another thread) is implemented thus

stopped_ = false
interrupt and cleanup server transport



The basic thought is that stopped_ being volatile, the writes to stopped_ will 
always be seen by the reading thread and there is no other synchronization 
required, essentially relying on the JSR 133 volatile description.

The problem is that there is no check in the server run method whether it was 
asked to stop or not

For example, the Flume unit test, once again reduced to what is needed for this 
discussion, does the following for testing the flume source lifecycle

server.start -- which means starting a thread, whose run method will call  
TThreadPoolServer.serve described above
server.stop

It so happens that on Redhat Linux 5.x systems or more specifically with Linux 
kernels before the CFQ scheduler, there is a higher chance that server.stop can 
run before server thread’s serve method is invoked.   This means that the 
setting of stopped_ flag in the stop method is overwritten in the serve method. 
   This causes the server to enter an infinite loop and eventually, the test 
timeout kicks in

There is an easy fix – just set stopped_ to false in the constructor.   But 
then it probably violates the contract on the state of a constructed server 
object.   I see that in couple of other TServer implementations, stopped_ is 
initialize to true and only during the listen processing, stopped_ is set to 
false.

If preserving this semantics is important, we should introduce another volatile 
variable (say started_),  and carefully set started_ and stopped_ in places 
where we are not sure stopped_ might be overwritten, and check for both flags 
also in key locations (right after a thread starts with the server() method.

This same issue can be evidenced in other variations of TServer implementations 
also.

I can provide a patch if it is agreed that this is a bug that needs to be 
fixed.  I can create a patch introducing a new started_ flag (which respects 
the current behavior in some of the classes where stopped_ is initialized to 
true), or initialize the Servers to initialize stopped_ to false and then not 
reset the stopped_ flag during the server execution (which is simplere and easy 
to fix, but makes a server just constructed to be not in stopped state).   I 
don't know how important it is to maintain that behavior or if there is any 
reliance on that behavior

Thanks

Venkat


    
> Race condition in TThreadedServerPool java implementation
> ---------------------------------------------------------
>
>                 Key: THRIFT-1795
>                 URL: https://issues.apache.org/jira/browse/THRIFT-1795
>             Project: Thrift
>          Issue Type: Bug
>          Components: Java - Library
>    Affects Versions: 0.6.1, 0.7, 0.8, 0.9
>         Environment: RHEL 5.x derivatives, SUSE linux are the systems where 
> this is intermittently reproducible
>            Reporter: Venkat Ranganathan
>
> While running unit tests on flume[1], it was discovered that there have been 
> intermittent failures on Redhat Linux 5.x derivatives and SUSE Linux 
> environments (or the failures where more pronounced in those environments).   
> While trying to debug this issue and removing Flume specific issues,  I think 
> there is a race condition in Thrift TThreadPool server implementation. We can 
> reduce the problem to the basic issue
> The server is written in as below in its serve() method
> stopped_ = false
> while (!stopped_) 
>    try
>         serve request
>    except
>        log message
> The stop method (executed typically from another thread) is implemented thus
> stopped_ = false
> interrupt and cleanup server transport
> The basic thought is that stopped_ being volatile, the writes to stopped_ 
> will always be seen by the reading thread and there is no other 
> synchronization required, essentially relying on the JSR 133 volatile 
> description.
> The problem is that there is no check in the server run method whether it was 
> asked to stop or not
> For example, the Flume unit test, once again reduced to what is needed for 
> this discussion, does the following for testing the flume source lifecycle
> server.start -- which means starting a thread, whose run method will call  
> TThreadPoolServer.serve described above
> server.stop
> It so happens that on Redhat Linux 5.x systems or more specifically with 
> Linux kernels before the CFQ scheduler, there is a higher chance that 
> server.stop can run before server thread's serve method is invoked.   This 
> means that the setting of stopped_ flag in the stop method is overwritten in 
> the serve method.    This causes the server to enter an infinite loop and 
> eventually, the test timeout kicks in
> There is an easy fix - just set stopped_ to false in the constructor.   But 
> then it probably violates the contract on the state of a constructed server 
> object.   I see that in couple of other TServer implementations, stopped_ is 
> initialize to true and only during the listen processing, stopped_ is set to 
> false.
> If preserving this semantics is important, we should introduce another 
> volatile variable (say started_),  and carefully set started_ and stopped_ in 
> places where we are not sure stopped_ might be overwritten, and check for 
> both flags also in key locations (right after a thread starts with the 
> server() method.
> This same issue can be evidenced in other variations of TServer 
> implementations also.
> I can provide a patch if it is agreed that this is a bug that needs to be 
> fixed.  I can create a patch introducing a new started_ flag (which respects 
> the current behavior in some of the classes where stopped_ is initialized to 
> true), or initialize the Servers to initialize stopped_ to false and then not 
> reset the stopped_ flag during the server execution (which is simplere and 
> easy to fix, but makes a server just constructed to be not in stopped state). 
>   I don't know how important it is to maintain that behavior or if there is 
> any reliance on that behavior
> Thanks
> Venkat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (THRIFT-1795) Race condition in TThreadedServerPool java implementation

Reply via email to