Divye Kapoor created THRIFT-5430:
------------------------------------

             Summary: [Deadlock] FieldMetaData synchronized method can trigger 
deadlock during static class initialization
                 Key: THRIFT-5430
                 URL: https://issues.apache.org/jira/browse/THRIFT-5430
             Project: Thrift
          Issue Type: Bug
          Components: Java - Library
    Affects Versions: 0.14.1, 0.14.0, 0.13.0, 0.12.0, 0.11.0, 0.10.0, 0.9.3.1, 
0.9.3, 0.9.2, 0.9.1, 0.9, 0.8, 0.7, 0.6.1, 0.6
         Environment: AdoptJDK 1.8 (Java 8) ; Linux, AWS machines. 

Deadlock is environment independent and occurs in conformance with the JRE 8 
spec.

 
            Reporter: Divye Kapoor


We (Pinterest) hit the following deadlock during JVM classloading of Thrift 
classes. The root cause was triggering sClass.newInstance() while holding the 
synchronized lock on FieldMetaData.class::getStructMetaDataMap(..)

Here's the stacktraces of interest: 
Thread 1:
{code:java}
Stack Trace is:
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)  // 
stuck here for > 30 mins
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442) // creating an object via 
reflection.
at 
org.apache.thrift.meta_data.FieldMetaData.getStructMetaDataMap(FieldMetaData.java:61)-
 locked java.lang.Class@346242bb // Lock held on FieldMetaData.class
at 
com.pinterest.commons.thrift.ThriftStructDescriptor.<init>(ThriftStructDescriptor.java:56)
at 
com.pinterest.commons.thrift.ThriftStructDescriptor.getDescriptor(ThriftStructDescriptor.java:38)
 {code}
Thread 2:
{code:java}
Stack Trace is:
at 
org.apache.thrift.meta_data.FieldMetaData.addStructMetaDataMap(FieldMetaData.java:49)
- blocked on java.lang.Class@346242bb  // Lock held by Thread 1.
at 
com.pinterest.schemas.search.query.NavboostScore.<clinit>(NavboostScore.java:146)
at 
com.pinterest.schemas.search.query.NavboostScores.read(NavboostScores.java:334)
at 
com.pinterest.schemas.search.query.SupplementaryTerm.read(SupplementaryTerm.java:1209)
at 
com.pinterest.schemas.search.query.SupplementaryTerms.read(SupplementaryTerms.java:335)
at com.pinterest.schemas.search.query.QueryJoin.read(QueryJoin.java:6514)
at 
com.pinterest.pinpin.thrift.SerializedResourceContainer.read(SerializedResourceContainer.java:2867)
at org.apache.thrift.TDeserializer.deserialize(TDeserializer.java:69) // 
General Thrift deserialize. {code}

Here's the code of interest:

[https://github.com/apache/thrift/blob/65fb49bb41f852375b278c9057d52c9472f0cb3a/lib/java/src/org/apache/thrift/meta_data/FieldMetaData.java#L61]
 


Thread 1 has the following lock acquisition order:
{noformat}
FieldMetaData.class::getStructMetaDataMap -> lock FieldMetaData.class -> 
sClass.newInstance() -> load sClass -> JVM init_lock -> waiting for Thread 2 to 
complete static initialization. {noformat}
Thread 2 has the following lock acquisition order:
{noformat}
sClass.newInstance() (from new ThriftObject() / deserialization) ->  load 
sClass -> static initialization -> try locking FieldMetaData.class when calling 
FieldMetaData.addStructMetaDataMap{noformat}
Internally, this was a fairly detailed investigation. Would be great if we 
agree on a Fix approach. This deadlock affects all versions of Thrift since 0.9 
which introduced the synchronized keyword. Versions prior to 0.9 are simply 
thread unsafe (because there's no lock on FieldMetaData's internal HashMap and 
the two static method calls can race). I didn't check versions below 0.5 but 
probably this extends all the way back. 

This is not an issue in fbThrift, which correctly uses a ConcurrentHashMap and 
does not take a lock on FieldMetaData. See this code here:

[https://github.com/facebook/fbthrift/blob/c0f8ffb345d847df512ae9c404a58379fcd51cb9/thrift/lib/java/src/main/java/com/facebook/thrift/meta_data/FieldMetaData.java]
 vs the buggy code here:

[https://github.com/apache/thrift/blob/v0.14.1/lib/java/src/org/apache/thrift/meta_data/FieldMetaData.java]
 

Another alternative approach is to use 
{code:java}
synchronized (structMap} {
 ... 
} {code}
instead of the synchronized keyword on the methods. 

Please confirm which approach is preferred and we can send a PR on Github. 

 

Here's the original deadlock hypothesis as outlined by Jiahuan Liu:

The hypothesis of the deadlock:
ThriftStructDescriptor.get is used in many places for deserialization and it 
calls FieldMetaData.getStructMetaDataMap. inside this method, it tries to load 
the thrift class if the thrift has not been loaded yet. thrift classes call 
FieldMetaData.addStructMetaDataMap in a static block. so the deadlock may 
happen when:
 * thread A is trying to deserialize -> getStructMetaDataMap -> class loading
 * if thread B is already trying to load the same thrift, it requires 
addStructMetaDataMap to complete but this method is blocked by thread A 
getStructMetaDataMap
 * the class loading in thread A will be blocked by thread B because class 
loading is also synchronized by java. see this section in java spec:

 If the Class object for C indicates that initialization is in progress for C 
by some other thread, then release LC and block the current thread until 
informed that the in-progress initialization has completed, at which time 
repeat this step.

Here's the classloading spec in java 8:
[https://docs.oracle.com/javase/specs/jls/se8/html/jls-12.html#jls-12.4.2] 

Thread 1 is stuck in step 2 of the initialization (waiting for Thread 2 to 
notify in Step 10).
Thread 2 is stuck in step 9 of the initialization and never makes it to step 
10. 

Please reach out if this is unclear. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to