lhotari commented on pull request #12606:
URL: https://github.com/apache/pulsar/pull/12606#issuecomment-960458045


   > This change makes sense to me. Do you have any evidence of a problem that 
this PR is fixing?
   
   Direct evidence in a form of a failing test is hard for a lot of thread 
safety issues. 
   
   In many cases, it is worth checking that code follows the rules of the Java 
Memory Model (JMM), specified in the Java Language Specification (JLS). In Java 
on 64-bit JVMs, there isn't a problem of corruption of memory pointers since 
"word tearing" problems are prevented. The main issue is possible data 
consistency and data race issues. ["Incorrectly Synchronized Programs May 
Exhibit Surprising 
Behavior"](https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html#jls-17.4-A)
 is how JLS explains it. I guess the verification activity is like a model 
checking exercise, but at a different level.
   
   My assumption is that the use of recycled objects and sharing them between 
multiple threads leads to "incorrectly synchronized programs" within Pulsar 
code base. It doesn't corrupt memory, but we might see "surprising behavior". 
One of the surprising behavior comes from the possibility of a field value 
changing even between subsequent reads while the field value hasn't been 
changed. This is explained in [Aleksey Shipilëv's Java Memory Model Unlearning 
Experience presentation](https://www.youtube.com/watch?v=TK-7GCCDF_I).  [at 
about 21 minutes 50 
seconds](https://www.youtube.com/watch?v=TK-7GCCDF_I&t=21m50s), there's an 
interesting example:
   
![image](https://user-images.githubusercontent.com/66864/116509894-bb0e3f80-a8cc-11eb-9d02-981ac3e33732.png)
   
   In this example, the first thread makes a single change to field x and sets 
it to 1. The second thread reads this field twice. On the first read, the value 
is 1 and on the second read, the value it 0. This is something that one would 
assume that "is impossible". This is one example of surprising behavior of 
incorrectly synchronized programs.
   
   There are multiple ways to mitigate the issue. The safest general approach 
is to ensure that recycled objects are exposed only in a single thread.
   
   The main motivation of this PR is to reduce the likelyhood of "surprising 
behavior" related to the use of Netty recycled object, OpAddEntry. The Netty 
Recycler pool is thread local. When the creation  (and lookup from the Netty 
Recycler) of OpAddEntry is moved in the same thread where the OpAddEntry is 
mainly mutated and read, it reduces the likelyhood of possible data races 
caused by sharing objects across threads without proper synchronization. This 
isn't perfect, but it's an improvement. Currently ManagedLedgerImpl uses a 
single thread per ledger. There are gaps in the solution. The goal of my draft 
PR #11387 would be to continue the improvements and eventually fix the current 
thread safety issues that the use of recycled OpAddEntry and OpReadEntry 
instances introduce in Pulsar.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to