[GitHub] [pulsar] lhotari commented on pull request #12606: [ML] Avoid passing OpAddEntry across a thread boundary in asyncAddEntry

GitBox Wed, 03 Nov 2021 21:53:26 -0700


lhotari commented on pull request #12606:
URL: https://github.com/apache/pulsar/pull/12606#issuecomment-960458045

> This change makes sense to me. Do you have any evidence of a problem that
this PR is fixing?

Direct evidence in a form of a failing test is hard for a lot of thread
safety issues.

In many cases, it is worth checking that code follows the rules of the Java
Memory Model (JMM), specified in the Java Language Specification (JLS). In Java
on 64-bit JVMs, there isn't a problem of corruption of memory pointers since
"word tearing" problems are prevented. The main issue is possible data
consistency and data race issues. ["Incorrectly Synchronized Programs May
Exhibit Surprising
Behavior"](https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html#jls-17.4-A)
is how JLS explains it. I guess the verification activity is like a model
checking exercise, but at a different level.

My assumption is that the use of recycled objects and sharing them between
multiple threads leads to "incorrectly synchronized programs" within Pulsar
code base. It doesn't corrupt memory, but we might see "surprising behavior".
One of the surprising behavior comes from the possibility of a field value
changing even between subsequent reads while the field value hasn't been
changed. This is explained in [Aleksey Shipilëv's Java Memory Model Unlearning
Experience presentation](https://www.youtube.com/watch?v=TK-7GCCDF_I). [at
about 21 minutes 50
seconds](https://www.youtube.com/watch?v=TK-7GCCDF_I&t=21m50s), there's an
interesting example:

![image](https://user-images.githubusercontent.com/66864/116509894-bb0e3f80-a8cc-11eb-9d02-981ac3e33732.png)

In this example, the first thread makes a single change to field x and sets
it to 1. The second thread reads this field twice. On the first read, the value
is 1 and on the second read, the value it 0. This is something that one would
assume that "is impossible". This is one example of surprising behavior of
incorrectly synchronized programs.

There are multiple ways to mitigate the issue. The safest general approach
is to ensure that recycled objects are exposed only in a single thread.

The main motivation of this PR is to reduce the likelyhood of "surprising
behavior" related to the use of Netty recycled object, OpAddEntry. The Netty
Recycler pool is thread local. When the creation (and lookup from the Netty
Recycler) of OpAddEntry is moved in the same thread where the OpAddEntry is
mainly mutated and read, it reduces the likelyhood of possible data races
caused by sharing objects across threads without proper synchronization. This
isn't perfect, but it's an improvement. Currently ManagedLedgerImpl uses a
single thread per ledger. There are gaps in the solution. The goal of my draft
PR #11387 would be to continue the improvements and eventually fix the current
thread safety issues that the use of recycled OpAddEntry and OpReadEntry
instances introduce in Pulsar.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [pulsar] lhotari commented on pull request #12606: [ML] Avoid passing OpAddEntry across a thread boundary in asyncAddEntry

Reply via email to