lhotari commented on pull request #9764:
URL: https://github.com/apache/pulsar/pull/9764#issuecomment-788689691


   > The primary goal of having these custom concurrent maps implementation was 
actually to avoid memory allocations to the max possible extent :)
   > 
   > Even if these allocations are short lived, they still contribute to 
generate garbage that will fill up the new-gen space at a faster rate. That in 
turn will cause more "in-flight", reasonably short-lived objects to get caught 
in a collection event and copied over to old-gen spaces, where they'll have to 
get compacted later on.
   
   Yes, it makes sense to avoid memory allocations in general, but in this case there is reason to make a compromise in order to make Pulsar more stable.
   
   In this case, the memory allocated in `forEachInSnapshot` is a tiny fraction of all memory allocations happening. The `ConcurrentOpenHashMap.forEach` calls aren't on the fast path of processing where a lot of garbage could be generated. The amount of memory allocated is on the order of a few kilobytes per second, since only a few object references per entry consume memory (about 28 bytes per item: 16+4+4 bytes for an Entry object, plus 4 bytes for each reference slot in the ArrayList). Is GC pressure really a problem in this case, given how `.forEach` is used in the Pulsar code base?
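   To make the allocation argument concrete, here is a minimal, hypothetical sketch of what a snapshot-style iteration could look like (this is not the actual Pulsar `ConcurrentOpenHashMap` code; the map and helper names are illustrative). The only extra garbage is the temporary list plus one small entry holder per item, and the callback runs against the snapshot, so the map can be mutated freely while iterating:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Hypothetical sketch of a snapshot-based forEach. Only references are copied;
// no deep copies of keys or values are made.
public class SnapshotIteration {
    static <K, V> void forEachInSnapshot(Map<K, V> map, BiConsumer<K, V> action) {
        List<Map.Entry<K, V>> snapshot = new ArrayList<>(map.size());
        // Collect entry references while reading the map.
        map.forEach((k, v) -> snapshot.add(Map.entry(k, v)));
        // Invoke the (possibly blocking) callback outside of any map-internal
        // locking, so a slow callback cannot stall concurrent writers.
        for (Map.Entry<K, V> e : snapshot) {
            action.accept(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        List<String> seen = new ArrayList<>();
        // Mutating the map from the callback is safe: iteration runs over the snapshot.
        forEachInSnapshot(map, (k, v) -> {
            map.remove(k);
            seen.add(k + "=" + v);
        });
        if (seen.size() != 2) throw new AssertionError("expected 2 entries, got " + seen.size());
        if (!map.isEmpty()) throw new AssertionError("map should be empty after removals");
        System.out.println("OK");
    }
}
```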
   
   > We shouldn't be doing blocking calls from within the process method. I 
think we should rather refactor that code instead.
   
   This is easier said than done. Refactoring the code in that way is a major undertaking and won't happen quickly. There's plenty of code that makes blocking calls within `forEach` or `putIfAbsent` callbacks. My assumption is that Zookeeper is involved in a lot of those cases.
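   As an illustration of the hazard, here is a small runnable sketch of the general pattern (the `ReentrantLock` is a stand-in for a map's internal per-section locking, not Pulsar's actual implementation): while a blocking callback runs under the section lock, any writer needing that same section stalls, which is the precondition for the deadlocks seen here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: shows why running a blocking callback while holding a
// section lock stalls writers. The lock models map-internal locking.
public class BlockingForEachHazard {
    public static void main(String[] args) throws Exception {
        ReentrantLock sectionLock = new ReentrantLock();
        CountDownLatch callbackRunning = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // "forEach" thread: holds the section lock while a blocking callback runs.
        pool.submit(() -> {
            sectionLock.lock();
            try {
                callbackRunning.countDown();
                release.await(); // simulates a blocking call (e.g. waiting on a remote service)
            } catch (InterruptedException ignored) {
            } finally {
                sectionLock.unlock();
            }
        });

        callbackRunning.await();
        // "putIfAbsent"-style writer: needs the same section lock and stalls.
        boolean acquired = sectionLock.tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            sectionLock.unlock();
            throw new AssertionError("writer should have been blocked by the callback");
        }
        release.countDown();
        // Once the callback finishes, the writer can proceed.
        acquired = sectionLock.tryLock(2, TimeUnit.SECONDS);
        if (!acquired) throw new AssertionError("writer should proceed after callback completes");
        sectionLock.unlock();
        pool.shutdown();
        System.out.println("OK");
    }
}
```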
   
   After reviewing how the current `forEach` method relies on optimistic locking, I don't think switching to the `forEachInSnapshot` method for processing elements of `ConcurrentOpenHashMap` poses a risk from a data-consistency perspective. My major initial concern was that the behavior would be significantly different; that is no longer a concern for me.
   
   I have added a few signs of deadlocks from ReplicatorTest in the comments of this PR. All of those issues go away after applying the changes in this PR, which is why I'm confident that this PR is useful.
   
   In the past, I have investigated problems in real environments where the symptoms observed in thread dumps look similar to the thread dumps from stalled ReplicatorTest runs. A certain set of problems seems to happen only in Pulsar deployments with a large number of topics. When there are a lot of topics, the `.forEach` iterations take a long time, and there is a much higher chance that the nondeterministic deadlocks occur. That is something the changes in this PR would prevent.
   
   **Perhaps we could evaluate each individual usage of `.forEach`, case by case, and decide whether to use `.forEach` or `.forEachInSnapshot`? When there's a chance that the processing uses blocking methods, we would choose `.forEachInSnapshot`; if it's computation-only, `.forEach` would be used. @merlimat  WDYT?**
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

