lhotari commented on pull request #9764: URL: https://github.com/apache/pulsar/pull/9764#issuecomment-788689691
> The primary goal of having these custom concurrent maps implementation was actually to avoid memory allocations to the max possible extent :)
>
> Even if these allocations are short lived, they still contribute to generate garbage that will fill up the new-gen space at a faster rate. That in turn will cause more "in-flight", reasonably short-lived objects to get caught in a collection event and copied over to old-gen spaces, where they'll have to get compacted later on.

Yes, it makes sense to avoid memory allocations in general, but here there is a reason to make a compromise in order to make Pulsar more stable. In this case, the memory allocated in `forEachInSnapshot` is a tiny fraction of all memory allocations happening. The `ConcurrentOpenHashMap.forEach` calls aren't on the fast path of processing, where a lot of garbage could be generated. The amount of memory allocated is on the order of a few kilobytes per second, since only a few object references per item consume memory (about 28 bytes per item: 16+4+4 bytes for the Entry, plus 4 bytes for each item in the ArrayList; a sketch of the snapshot iteration is included at the end of this comment). Is GC pressure really a problem in this case, given how `.forEach` is used in the Pulsar code base?

> We shouldn't be doing blocking calls from within the process method. I think we should rather refactor that code instead.

This is easier said than done. Refactoring the code in that way is a major undertaking, and it won't happen quickly. There's plenty of code that makes blocking calls within `forEach` or `putIfAbsent` methods. My assumption is that ZooKeeper is involved in a lot of the cases.

After reviewing how the current `forEach` method relies on optimistic locking, I don't think there is a risk from a data consistency perspective in switching to the `forEachInSnapshot` method for processing elements of `ConcurrentOpenHashMap`s. My major concern initially was that the behavior would be a lot different; that is no longer a concern for me.

I have added a few signs of deadlocks from ReplicatorTest in the comments of this PR. All such issues go away after applying the changes in this PR. That's why I'm confident that this PR is useful. In the past, I have investigated problems in real environments where the symptoms observed in thread dumps look similar to the thread dumps from stalled ReplicatorTest runs (a minimal reproduction of the deadlock pattern is sketched at the end of this comment). A certain set of problems seems to happen only in Pulsar deployments with a large number of topics. When there are a lot of topics, the `.forEach` iterations take a long time, and there's a much higher chance that the nondeterministic deadlocks occur. That is something the changes in this PR would prevent.

**Perhaps we could evaluate each individual usage of `.forEach`, case by case, and decide whether to use `.forEach` or `.forEachInSnapshot`? When there's a chance that the processing uses blocking methods, we would choose `.forEachInSnapshot`, and if it's computation-only, `.forEach` would be used. @merlimat WDYT?**
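To make the case-by-case proposal concrete, here is a minimal sketch of the two iteration styles. This is **not** the actual `ConcurrentOpenHashMap` implementation — it uses a plain `ConcurrentHashMap` as a stand-in, and the class and method names (`SnapshotIterationSketch`, `forEachLive`) are illustrative only:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Illustrative sketch of live vs. snapshot iteration (not the Pulsar code).
public class SnapshotIterationSketch<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        map.put(key, value);
    }

    // Live iteration, standing in for ConcurrentOpenHashMap.forEach: the
    // processor runs entry by entry while the map is being traversed.
    public void forEachLive(BiConsumer<K, V> processor) {
        map.forEach(processor);
    }

    // Snapshot iteration, standing in for forEachInSnapshot: copy the entries
    // first, then run the processor with no map traversal in progress, so
    // blocking inside the processor cannot stall the map itself. The only
    // extra allocation is the snapshot -- roughly the "28 bytes per item"
    // estimated above (one Entry object plus one reference slot in the
    // ArrayList's backing array).
    public void forEachInSnapshot(BiConsumer<K, V> processor) {
        List<Map.Entry<K, V>> snapshot = new ArrayList<>(map.size());
        map.forEach((k, v) -> snapshot.add(new AbstractMap.SimpleImmutableEntry<>(k, v)));
        for (Map.Entry<K, V> entry : snapshot) {
            processor.accept(entry.getKey(), entry.getValue());
        }
    }
}
```

With that split, a call site that might block (for example on ZooKeeper) would be migrated to `forEachInSnapshot`, while purely computational call sites would keep the allocation-free `forEach`.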

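And here is a minimal, self-contained sketch of the kind of deadlock pattern the stalled thread dumps point at, boiled down to a `StampedLock` standing in for one map section's lock and a single-thread executor standing in for a ZooKeeper/metadata callback thread. All names here (`ForEachDeadlockSketch`, `sectionLock`, `callbackThread`, `pendingResult`) are assumptions for illustration, not the actual Pulsar internals:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.StampedLock;

// Illustrative reproduction of the deadlock pattern: forEach blocks inside
// its per-entry processor while holding a section's read lock, and the only
// thread that could unblock it needs that section's write lock.
public class ForEachDeadlockSketch {
    public static void main(String[] args) throws Exception {
        StampedLock sectionLock = new StampedLock();
        ExecutorService callbackThread = Executors.newSingleThreadExecutor();
        CompletableFuture<Void> pendingResult = new CompletableFuture<>();

        // "forEach" path: the section's read lock is held while the
        // per-entry processor runs.
        long readStamp = sectionLock.readLock();
        try {
            callbackThread.submit(() -> {
                // "ZooKeeper callback" path: a put() on the same section
                // needs the write lock, so it waits for forEach to finish.
                long writeStamp = sectionLock.writeLock();
                try {
                    pendingResult.complete(null);
                } finally {
                    sectionLock.unlockWrite(writeStamp);
                }
            });
            // The per-entry processor blocks on a result that only the
            // callback thread can produce: neither side can make progress.
            pendingResult.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException deadlocked) {
            System.out.println("Deadlock reproduced: the reader waits for the "
                    + "callback, the callback waits for the write lock.");
        } finally {
            sectionLock.unlockRead(readStamp);
        }
        callbackThread.shutdown();
    }
}
```

In the real code the blocking call is buried several layers deep inside the processor, which is why auditing each `.forEach` call site case by case seems more realistic than refactoring all the blocking paths at once.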