merlimat commented on pull request #9764: URL: https://github.com/apache/pulsar/pull/9764#issuecomment-789930324
> It's not really a defensive copy of the entire map. The copy is done one section at a time.

By defensive, I mean that we'd be taking a copy of the entire map (section by section) each time we need to traverse it, even when there is no need to (eg: when a section is not being modified). For a map that is not updated super-frequently, that would impose a sizeable overhead.

> Memory allocations and GC is extremely efficient in modern JVMs.

I've been battling GC and allocations for many years now, and while that might be true to a certain extent, there are also many bad GC behaviors that we have fixed with changes like switching from the regular `ConcurrentHashMap` to this custom implementation.

> Since it's about 28 bytes per entry, there would have to be millions of entries for the memory allocation overhead to have any noticeable significance. If necessary, this could be benchmarked. Since there aren't millions of entries, but usually a few thousand entries, it has even less significance. This is why I don't see a reason to avoid the copying in this case.

The problem is that there can easily be millions of entries (aggregated from 100s of thousands of maps) on a single instance. That has been the case in many production deployments for many years.

> We do have IO operations (interactions with Zookeeper) in most of the code that processes the map items and it cannot be easily removed.

Sure, but we should take care of them (eg: during the refactoring to port these components to use MetadataStore instead of direct ZK access). Also, I was suggesting that we could automatically switch to individual locks (instead of the whole-section lock) if the processing is taking more than X amount of time.

> It reduces the duration of the locks in the case that there are no writes. When there are writes, the `.forEach` processing will hold the read lock until all items have been processed. This is one of the reasons why the problems are rare with a small number of items or a low amount of write operations. However, such a solution won't completely prevent deadlocks, if the model is such that it could lead to a deadlock that isn't detected or resolved after some timeout. A deadlock will eventually happen even if the probability is low. Currently the Pulsar broker locks up and doesn't resume operations if there's a deadlock. That's what makes the problem severe. Since the locks could be on specific sections of a `ConcurrentOpenHashMap`, it leads to partial unavailability which is hard to detect.

While it's clear that a slow scan will provoke a spike in write latency on the map, I'm still not clear on the exact deadlock case here. Typically, the way to get deadlocked on the map is if, while doing a scan, you try to access it again from a different thread (or even the same thread, since the lock is non-reentrant) and that blocks the processing function.

> I see now, this is a case of deadlock, we are issuing a blocking call to ZooKeeper into a forEach, and the ZooKeeper library handles every operation in only one single thread. So the chain is pretty hard to understand

@eolivelli Even if we do a blocking ZK call (which we shouldn't), that doesn't necessarily mean that there will be a deadlock. It will only be the case if the ZK callback thread is blocked trying to read/write on the same map. Do you have the other threads' stacks?
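To make the scenario I'm asking about concrete, here is a minimal standalone sketch of the only deadlock shape I can see (all names are hypothetical, not actual Pulsar code; a single `StampedLock` stands in for one map section, and a single-thread executor stands in for the ZK callback thread): the scan blocks on a result that can only be delivered by the callback thread, while that same thread is parked on a write to the same section.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.StampedLock;

public class ScanDeadlockSketch {
    // Stands in for the lock of one ConcurrentOpenHashMap section.
    static final StampedLock sectionLock = new StampedLock();
    // Stands in for the single ZK callback (event) thread.
    static final ExecutorService zkCallbackThread = Executors.newSingleThreadExecutor();

    public static void main(String[] args) {
        long stamp = sectionLock.readLock(); // forEach() holds this while processing
        try {
            CompletableFuture<byte[]> f = new CompletableFuture<>();
            // An earlier callback queued on the ZK thread writes to the same section...
            zkCallbackThread.submit(() -> {
                long w = sectionLock.writeLock(); // parks: main holds the read lock
                sectionLock.unlockWrite(w);
            });
            // ...and only after that would the ZK thread complete our result.
            zkCallbackThread.submit(() -> f.complete(new byte[0]));
            f.join(); // never returns: main waits for the ZK thread, which waits for main
        } finally {
            sectionLock.unlockRead(stamp);
        }
    }
}
```

If the ZK callback thread never touches the map (or the processing function never blocks on it), the scan only delays writers; it's the combination of the two that produces the lock-up.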
To conclude, in my opinion the easiest, lowest-risk, and lowest-overhead change here would be to switch `forEach()` to do the stamped + read-lock combination for each entry, instead of the current per-section approach. That would ensure the processing function is always called without holding a lock on the map. It can lead to spurious inconsistencies during re-hashing, but I don't think that will be a problem, and anyway it could be addressed by letting the specific callers that need it request the current per-section behavior.
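A rough sketch of what I have in mind for one section (assumed, simplified internals: a `StampedLock` per section and an open-addressed `table` with keys at even slots and values at odd slots; deleted-key sentinels and re-hashing are ignored — re-hashing can swap `table` mid-scan, which is the source of the spurious inconsistencies mentioned above):

```java
import java.util.concurrent.locks.StampedLock;
import java.util.function.BiConsumer;

public class PerEntryForEachSketch {
    @SuppressWarnings("unchecked")
    static <K, V> void forEachInSection(Object[] table, StampedLock lock,
                                        BiConsumer<K, V> processor) {
        for (int bucket = 0; bucket < table.length; bucket += 2) {
            // Optimistic read of a single key/value slot.
            long stamp = lock.tryOptimisticRead();
            Object key = table[bucket];
            Object value = table[bucket + 1];
            if (!lock.validate(stamp)) {
                // A writer raced with us: re-read just this slot under the read lock.
                stamp = lock.readLock();
                try {
                    key = table[bucket];
                    value = table[bucket + 1];
                } finally {
                    lock.unlockRead(stamp);
                }
            }
            if (key != null && value != null) {
                // The lock is already released here, so user code never
                // blocks writers or the rest of the scan.
                processor.accept((K) key, (V) value);
            }
        }
    }
}
```

The key difference from the current implementation is that the read lock is released before the processor runs, so a slow (or blocking) processing function can no longer hold a section hostage.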
