I found the problem with the global lock some time ago and, I believe, mentioned it on the Jini-Users list. After meeting no real interest in solving the problem, I made changes myself to use a finer-grained locking strategy, and that does tremendously reduce the contention at that point. It allows non-broken class loading to go on when a class loader is slow to respond or its DNS is slow to respond.
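To illustrate the general idea of finer-grained locking, here is a minimal sketch. It is not the actual change made to PreferredClassProvider; the class `FineGrainedLoaderCache`, its `Entry` type, and the choice to key only on a codebase string are all assumptions made for illustration. The shared lock is held just long enough to find or create a per-codebase entry, and the potentially slow class loader construction happens under that entry's own lock.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the PreferredClassProvider code.
final class FineGrainedLoaderCache {

    // One entry per codebase; the entry's monitor is the fine-grained lock.
    private static final class Entry {
        ClassLoader loader; // guarded by this Entry's monitor
    }

    // Guarded by its own monitor, which plays the role of the global lock.
    private final Map<String, Entry> entries = new HashMap<>();

    ClassLoader lookupLoader(String codebase, ClassLoader parent) {
        Entry entry;
        // Global lock: held only for a quick map lookup, never for I/O.
        synchronized (entries) {
            entry = entries.computeIfAbsent(codebase, k -> new Entry());
        }
        // Fine lock: a slow codebase only blocks callers asking for the
        // same codebase, not every caller in the process.
        synchronized (entry) {
            if (entry.loader == null) {
                entry.loader = new URLClassLoader(new URL[] { toUrl(codebase) }, parent);
            }
            return entry.loader;
        }
    }

    private static URL toUrl(String codebase) {
        try {
            return URI.create(codebase).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad codebase: " + codebase, e);
        }
    }
}
```

With this shape, two threads asking for different codebases only contend for the short map lookup, while two threads asking for the same codebase still get a single shared loader.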
Gregg Wonderly

On Apr 11, 2011, at 9:59 AM, Christopher Dolan wrote:

> I recently found the root cause of a long-standing performance problem
> with Reggie that we've suffered for years. Our djinns may have 10,000
> services registered, so when Reggie boots up cold it gets slammed with
> thousands of TCP requests via LookupLocatorDiscovery,
> JoinManager.register() and ServiceDiscoveryManager.lookup(). In theory,
> this should be supportable because Reggie's read/write priority lock is
> pretty efficient, but two big technical complications have harmed our
> ability to scale:
>
> 1) PreferredClassProvider.lookupLoader() has a global lock. Behind
> that lock, URLClassLoaders are built which may trigger SocketPermission
> checks. That SocketPermission causes a reverse DNS lookup in
> getCanonName() because of the default Sun JRE lib/security/java.policy
> line:
>
>   permission java.net.SocketPermission "localhost:1024-", "listen";
>
> Because PolicyFile.add() prepends, this check is evaluated first even if
> you have local permissions that are more liberal. A handful of clients
> with bad DNS configurations can cause long timeouts that stall the whole
> process, causing eventual OutOfMemoryErrors because requests arrive
> faster than they can be fulfilled.
>
> Possible code solutions (aside from fixing DNS configuration, of
> course):
>
> a) switch PreferredClassProvider to a finer-grained lock (use the global
> lock to look up the fine lock, and only hold the fine lock while
> creating the class loaders)
>
> b) defer some of the class loader construction so the DNS lookups happen
> after the PreferredClassProvider lock is released
>
> c) implement a replacement for SocketPermission and/or
> PermissionCollection which is smarter about the order it checks
> permissions to minimize the number of DNS lookups
>
> 2) When Reggie shuts down and then restarts, it accidentally
> synchronizes all of the remote LookupLocatorDiscovery instances, which
> may restart their polling WakeupManagers at the same time. What we see
> is that several thousand TCP connections are all initiated within a few
> seconds of each other, despite the
> LookupLocatorDiscovery.LocatorReg.sleepTime values. When/if these
> unicast connections succeed, then we see thousands more TCP connections
> from JoinManager hitting Reggie in a giant wave. In VisualVM's
> performance graphs, I see Reggie go from 100 threads to 3000 threads in
> a couple of seconds, for example.
>
> Possible code solutions:
>
> a) add a random nudge to the polling interval in LookupLocatorDiscovery,
> like the unicastDelayRange in the LocatorDiscovery class. This would
> gradually desynchronize the clients
>
> b) likewise for JoinManager, perhaps
>
> These conditions are hard to reproduce in a typical lab, because they
> require large numbers of machines and deliberately misconfigured DNS.
> I'd appreciate any thoughts that others have about Reggie scaling
> issues.
>
> Chris
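
The "random nudge" suggested in 2a could look roughly like the sketch below. It only shows the arithmetic of adding a uniformly random delay to a fixed polling interval; `JitteredPolling`, `nextDelayMillis`, and both parameter names are hypothetical and do not correspond to anything in LookupLocatorDiscovery or JoinManager.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper; not part of LookupLocatorDiscovery or JoinManager.
final class JitteredPolling {

    private JitteredPolling() {}

    // Returns baseMillis plus a uniformly random nudge in
    // [0, jitterRangeMillis], so clients that were synchronized by a
    // restart drift apart over successive polls.
    static long nextDelayMillis(long baseMillis, long jitterRangeMillis) {
        if (baseMillis < 0 || jitterRangeMillis < 0
                || jitterRangeMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("delays out of range");
        }
        // nextLong(bound) is exclusive of bound, hence the + 1.
        long nudge = ThreadLocalRandom.current().nextLong(jitterRangeMillis + 1);
        // addExact throws ArithmeticException instead of silently overflowing.
        return Math.addExact(baseMillis, nudge);
    }
}
```

A caller would compute a fresh delay before each poll, for example `nextDelayMillis(15_000, 5_000)`, rather than reusing one fixed sleep time. The values here are arbitrary.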