I have seen this when there are large volumes of members in the LDAP group or complex XML generated in LDAP.
sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872

On Oct 27, 2014 2:36 PM, "Chris Li (JIRA)" <[email protected]> wrote:

> Chris Li created HADOOP-11238:
> ---------------------------------
>
> Summary: Group cache expiry causes namenode slowdown
> Key: HADOOP-11238
> URL: https://issues.apache.org/jira/browse/HADOOP-11238
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.5.1
> Reporter: Chris Li
> Priority: Minor
>
>
> Our namenode pauses for 12-60 seconds every hour or so. During these
> pauses, no new requests can come in.
>
> Around the time of pauses, we have log messages such as:
> 2014-10-22 13:24:22,688 WARN org.apache.hadoop.security.Groups: Potential
> performance problem: getGroups(user=xxxxx) took 34507 milliseconds.
>
> The current theory is:
> 1. Groups has a cache that is refreshed periodically.
> 2. When the cache is cleared, we have a thundering herd effect which
> overwhelms our LDAP servers (we are using ShellBasedUnixGroupsMapping with
> sssd, how this happens has yet to be established)
> 3. group resolution queries begin to take longer, I've observed it taking
> 1.2 seconds instead of the usual 0.01-0.03 seconds when measuring in the
> shell `time groups myself`
> 4. If there is mutual exclusion somewhere along this path, a 1 second
> pause could lead to a 60 second pause as all the threads compete for the
> resource. The exact cause hasn't been established
>
> Potential solutions include:
> 1. Increasing group cache time, which will make the issue less frequent
> 2. Rolling evictions of the cache so we prevent the large spike in LDAP
> queries
>
>
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
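
For what it's worth, the "rolling evictions" idea in the quoted report could look roughly like the sketch below. This is only an illustration in Python, not Hadoop's actual Groups implementation: the class name JitteredGroupCache, its parameters, and the lookup callback are all hypothetical. Each entry gets its own slightly randomized expiry time, so entries loaded together do not all expire (and hit LDAP) at the same moment. It is not thread-safe as written.

```python
import random
import time


class JitteredGroupCache:
    """Hypothetical user -> groups cache whose entries expire at staggered times."""

    def __init__(self, lookup, ttl_secs=300.0, jitter_fraction=0.2,
                 clock=time.monotonic, rng=None):
        self._lookup = lookup            # slow backend call, e.g. an LDAP/sssd query
        self._ttl = ttl_secs
        self._jitter = jitter_fraction
        self._clock = clock              # injectable so the cache can be tested
        self._rng = rng or random.Random()
        self._entries = {}               # user -> (expires_at, groups)

    def _new_expiry(self):
        # Spread expiries over [ttl * (1 - jitter), ttl * (1 + jitter)].
        factor = 1.0 + self._rng.uniform(-self._jitter, self._jitter)
        return self._clock() + self._ttl * factor

    def get_groups(self, user):
        entry = self._entries.get(user)
        if entry is not None and entry[0] > self._clock():
            return entry[1]              # still fresh, no backend call
        groups = self._lookup(user)      # expired or missing: refresh this user only
        self._entries[user] = (self._new_expiry(), groups)
        return groups
```

With ttl_secs=300 and jitter_fraction=0.2, an entry lives somewhere between 240 and 360 seconds, so refreshes for a batch of users are spread over a two-minute window instead of landing together.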
