Wei-Chiu Chuang created HADOOP-16284:
----------------------------------------

             Summary: KMS Cache Miss Storm
                 Key: HADOOP-16284
                 URL: https://issues.apache.org/jira/browse/HADOOP-16284
             Project: Hadoop Common
          Issue Type: Bug
          Components: kms
    Affects Versions: 2.6.0
         Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server
            Reporter: Wei-Chiu Chuang


We recently stumbled upon a performance issue with KMS, where it occasionally 
exhibited a "No content to map" error (this cluster ran an old version that 
doesn't have HADOOP-14841) and jobs crashed. *We bumped the number of KMSes 
from 2 to 4, and the situation got even worse.*

Later, we realized this cluster had a few hundred encryption zones and a few 
hundred encryption keys. This is pretty unusual because most of the deployments 
known to us have at most a dozen keys. So in terms of number of keys, this 
cluster is 1-2 orders of magnitude larger than any other we know of.

The high number of encryption keys increases the likelihood of key cache misses 
in KMS. In Cloudera's setup, each cache miss forces KMS to sync with its 
backend, the Cloudera Keytrustee Server. Plus the high number of KMSes 
amplifies the latency, effectively causing a [cache miss 
storm|https://en.wikipedia.org/wiki/Cache_stampede].
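
To make the amplification concrete, here is a rough back-of-the-envelope model, not KMS code. It assumes each KMS instance keeps its own independent key cache, that requests for every key eventually land on every instance, and that every miss costs one backend sync. The function name and the rotating load balancer are hypothetical and exist only for this illustration.

```python
def backend_syncs_to_warm(num_keys, num_kms, rounds):
    """Count backend syncs in a toy model where every KMS instance has
    its own cold key cache and each cache miss triggers one sync."""
    caches = [set() for _ in range(num_kms)]  # one cache per KMS instance
    syncs = 0
    for r in range(rounds):
        for key in range(num_keys):
            # Hypothetical rotating load balancer: over num_kms rounds,
            # each key is sent to each instance exactly once.
            instance = (key + r) % num_kms
            if key not in caches[instance]:
                caches[instance].add(key)
                syncs += 1  # miss: this instance must sync with the backend
    return syncs


if __name__ == "__main__":
    for num_kms in (2, 4):
        print(num_kms, "KMSes:", backend_syncs_to_warm(300, num_kms, rounds=num_kms))
```

Under these assumptions the number of syncs grows as keys x KMS instances: 300 keys need 600 syncs with 2 KMSes and 1200 with 4, whereas a dozen keys on 2 KMSes need only 24. The model ignores cache expiry and concurrent requests, both of which would make the real picture worse.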

We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will 
surely come up with a better name later - and discovered a scalability bug in 
CKTS (Cloudera Keytrustee Server). The fix was verified again with the tool.

Filing this bug so the community is aware of this issue. I don't have a 
solution for now in KMS. But we want to address this scalability problem in the 
near future because we are seeing use cases that require thousands of 
encryption keys.
----
On a side note, 4 KMSes don't work well without HADOOP-14445 (and subsequent 
fixes). A MapReduce job acquires at most 3 KMS delegation tokens, and so in 
cases such as distcp, it would fail to reach the 4th KMS on the remote 
cluster. I imagine similar issues exist for other execution engines, but I 
didn't test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
