[ https://issues.apache.org/jira/browse/HDFS-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906483#comment-16906483 ]
Daryn Sharp commented on HDFS-14277:
------------------------------------

Due to internal branch conflicts I had to take an interrupt and review the observer patch to verify that "it won't hurt you if you don't use it". Just reviewing the server showed that a server-side {{AlignmentContext}} is used regardless of whether the feature is enabled. Red flag. I could easily tell that {{GlobalStateIdContext#isCoordinatedCall}} would be a performance bottleneck, due to the expensive {{Class.getCanonicalName}} call and the hash lookup of the method, for a feature that isn't even being used. Another red flag. I was dismayed to discover this has been a known issue for six months, yet it was backported to branch-2.

The worst offender is {{GlobalStateIdContext#receiveRequestState}}, which calls:

{code}
/**
 * This method holds a lock of FSEditLog to get the correct value.
 * This method must not be used for metrics.
 */
public long getCorrectLastAppliedOrWrittenTxId() {
{code}

*The IPC readers are SYNCHRONIZING on the edit log even when the feature is not enabled.*

> [SBN read] Observer benchmark results
> -------------------------------------
>
> Key: HDFS-14277
> URL: https://issues.apache.org/jira/browse/HDFS-14277
> Project: Hadoop HDFS
> Issue Type: Task
> Components: ha, namenode
> Affects Versions: 3.3.0
> Environment: Hardware: 4-node cluster; each node has 4 cores, Xeon 2.5 GHz, 25 GB memory.
> Software: CentOS 7.4, CDH 6.0 + Consistent Reads from Standby, Kerberos, SSL, RPC encryption + Data Transfer Encryption, Cloudera Navigator.
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Priority: Major
> Attachments: Observer profiler.png, Screen Shot 2019-02-14 at 11.50.37 AM.png, observer RPC queue processing time.png
>
> Ran a few benchmarks and a profiler (VisualVM) today on an Observer-enabled cluster, and would like to share the results with the community. The cluster has 1 Observer node.
>
> h2. NNThroughputBenchmark
> Generate 1 million files and send fileStatus RPCs.
> {code:java}
> hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs <namenode> -op fileStatus -threads 100 -files 1000000 -useExisting -keepResults
> {code}
>
> h3. Kerberos, SSL, RPC encryption, Data Transfer Encryption enabled:
> ||Node||fileStatus (ops per sec)||
> |Active NameNode|4865|
> |Observer|3996|
>
> h3. Kerberos, SSL:
> ||Node||fileStatus (ops per sec)||
> |Active NameNode|7078|
> |Observer|6459|
>
> Observations:
> * Due to the edit-tailing overhead, the Observer node consumes 30% CPU utilization even when the cluster is idle.
> * While the Active NN has less than 1 ms RPC processing time, the Observer node has > 5 ms RPC processing time. I am still looking for the source of the longer processing time, which may be the cause of the performance degradation relative to the Active NN. Note the cluster has Cloudera Navigator installed, which adds additional overhead to RPC processing time.
> * {{GlobalStateIdContext#isCoordinatedCall()}} shows up as one of the top hotspots in the profiler.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
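To make the {{isCoordinatedCall}} hotspot concrete, here is a minimal standalone sketch of the pattern being flagged: calling {{Class.getCanonicalName()}} on every RPC versus computing it once in a static field so the per-call check is a plain string compare plus a set lookup. This is an illustration of the caching idea only, not the actual Hadoop code; {{CoordinatedCallCheck}} and its nested {{ClientProtocol}} stand-in are hypothetical names.

```java
import java.util.Set;

// Illustrative sketch (hypothetical names, not the actual Hadoop code) of the
// per-call cost flagged in the comment above.
class CoordinatedCallCheck {

  // Hypothetical stand-in for the real ClientProtocol interface.
  public interface ClientProtocol {}

  // The flagged pattern: getCanonicalName() is recomputed on every call,
  // even when the feature is unused.
  static boolean isCoordinatedCallSlow(String protocolName, String method,
                                       Set<String> coordinatedMethods) {
    return protocolName.equals(ClientProtocol.class.getCanonicalName())
        && coordinatedMethods.contains(method);
  }

  // Computed once at class load; the per-call check avoids the repeated
  // reflective name computation.
  private static final String CLIENT_PROTOCOL_NAME =
      ClientProtocol.class.getCanonicalName();

  static boolean isCoordinatedCallFast(String protocolName, String method,
                                       Set<String> coordinatedMethods) {
    return protocolName.equals(CLIENT_PROTOCOL_NAME)
        && coordinatedMethods.contains(method);
  }
}
```

Both variants return the same answers; the cached form simply moves the reflective call out of the IPC hot path.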
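The synchronization concern can be sketched the same way: if every IPC reader must take the edit-log lock just to read the last applied transaction id, readers serialize on that lock. Publishing the id through an {{AtomicLong}} lets readers poll it without locking. This is a generic lock-free-read pattern under assumed names ({{TxIdTracker}}, {{advanceTo}}), not the actual Hadoop fix.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch (hypothetical names): publish the last applied txid
// via an AtomicLong so reader threads never touch the edit-log lock.
class TxIdTracker {
  private final AtomicLong lastAppliedTxId = new AtomicLong(0L);

  // Writer path: invoked by the thread applying edits, which already holds
  // whatever lock the edit-applying code needs.
  void advanceTo(long txId) {
    lastAppliedTxId.set(txId);
  }

  // Reader path: lock-free, safe to call from IPC reader threads.
  long getLastAppliedTxId() {
    return lastAppliedTxId.get();
  }
}
```

The trade-off is that a reader may observe a slightly stale id, which is acceptable for a check that only gates whether a call must wait for state to catch up.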