[ 
https://issues.apache.org/jira/browse/HDFS-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906483#comment-16906483
 ] 

Daryn Sharp commented on HDFS-14277:
------------------------------------

Because of internal branch conflicts, I had to interrupt other work to review 
the observer patch and verify that "it won't hurt you if you don't use it".  
Reviewing just the server showed that a server-side {{AlignmentContext}} is 
used regardless of whether the feature is enabled.  Red flag.

I could easily tell that {{GlobalStateIdContext#isCoordinatedCall}} would be a 
performance bottleneck because of the expensive {{Class.getCanonicalName}} 
call and the hash lookup of the method, all for a feature that isn't even 
being used.  Another red flag.  I was dismayed to discover that this has been 
a known issue for 6 months, yet it was backported to branch-2.
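
To illustrate the concern, here is a minimal sketch (not the actual Hadoop code and not a proposed patch) of a check that resolves the canonical name once and returns early when the feature is off. The class {{CoordinatedCallSketch}}, the nested {{DemoProtocol}} interface, the {{observerReadsEnabled}} flag and the method name {{someCoordinatedOp}} are all hypothetical names invented for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names come from the Hadoop code base.
final class CoordinatedCallSketch {
    /** Placeholder standing in for an RPC protocol interface. */
    public interface DemoProtocol {}

    // Resolved once at class-load time instead of on every RPC.
    private static final String PROTOCOL_NAME =
        DemoProtocol.class.getCanonicalName();

    // Placeholder set of coordinated method names (assumption).
    private static final Set<String> COORDINATED_METHODS =
        new HashSet<>(Arrays.asList("someCoordinatedOp"));

    // Assumed feature switch; how it would be configured is out of scope.
    private final boolean observerReadsEnabled;

    public CoordinatedCallSketch(boolean observerReadsEnabled) {
        this.observerReadsEnabled = observerReadsEnabled;
    }

    public boolean isCoordinatedCall(String protocolName, String methodName) {
        // Feature off: skip the string comparison and the hash lookup.
        if (!observerReadsEnabled) {
            return false;
        }
        return PROTOCOL_NAME.equals(protocolName)
            && COORDINATED_METHODS.contains(methodName);
    }

    public static void main(String[] args) {
        String proto = DemoProtocol.class.getCanonicalName();
        CoordinatedCallSketch off = new CoordinatedCallSketch(false);
        CoordinatedCallSketch on = new CoordinatedCallSketch(true);
        System.out.println(off.isCoordinatedCall(proto, "someCoordinatedOp"));
        System.out.println(on.isCoordinatedCall(proto, "someCoordinatedOp"));
    }
}
```

With the flag off, the per-call cost in this sketch is a single boolean test; with it on, the cached name still avoids repeating the canonical-name resolution on every call.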

The worst offender is in {{GlobalStateIdContext#receiveRequestState}} which 
calls:
{code}
  /**
   * This method holds a lock of FSEditLog to get the correct value.
   * This method must not be used for metrics.
   */
  public long getCorrectLastAppliedOrWrittenTxId() {
{code}
*The IPC readers are SYNCHRONIZING on the edit log even when the feature is not 
enabled*.
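
As a rough illustration of why this matters, the sketch below contrasts a getter that takes the log's monitor with a read of an atomically published counter, and shows a reader path that never touches the log when the feature is off. It is an assumption-laden toy, not the {{FSEditLog}} implementation: {{EditLogSketch}}, {{logEdit}}, {{getTxIdLocked}}, {{peekTxId}} and {{receiveRequestState}} as written here are hypothetical, and a lock-free read may not give the same guarantee as the locked call quoted above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a toy stand-in for an edit log, not FSEditLog itself.
final class EditLogSketch {
    private final AtomicLong lastWrittenTxId = new AtomicLong();
    // Assumed feature switch, invented for this example.
    private final boolean observerReadsEnabled;

    public EditLogSketch(boolean observerReadsEnabled) {
        this.observerReadsEnabled = observerReadsEnabled;
    }

    /** Writer path: serialized on the log's monitor. */
    public synchronized long logEdit() {
        return lastWrittenTxId.incrementAndGet();
    }

    /** Locked read: every caller contends with writers for the monitor. */
    public synchronized long getTxIdLocked() {
        return lastWrittenTxId.get();
    }

    /** Lock-free read: never blocks, but may trail an in-flight write. */
    public long peekTxId() {
        return lastWrittenTxId.get();
    }

    /** Reader path: only consult the log when the feature is enabled. */
    public long receiveRequestState(long clientStateId) {
        if (!observerReadsEnabled) {
            return clientStateId; // no lock taken, log not touched
        }
        return Math.max(clientStateId, getTxIdLocked());
    }

    public static void main(String[] args) {
        EditLogSketch log = new EditLogSketch(false);
        log.logEdit();
        log.logEdit();
        System.out.println(log.peekTxId());
        System.out.println(log.receiveRequestState(1L));
    }
}
```

In this toy, readers with the feature disabled never enter a {{synchronized}} method, so they cannot queue behind edit-log writers.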

> [SBN read] Observer benchmark results
> -------------------------------------
>
>                 Key: HDFS-14277
>                 URL: https://issues.apache.org/jira/browse/HDFS-14277
>             Project: Hadoop HDFS
>          Issue Type: Task
>          Components: ha, namenode
>    Affects Versions: 3.3.0
>         Environment: Hardware: 4-node cluster, each node has 4 cores, Xeon 
> 2.5GHz, 25GB memory.
> Software: CentOS 7.4, CDH 6.0 + Consistent Reads from Standby, Kerberos, SSL, 
> RPC encryption + Data Transfer Encryption, Cloudera Navigator.
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>         Attachments: Observer profiler.png, Screen Shot 2019-02-14 at 
> 11.50.37 AM.png, observer RPC queue processing time.png
>
>
> Ran a few benchmarks and profiler (VisualVM) today on an Observer-enabled 
> cluster. Would like to share the results with the community. The cluster has 
> 1 Observer node.
> h2. NNThroughputBenchmark
> Generate 1 million files and send fileStatus RPCs.
> {code:java}
> hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs 
> <namenode>  -op fileStatus -threads 100 -files 1000000 -useExisting 
> -keepResults
> {code}
> h3. Kerberos, SSL, RPC encryption, Data Transfer Encryption enabled:
> ||Node||fileStatus (Ops per sec)||
> |Active NameNode|4865|
> |Observer|3996|
> h3. Kerberos, SSL:
> ||Node||fileStatus (Ops per sec)||
> |Active NameNode|7078|
> |Observer|6459|
> Observation:
>  * Due to the edit tailing overhead, the Observer node consumes 30% CPU 
> even if the cluster is idle.
>  * While Active NN has less than 1ms RPC processing time, Observer node has 
> more than 5ms RPC processing time. I am still looking for the source of the longer 
> processing time. The longer RPC processing time may be the cause for the 
> performance degradation compared to that of Active NN. Note the cluster has 
> Cloudera Navigator installed which adds additional overhead to RPC processing 
> time.
>  * {{GlobalStateIdContext#isCoordinatedCall()}} pops up as one of the top 
> hotspots in the profiler. 
>  


