Andrew Onischuk created AMBARI-24166:
----------------------------------------

             Summary: Metric Collector goes down after HDFS restart post EU
                 Key: AMBARI-24166
                 URL: https://issues.apache.org/jira/browse/AMBARI-24166
             Project: Ambari
          Issue Type: Bug
            Reporter: Andrew Onischuk
            Assignee: Andrew Onischuk
             Fix For: 2.7.0
         Attachments: AMBARI-24166.patch


**STR**

  1. Deploy a cluster with Ambari version 2.6.1.5-3 and HDP version 2.6.1.0-129
  2. Upgrade Ambari to Target Version: 2.7.0.0-709
  3. Upgrade AMS and Smartsense (keeping them stopped)
  4. Perform EU to HDP-3.0 and let it complete
  5. Restart HDFS
  6. Observe state of Metrics Collectors (AMS is configured in distributed mode)

**Result**  
Both metrics collectors are down (auto start is enabled for Metrics Collector)

From logs:

    2018-06-13 16:45:05,620 ERROR org.apache.ambari.metrics.core.timeline.discovery.TimelineMetricMetadataManager: TimelineMetricMetadataKey is null for : [-8, 31, -72, 32, 88, -8, -51, -88, -104, 12, -123, 99, 55, -90, 45, -12, 115, 0, -6, 13]
    2018-06-13 16:45:05,622 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
    java.lang.NullPointerException
            at org.apache.ambari.metrics.core.timeline.aggregators.TimelineMetricReadHelper.getTimelineMetricCommonsFromResultSet(TimelineMetricReadHelper.java:116)
            at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getLastTimelineMetricFromResultSet(PhoenixHBaseAccessor.java:446)
            at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getLatestMetricRecords(PhoenixHBaseAccessor.java:1134)
            at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getMetricRecords(PhoenixHBaseAccessor.java:953)
            at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.getTimelineMetrics(HBaseTimelineMetricsService.java:288)
            at org.apache.ambari.metrics.webapp.TimelineWebServices.getTimelineMetrics(TimelineWebServices.java:261)
            at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
    
    2018-06-13 16:45:07,887 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=ctr-e138-1518143905142-361872-01-000005.hwx.site:2181,ctr-e138-1518143905142-361872-01-000006.hwx.site:2181,ctr-e138-1518143905142-361872-01-000003.hwx.site:2181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$13/572967831@60474c94
    2018-06-13 16:45:07,889 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
    2018-06-13 16:45:07,891 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
    2018-06-13 16:45:07,891 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181, initiating session
    2018-06-13 16:45:07,894 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181, sessionid = 0x363f94c8d6d0059, negotiated timeout = 90000
    2018-06-13 16:45:11,938 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=6, retries=6, started=4153 ms ago, cancelled=false, msg=Call to ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320, details=row 'SYSTEM.CATALOG' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ctr-e138-1518143905142-361872-01-000007.hwx.site,61320,1528896330963, seqNum=-1
    2018-06-13 16:45:15,954 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8169 ms ago, cancelled=false, msg=Call to ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320, details=row 'SYSTEM.CATALOG' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ctr-e138-1518143905142-361872-01-000007.hwx.site,61320,1528896330963, seqNum=-1
    
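Side note for anyone triaging from the log above: the key in the ERROR line is printed as signed Java bytes, which is awkward to grep for or to compare against row keys shown in hex. A minimal sketch of a converter follows. The class `MetricUuidHex` and its `toHex` method are hypothetical names made up for illustration; they are not part of Ambari Metrics and are not taken from the attached patch.

```java
import java.util.Arrays;

/** Hypothetical triage helper; not part of Ambari Metrics or the attached patch. */
final class MetricUuidHex {

    /** Renders each signed Java byte as two lowercase hex digits. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so that negative values such as -8 become "f8".
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The 20 bytes copied from the "TimelineMetricMetadataKey is null for" line.
        byte[] key = {-8, 31, -72, 32, 88, -8, -51, -88, -104, 12,
                      -123, 99, 55, -90, 45, -12, 115, 0, -6, 13};
        System.out.println(Arrays.toString(key)); // same signed form as in the log
        System.out.println(toHex(key));           // hex form of the same bytes
    }
}
```

This only changes how the key is displayed. It makes no assumption about how the collector builds or stores the key.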





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)