[ https://issues.apache.org/jira/browse/AMBARI-24166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Onischuk updated AMBARI-24166:
-------------------------------------
    Status: Patch Available  (was: Open)

> Metric Collector goes down after HDFS restart post EU
> -----------------------------------------------------
>
>                 Key: AMBARI-24166
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24166
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Andrew Onischuk
>            Assignee: Andrew Onischuk
>            Priority: Major
>             Fix For: 2.7.0
>
>         Attachments: AMBARI-24166.patch
>
>
> **STR**
> 1. Deployed cluster with Ambari version: 2.6.1.5-3 and HDP version: 2.6.1.0-129
> 2. Upgrade Ambari to Target Version: 2.7.0.0-709
> 3. Upgrade AMS and Smartsense (keeping them stopped)
> 4. Perform EU to HDP-3.0 and let it complete
> 5. Restart HDFS
> 6. Observe state of Metrics Collectors (AMS is configured in distributed mode)
>
> **Result**
> Both metrics collectors are down (auto start is enabled for Metrics Collector)
>
> From logs:
>
> 2018-06-13 16:45:05,620 ERROR org.apache.ambari.metrics.core.timeline.discovery.TimelineMetricMetadataManager: TimelineMetricMetadataKey is null for : [-8, 31, -72, 32, 88, -8, -51, -88, -104, 12, -123, 99, 55, -90, 45, -12, 115, 0, -6, 13]
> 2018-06-13 16:45:05,622 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
>     at org.apache.ambari.metrics.core.timeline.aggregators.TimelineMetricReadHelper.getTimelineMetricCommonsFromResultSet(TimelineMetricReadHelper.java:116)
>     at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getLastTimelineMetricFromResultSet(PhoenixHBaseAccessor.java:446)
>     at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getLatestMetricRecords(PhoenixHBaseAccessor.java:1134)
>     at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getMetricRecords(PhoenixHBaseAccessor.java:953)
>     at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.getTimelineMetrics(HBaseTimelineMetricsService.java:288)
>     at org.apache.ambari.metrics.webapp.TimelineWebServices.getTimelineMetrics(TimelineWebServices.java:261)
>     at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>
> 2018-06-13 16:45:07,887 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=ctr-e138-1518143905142-361872-01-000005.hwx.site:2181,ctr-e138-1518143905142-361872-01-000006.hwx.site:2181,ctr-e138-1518143905142-361872-01-000003.hwx.site:2181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$13/572967831@60474c94
> 2018-06-13 16:45:07,889 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
> 2018-06-13 16:45:07,891 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 2018-06-13 16:45:07,891 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181, initiating session
> 2018-06-13 16:45:07,894 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181, sessionid = 0x363f94c8d6d0059, negotiated timeout = 90000
> 2018-06-13 16:45:11,938 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=6, retries=6, started=4153 ms ago, cancelled=false, msg=Call to ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320, details=row 'SYSTEM.CATALOG' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ctr-e138-1518143905142-361872-01-000007.hwx.site,61320,1528896330963, seqNum=-1
> 2018-06-13 16:45:15,954 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8169 ms ago, cancelled=false, msg=Call to ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320, details=row 'SYSTEM.CATALOG' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ctr-e138-1518143905142-361872-01-000007.hwx.site,61320,1528896330963, seqNum=-1
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)