[jira] [Updated] (IMPALA-13132) Ozone jobs see intermittent termination of Ozone manager / HMS fails to start

jiangwei (Jira) Mon, 04 Nov 2024 02:33:07 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


jiangwei updated IMPALA-13132:
------------------------------
    Description: 
Ozone jobs load data/metadata snapshots during dataload, then restart the 
cluster. On this restart, the HMS sometimes fails to come up:
{noformat}
16:04:13  --> Starting Hive Metastore Service
16:04:13 No handlers could be found for logger "thrift.transport.TSocket"
16:04:14 Waiting for the Metastore at localhost:9083...
...
16:09:14 Waiting for the Metastore at localhost:9083...
16:09:14 Metastore service failed to start within 300.0 seconds.{noformat}
In the metastore logs, we see messages like this:
{noformat}
2024-06-04T08:37:06,425  INFO [main] retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
hostname/127.0.0.1 to localhost:9862 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
$Proxy31.submitRequest over nodeId=null,nodeAddress=localhost:9862 after 1 
failover attempts. Trying to failover after sleeping for 4000ms.{noformat}
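The HMS is trying to talk to the Ozone manager. The Ozone cluster was back up 
and running before the attempt to start the HMS, but the Ozone manager then 
received a signal and shut down: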
{noformat}
24/06/04 08:36:37 ERROR om.OzoneManagerStarter: RECEIVED SIGNAL 15: SIGTERM
24/06/04 08:36:37 INFO om.OzoneManagerStarter: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down OzoneManager at hostname/127.0.0.1
************************************************************/
24/06/04 08:36:37 INFO om.OzoneManager: om1[localhost:9862]: Stopping Ozone 
Manager{noformat}

  was:
Ozone jobs load data/metadata snapshots during dataload, then restart the 
cluster. On this restart, the HMS sometimes fails to come up:
{noformat}
16:04:13  --> Starting Hive Metastore Service
16:04:13 No handlers could be found for logger "thrift.transport.TSocket"
16:04:14 Waiting for the Metastore at localhost:9083...
...
16:09:14 Waiting for the Metastore at localhost:9083...
16:09:14 Metastore service failed to start within 300.0 seconds.{noformat}
In the metastore logs, we see messages like this:
{noformat}
2024-06-04T08:37:06,425  INFO [main] retry.RetryInvocationHandler: 
com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
hostname/127.0.0.1 to localhost:9862 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
$Proxy31.submitRequest over nodeId=null,nodeAddress=localhost:9862 after 1 
failover attempts. Trying to failover after sleeping for 4000ms.{noformat}
The HMS is trying to talk to the Ozone manager. The Ozone cluster was back up 
and running before the attempt to start the HMS, but the Ozone manager then 
received a signal and shut down:
{noformat}
24/06/04 08:36:37 ERROR om.OzoneManagerStarter: RECEIVED SIGNAL 15: SIGTERM
24/06/04 08:36:37 INFO om.OzoneManagerStarter: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down OzoneManager at hostname/127.0.0.1
************************************************************/
24/06/04 08:36:37 INFO om.OzoneManager: om1[localhost:9862]: Stopping Ozone 
Manager{noformat}


> Ozone jobs see intermittent termination of Ozone manager / HMS fails to start
> -----------------------------------------------------------------------------
>
>                 Key: IMPALA-13132
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13132
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 4.5.0
>            Reporter: Joe McDonnell
>            Assignee: Michael Smith
>            Priority: Critical
>              Labels: broken-build, flaky
>             Fix For: Impala 4.5.0
>
>
> Ozone jobs load data/metadata snapshots during dataload, then restart the 
> cluster. On this restart, the HMS sometimes fails to come up:
> {noformat}
> 16:04:13  --> Starting Hive Metastore Service
> 16:04:13 No handlers could be found for logger "thrift.transport.TSocket"
> 16:04:14 Waiting for the Metastore at localhost:9083...
> ...
> 16:09:14 Waiting for the Metastore at localhost:9083...
> 16:09:14 Metastore service failed to start within 300.0 seconds.{noformat}
> In the metastore logs, we see messages like this:
> {noformat}
> 2024-06-04T08:37:06,425  INFO [main] retry.RetryInvocationHandler: 
> com.google.protobuf.ServiceException: java.net.ConnectException: Call From 
> hostname/127.0.0.1 to localhost:9862 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> $Proxy31.submitRequest over nodeId=null,nodeAddress=localhost:9862 after 1 
> failover attempts. Trying to failover after sleeping for 4000ms.{noformat}
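> The HMS is trying to talk to the Ozone manager. The Ozone cluster was back up 
> and running before the attempt to start the HMS, but the Ozone manager then 
> received a signal and shut down: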
> {noformat}
> 24/06/04 08:36:37 ERROR om.OzoneManagerStarter: RECEIVED SIGNAL 15: SIGTERM
> 24/06/04 08:36:37 INFO om.OzoneManagerStarter: SHUTDOWN_MSG: 
> /************************************************************
> SHUTDOWN_MSG: Shutting down OzoneManager at hostname/127.0.0.1
> ************************************************************/
> 24/06/04 08:36:37 INFO om.OzoneManager: om1[localhost:9862]: Stopping Ozone 
> Manager{noformat}
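
The symptom in the logs above (connection refused on localhost:9862 while the 
HMS start waits on localhost:9083) can be checked with a simple TCP probe. The 
following is a minimal sketch, not the code Impala's scripts use: the helper 
name wait_for_port and its parameters are hypothetical, and only the ports and 
the 300-second limit are taken from the logs.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=300.0, interval_s=1.0):
    """Hypothetical helper: return True once a TCP connect to host:port
    succeeds, or False if no connect succeeds before timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only proves something is listening;
            # it says nothing about whether the service is healthy.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Covers "Connection refused" and connect timeouts.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Port 9862 is the Ozone manager address seen in the metastore log.
    if not wait_for_port("localhost", 9862, timeout_s=30.0):
        print("Ozone manager is not accepting connections on localhost:9862")
```

Running such a probe immediately before starting the HMS would only narrow 
down the timing. It would not explain which process sent SIGTERM to the Ozone 
manager.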



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

