[
https://issues.apache.org/jira/browse/IMPALA-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sahil Takiar resolved IMPALA-9610.
----------------------------------
Resolution: Not A Problem
> Crashed statestore causes queries to hang
> -----------------------------------------
>
> Key: IMPALA-9610
> URL: https://issues.apache.org/jira/browse/IMPALA-9610
> Project: IMPALA
> Issue Type: Bug
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Critical
>
> The following steps on master cause a query to hang:
> {code:java}
> ./bin/start-impala-cluster.py
> ps aux | grep -i statestore
> kill -9 8968
> # wait until the impala coordinator has noticed that the statestore has been
> killed (e.g. wait for statestore-subscriber.connected to be False)
> ./bin/impala-shell.sh
> [localhost:21000] default> select count(l_comment) from tpch.lineitem;
> {code}
> The query hangs in the "CREATED" state.
> From the coordinator logs:
> {code:java}
> I0406 16:46:51.677052 10058 statestore-subscriber.cc:300]
> impalad@stakiar-desktop:22000: Connection with statestore lost, entering
> recovery mode
> I0406 16:46:51.677093 10058 statestore-subscriber.cc:304] Trying to
> re-register with statestore, attempt: 1
> I0406 16:46:51.677412 10058 client-cache.cc:82] ReopenClient(): re-creating
> client for localhost:24000
> I0406 16:46:51.677911 10058 thrift-util.cc:96] TSocket::open() connect()
> <Host: localhost Port: 24000>Connection refused
> I0406 16:46:51.678071 10058 thrift-client.cc:78] Couldn't open transport for
> localhost:24000 (connect() failed: Connection refused)
> I0406 16:46:51.678092 10058 thrift-client.cc:94] Unable to connect to
> localhost:24000
> I0406 16:46:51.678114 10058 client-cache.h:387] RPC client failed to connect:
> Couldn't open transport for localhost:24000 (connect() failed: Connection
> refused)
> rpc: N6impala27TRegisterSubscriberResponseE
> I0406 16:46:51.678135 10058 client-cache.h:314] RPC to localhost:24000 failed
> RPC client failed to connect: Couldn't open transport for localhost:24000
> (connect() failed: Connection refused)
> I0406 16:46:51.678151 10058 client-cache.cc:174] Broken Connection, destroy
> client for localhost:24000
> I0406 16:46:54.678651 10058 thrift-util.cc:96] TSocket::open() connect()
> <Host: localhost Port: 24000>Connection refused
> I0406 16:46:54.678774 10058 thrift-client.cc:78] Couldn't open transport for
> localhost:24000 (connect() failed: Connection refused)
> I0406 16:46:54.678794 10058 thrift-client.cc:94] Unable to connect to
> localhost:24000
> ...
> {code}
> The "Unable to connect to localhost:24000" pattern continues indefinitely
> (presumably until the statestore comes back up).
> When the coordinator gets a query from the client, it is able to successfully
> parse / plan it, but hangs when trying to load metadata:
> {code:java}
> I0406 16:47:11.461125 11181 impala-server.cc:1053]
> 3c4fe54fcc25e7db:99c9bd7b00000000] Registered query
> query_id=3c4fe54fcc25e7db:99c9bd7b00000000
> session_id=d8416b340cdf66b6:fc432571b1d658bb
> I0406 16:47:11.521435 11181 Frontend.java:1490]
> 3c4fe54fcc25e7db:99c9bd7b00000000] Analyzing query: select count(l_comment)
> from tpch.lineitem db: default
> I0406 16:47:11.627283 11181 FeSupport.java:317]
> 3c4fe54fcc25e7db:99c9bd7b00000000] Requesting prioritized load of table(s):
> tpch.lineitem
> ...
> I0406 16:47:31.648221 11181 StmtMetadataLoader.java:202]
> 3c4fe54fcc25e7db:99c9bd7b00000000] Waiting for table metadata. Waited for 10
> catalog updates and 20020ms. Tables remaining: [tpch.lineitem]
> ...
> I0406 16:52:31.696449 11181 StmtMetadataLoader.java:202]
> 3c4fe54fcc25e7db:99c9bd7b00000000] Waiting for table metadata. Waited for 160
> catalog updates and 320069ms. Tables remaining: [tpch.lineitem]
> ....
> I0406 16:52:33.697089 11181 StmtMetadataLoader.java:223]
> 3c4fe54fcc25e7db:99c9bd7b00000000] Re-sending prioritized load request.
> Waited for 160 catalog updates and 322069ms.
> I0406 16:52:33.697333 11181 FeSupport.java:317]
> 3c4fe54fcc25e7db:99c9bd7b00000000] Requesting prioritized load of table(s):
> tpch.lineitem
> {code}
> The catalogd logs show the following:
> {code:java}
> I0406 16:45:58.999001 9032 logging-support.cc:299] Old log file deleted
> during log rotation:
> /home/stakiar/Impala/logs/cluster/catalogd.stakiar-desktop.stakiar.log.ERROR.20200406-095134.8408
> I0406 16:46:51.311251 9241 statestore-subscriber.cc:300]
> catalog-server@stakiar-desktop:26000: Connection with statestore lost,
> entering recovery mode
> I0406 16:46:51.311297 9241 statestore-subscriber.cc:304] Trying to
> re-register with statestore, attempt: 1
> I0406 16:46:51.311861 9241 client-cache.cc:82] ReopenClient(): re-creating
> client for localhost:24000
> I0406 16:46:51.312443 9241 thrift-util.cc:96] TSocket::open() connect()
> <Host: localhost Port: 24000>Connection refused
> I0406 16:46:51.312660 9241 thrift-client.cc:78] Couldn't open transport for
> localhost:24000 (connect() failed: Connection refused)
> I0406 16:46:51.312695 9241 thrift-client.cc:94] Unable to connect to
> localhost:24000
> I0406 16:46:51.312721 9241 client-cache.h:387] RPC client failed to connect:
> Couldn't open transport for localhost:24000 (connect() failed: Connection
> refused)
> rpc: N6impala27TRegisterSubscriberResponseE
> I0406 16:46:51.312741 9241 client-cache.h:314] RPC to localhost:24000 failed
> RPC client failed to connect: Couldn't open transport for localhost:24000
> (connect() failed: Connection refused)
> I0406 16:46:51.312757 9241 client-cache.cc:174] Broken Connection, destroy
> client for localhost:24000
> I0406 16:46:54.313377 9241 thrift-util.cc:96] TSocket::open() connect()
> <Host: localhost Port: 24000>Connection refused
> I0406 16:46:54.313537 9241 thrift-client.cc:78] Couldn't open transport for
> localhost:24000 (connect() failed: Connection refused)
> I0406 16:46:54.313561 9241 thrift-client.cc:94] Unable to connect to
> localhost:24000
> ...
> I0406 16:47:11.642450 13461 TableLoader.java:60] Loading metadata for:
> tpch.lineitem (background load)
> I0406 16:47:11.642679 9211 TableLoadingMgr.java:71] Loading metadata for
> table: tpch.lineitem
> I0406 16:47:11.642757 9211 TableLoadingMgr.java:73] Remaining items in
> queue: 0. Loads in progress: 1
> I0406 16:47:11.668300 13461 HdfsTable.java:1004] Loading metadata for table
> definition and all partition(s) of tpch.lineitem (background load)
> W0406 16:47:11.690783 13461 HiveConf.java:5706] HiveConf of name
> hive.sentry.conf.url does not exist
> I0406 16:47:11.815637 13461 Table.java:420] Load Valid Write Id List Done.
> Time taken: 15.826us
> I0406 16:47:11.815695 13461 HdfsTable.java:1037] Fetching partition metadata
> from the Metastore: tpch.lineitem
> I0406 16:47:11.822021 13461 HdfsTable.java:1044] Fetched partition metadata
> from the Metastore: tpch.lineitem
> I0406 16:47:11.903692 13461 HdfsTable.java:700] Loaded file and block
> metadata for tpch.lineitem partitions: . Time taken: 33.599ms
> I0406 16:47:11.903905 13461 TableLoader.java:103] Loaded metadata for:
> tpch.lineitem (261ms)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)