Sahil Takiar created IMPALA-9610:
------------------------------------
Summary: Crashed statestore causes queries to hang
Key: IMPALA-9610
URL: https://issues.apache.org/jira/browse/IMPALA-9610
Project: IMPALA
Issue Type: Bug
Reporter: Sahil Takiar
Assignee: Sahil Takiar
The following steps on master cause a query to hang:
{code:java}
./bin/start-impala-cluster.py
ps aux | grep -i statestore
kill -9 8968
# wait until the impala coordinator has noticed that the statestore has been
killed (e.g. wait for statestore-subscriber.connected to be False)
./bin/impala-shell.sh
[localhost:21000] default> select count(l_comment) from tpch.lineitem;
{code}
The query hangs in the "CREATED" state.
>From the coordinator logs:
{code:java}
I0406 16:46:51.677052 10058 statestore-subscriber.cc:300]
impalad@stakiar-desktop:22000: Connection with statestore lost, entering
recovery mode
I0406 16:46:51.677093 10058 statestore-subscriber.cc:304] Trying to re-register
with statestore, attempt: 1
I0406 16:46:51.677412 10058 client-cache.cc:82] ReopenClient(): re-creating
client for localhost:24000
I0406 16:46:51.677911 10058 thrift-util.cc:96] TSocket::open() connect() <Host:
localhost Port: 24000>Connection refused
I0406 16:46:51.678071 10058 thrift-client.cc:78] Couldn't open transport for
localhost:24000 (connect() failed: Connection refused)
I0406 16:46:51.678092 10058 thrift-client.cc:94] Unable to connect to
localhost:24000
I0406 16:46:51.678114 10058 client-cache.h:387] RPC client failed to connect:
Couldn't open transport for localhost:24000 (connect() failed: Connection
refused)
rpc: N6impala27TRegisterSubscriberResponseE
I0406 16:46:51.678135 10058 client-cache.h:314] RPC to localhost:24000 failed
RPC client failed to connect: Couldn't open transport for localhost:24000
(connect() failed: Connection refused)
I0406 16:46:51.678151 10058 client-cache.cc:174] Broken Connection, destroy
client for localhost:24000
I0406 16:46:54.678651 10058 thrift-util.cc:96] TSocket::open() connect() <Host:
localhost Port: 24000>Connection refused
I0406 16:46:54.678774 10058 thrift-client.cc:78] Couldn't open transport for
localhost:24000 (connect() failed: Connection refused)
I0406 16:46:54.678794 10058 thrift-client.cc:94] Unable to connect to
localhost:24000
...
{code}
The "Unable to connect to localhost:24000" pattern continues indefinitely
(presumably until the statestore comes back up).
When the coordinator gets a query from the client, it is able to successfully
parse / plan it, but hangs when trying to load metadata:
{code:java}
I0406 16:47:11.461125 11181 impala-server.cc:1053]
3c4fe54fcc25e7db:99c9bd7b00000000] Registered query
query_id=3c4fe54fcc25e7db:99c9bd7b00000000
session_id=d8416b340cdf66b6:fc432571b1d658bb
I0406 16:47:11.521435 11181 Frontend.java:1490]
3c4fe54fcc25e7db:99c9bd7b00000000] Analyzing query: select count(l_comment)
from tpch.lineitem db: default
I0406 16:47:11.627283 11181 FeSupport.java:317]
3c4fe54fcc25e7db:99c9bd7b00000000] Requesting prioritized load of table(s):
tpch.lineitem
...
I0406 16:47:31.648221 11181 StmtMetadataLoader.java:202]
3c4fe54fcc25e7db:99c9bd7b00000000] Waiting for table metadata. Waited for 10
catalog updates and 20020ms. Tables remaining: [tpch.lineitem]
...
I0406 16:52:31.696449 11181 StmtMetadataLoader.java:202]
3c4fe54fcc25e7db:99c9bd7b00000000] Waiting for table metadata. Waited for 160
catalog updates and 320069ms. Tables remaining: [tpch.lineitem]
....
I0406 16:52:33.697089 11181 StmtMetadataLoader.java:223]
3c4fe54fcc25e7db:99c9bd7b00000000] Re-sending prioritized load request. Waited
for 160 catalog updates and 322069ms.
I0406 16:52:33.697333 11181 FeSupport.java:317]
3c4fe54fcc25e7db:99c9bd7b00000000] Requesting prioritized load of table(s):
tpch.lineitem
{code}
The catalogd logs show the following:
{code:java}
I0406 16:45:58.999001 9032 logging-support.cc:299] Old log file deleted during
log rotation:
/home/stakiar/Impala/logs/cluster/catalogd.stakiar-desktop.stakiar.log.ERROR.20200406-095134.8408
I0406 16:46:51.311251 9241 statestore-subscriber.cc:300]
catalog-server@stakiar-desktop:26000: Connection with statestore lost, entering
recovery mode
I0406 16:46:51.311297 9241 statestore-subscriber.cc:304] Trying to re-register
with statestore, attempt: 1
I0406 16:46:51.311861 9241 client-cache.cc:82] ReopenClient(): re-creating
client for localhost:24000
I0406 16:46:51.312443 9241 thrift-util.cc:96] TSocket::open() connect() <Host:
localhost Port: 24000>Connection refused
I0406 16:46:51.312660 9241 thrift-client.cc:78] Couldn't open transport for
localhost:24000 (connect() failed: Connection refused)
I0406 16:46:51.312695 9241 thrift-client.cc:94] Unable to connect to
localhost:24000
I0406 16:46:51.312721 9241 client-cache.h:387] RPC client failed to connect:
Couldn't open transport for localhost:24000 (connect() failed: Connection
refused)
rpc: N6impala27TRegisterSubscriberResponseE
I0406 16:46:51.312741 9241 client-cache.h:314] RPC to localhost:24000 failed
RPC client failed to connect: Couldn't open transport for localhost:24000
(connect() failed: Connection refused)
I0406 16:46:51.312757 9241 client-cache.cc:174] Broken Connection, destroy
client for localhost:24000
I0406 16:46:54.313377 9241 thrift-util.cc:96] TSocket::open() connect() <Host:
localhost Port: 24000>Connection refused
I0406 16:46:54.313537 9241 thrift-client.cc:78] Couldn't open transport for
localhost:24000 (connect() failed: Connection refused)
I0406 16:46:54.313561 9241 thrift-client.cc:94] Unable to connect to
localhost:24000
...
I0406 16:47:11.642450 13461 TableLoader.java:60] Loading metadata for:
tpch.lineitem (background load)
I0406 16:47:11.642679 9211 TableLoadingMgr.java:71] Loading metadata for
table: tpch.lineitem
I0406 16:47:11.642757 9211 TableLoadingMgr.java:73] Remaining items in queue:
0. Loads in progress: 1
I0406 16:47:11.668300 13461 HdfsTable.java:1004] Loading metadata for table
definition and all partition(s) of tpch.lineitem (background load)
W0406 16:47:11.690783 13461 HiveConf.java:5706] HiveConf of name
hive.sentry.conf.url does not exist
I0406 16:47:11.815637 13461 Table.java:420] Load Valid Write Id List Done. Time
taken: 15.826us
I0406 16:47:11.815695 13461 HdfsTable.java:1037] Fetching partition metadata
from the Metastore: tpch.lineitem
I0406 16:47:11.822021 13461 HdfsTable.java:1044] Fetched partition metadata
from the Metastore: tpch.lineitem
I0406 16:47:11.903692 13461 HdfsTable.java:700] Loaded file and block metadata
for tpch.lineitem partitions: . Time taken: 33.599ms
I0406 16:47:11.903905 13461 TableLoader.java:103] Loaded metadata for:
tpch.lineitem (261ms)
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)