[ https://issues.apache.org/jira/browse/IMPALA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807975#comment-17807975 ]
Quanlong Huang commented on IMPALA-12699:
-----------------------------------------

It seems enabling keepAlive can't help, since the client hangs while receiving the response, i.e. the connection is not idle from the client's perspective.

I can reproduce the hang with the following steps.
# Start an Impala cluster with LocalCatalog mode enabled. Also enable rpc-trace logging on catalogd.
{noformat}
bin/start-impala-cluster.py --catalogd_args="--catalog_topic_mode=minimal -vmodule=rpc-trace=2" --impalad_args=--use_local_catalog
{noformat}
# Run a query so that impalad creates a connection to catalogd.
{noformat}
impala-shell> show tables;
{noformat}
# Get the impalad-side port of the connection (38130 in the output below). On some OSes, port 26000 is shown as "quake", so grep for "quake" in that case.
{noformat}
$ ss -ap | grep 26000
tcp LISTEN 0 128 0.0.0.0:26000 0.0.0.0:* users:(("catalogd",pid=18014,fd=409))
tcp ESTAB 0 0 127.0.0.1:26000 127.0.0.1:38130 users:(("catalogd",pid=18014,fd=390))
tcp ESTAB 0 0 127.0.0.1:38130 127.0.0.1:26000 users:(("impalad",pid=18064,fd=446))
{noformat}
# Use gdb to attach to catalogd and set a breakpoint where it receives the GetPartialCatalogObject request. Then resume the process. Keep pressing ENTER if gdb stops at segmentation faults from the JVM.
{noformat}
sudo gdb -p `pidof catalogd`
(gdb) b impala::CatalogServiceThriftIf::GetPartialCatalogObject
Breakpoint 1 at 0x100f5e8: file /home/quanlong/workspace/Impala/be/src/catalog/catalog-server.cc, line 300.
(gdb) c
Continuing.
[New Thread 0x7fd56667c700 (LWP 21467)]
[Thread 0x7fd562871700 (LWP 21459) exited]
Thread 25 "C1 CompilerThre" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fd56d524700 (LWP 18163)]
0x00007fd598d8d836 in ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, int, bool, bool, RTMState) () from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
(gdb) c
Continuing.
[Thread 0x7fd56657b700 (LWP 21454) exited]
[New Thread 0x7fd562871700 (LWP 21528)]
[New Thread 0x7fd56657b700 (LWP 21553)]
[Thread 0x7fd56657b700 (LWP 21553) exited]
# Now it's running
{noformat}
# Run a query on an unloaded table in impala-shell.
{noformat}
impala-shell> desc functional.alltypes;
{noformat}
# The gdb session of catalogd will stop at the breakpoint.
{noformat}
Thread 66 "catalogd" hit Breakpoint 1, impala::CatalogServiceThriftIf::GetPartialCatalogObject (this=0xc607eb0, resp=..., req=...) at /home/quanlong/workspace/Impala/be/src/catalog/catalog-server.cc:300
300 void GetPartialCatalogObject(TGetPartialCatalogObjectResponse& resp,
(gdb) # Don't quit here
{noformat}
# Use iptables to log and drop packets sent to the impalad socket port.
{noformat}
sudo iptables -A INPUT -p tcp -m tcp --dport 38130 -j LOG --log-prefix "CATALOG_PKG: "
sudo iptables -A INPUT -p tcp -m tcp --dport 38130 -j DROP
{noformat}
# Quit the gdb session to resume catalogd.
{noformat}
(gdb) quit
A debugging session is active.
Inferior 1 [process 18014] will be detached.
Quit anyway? (y or n) y
Detaching from program: /home/quanlong/workspace/Impala/be/build/debug/service/impalad, process 18014
[Inferior 1 (process 18014) detached]
{noformat}
# Wait around 15 minutes. catalogd logs the following when it closes the connection.
{noformat}
I0118 10:30:27.898712 20847 thrift-util.cc:198] TAcceptQueueServer client died: THRIFT_ETIMEDOUT
{noformat}
# Now only impalad still has the connection open.
{noformat}
$ ss -ap | grep 26000
tcp LISTEN 0 128 0.0.0.0:26000 0.0.0.0:* users:(("catalogd",pid=18014,fd=409))
tcp ESTAB 0 0 127.0.0.1:38130 127.0.0.1:26000 users:(("impalad",pid=18064,fd=446))
{noformat}
# We can check the dropped packets with "grep CATALOG_PKG /var/log/syslog".
# Delete the rules in iptables.
{noformat}
$ sudo iptables -L INPUT -n --line-numbers
Chain INPUT (policy ACCEPT)
num  target  prot opt source     destination
1    LOG     tcp  --  0.0.0.0/0  0.0.0.0/0    tcp dpt:38130 LOG flags 0 level 4 prefix "CATALOG_PKG: "
2    DROP    tcp  --  0.0.0.0/0  0.0.0.0/0    tcp dpt:38130
$ sudo iptables -D INPUT 2
$ sudo iptables -D INPUT 1
{noformat}
# The query in impalad still hangs even though the network has recovered.

> Coordinator should retry GetPartialCatalogObject request and apply a recv timeout
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-12699
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12699
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Priority: Critical
>
> We have seen trivial GetPartialCatalogObject RPCs hanging on the coordinator
> side, e.g. IMPALA-11409. Due to the piggyback mechanism of fetching metadata
> in local-catalog mode (see IMPALA-7534 or comments in
> CatalogdMetaProvider#loadWithCaching()), a hanging RPC on shared metadata
> (e.g. db list or table list of a db) could block other queries.
> We have also seen thrift RPCs hanging in IMPALA-3575. In fact,
> GetPartialCatalogObject RPCs are read-only requests. They can be cleanly
> retried.
> We should consider using a dedicated catalogd client cache for
> GetPartialCatalogObject requests and setting an appropriate timeout for the
> socket.
> The current catalogd client cache:
> https://github.com/apache/impala/blob/cdac777c51febc99500b8426c2b3aabc7e9addd7/be/src/runtime/exec-env.cc#L224-L226
> The related flags:
> https://github.com/apache/impala/blob/cdac777c51febc99500b8426c2b3aabc7e9addd7/be/src/runtime/exec-env.cc#L161-L167
> CC [~wzhou]
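
The proposal in the issue description (a recv timeout on the socket plus a retry of the read-only request) can be illustrated with a small standalone sketch. This is plain Python sockets, not Impala's C++ Thrift client, and every name in it (call_with_retry, start_flaky_server, the "ping"/"pong" payloads) is hypothetical and chosen only for illustration. The flaky server stands in for the situation reproduced above: it reads the request but its response never reaches the client.

```python
import socket
import threading


def call_with_retry(addr, payload, recv_timeout_s=0.5, max_attempts=3):
    """Send payload and wait for a reply, reconnecting after a recv timeout.

    Only appropriate for idempotent (read-only) requests, because a request
    that timed out may still have been executed by the server.
    Returns (reply_bytes, attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Use a fresh connection per attempt; the previous one may be half-dead.
        # The timeout applies to connect() and to later send/recv calls.
        with socket.create_connection(addr, timeout=recv_timeout_s) as sock:
            sock.sendall(payload)
            try:
                reply = sock.recv(4096)  # raises socket.timeout if nothing arrives
            except socket.timeout as e:
                last_error = e
                continue  # leaving the with-block closes the socket
            if reply:
                return reply, attempt
            last_error = ConnectionError("peer closed without replying")
    raise TimeoutError(f"no reply after {max_attempts} attempts") from last_error


def start_flaky_server(drop_first_n=1):
    """Hypothetical stand-in for the server: never answers the first N requests."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    held = []  # keep silent connections open so the client sees no FIN

    def serve():
        seen = 0
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # listener was closed
            conn.recv(4096)  # read the request
            seen += 1
            if seen <= drop_first_n:
                held.append(conn)  # simulate a response that never arrives
            else:
                conn.sendall(b"pong")
                conn.close()

    threading.Thread(target=serve, daemon=True).start()
    return listener


if __name__ == "__main__":
    server = start_flaky_server(drop_first_n=1)
    reply, attempts = call_with_retry(server.getsockname(), b"ping", recv_timeout_s=0.2)
    print(reply, attempts)
    server.close()
```

Without the timeout, the first recv() in this sketch would block indefinitely, which mirrors the hang in the last step of the reproduction. With it, the first attempt gives up after recv_timeout_s and the second attempt on a new connection succeeds. Choosing the actual timeout and retry count for GetPartialCatalogObject is a separate question, since a slow but healthy catalogd response must not be mistaken for a lost one.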