[jira] [Commented] (IMPALA-12699) Coordinator should retry GetPartialCatalogObject request and apply a recv timeout

Quanlong Huang (Jira) Wed, 17 Jan 2024 18:48:18 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807975#comment-17807975
 ]


Quanlong Huang commented on IMPALA-12699:
-----------------------------------------

Enabling keepAlive does not seem to help, since the client is blocked receiving 
the response, i.e. the connection is not idle from the client's perspective.

I can reproduce the hang with the following steps.
 # Start an Impala cluster with LocalCatalog mode enabled, and enable rpc-trace 
logging on catalogd.
{noformat}
bin/start-impala-cluster.py --catalogd_args="--catalog_topic_mode=minimal -vmodule=rpc-trace=2" --impalad_args=--use_local_catalog {noformat}
 # Run a query so impalad will create a connection to catalogd
{noformat}
impala-shell> show tables;{noformat}
 # Get the impalad-side port of the connection (38130 in the output below). On 
some systems ss shows port 26000 as the service name "quake"; grep for "quake" 
in that case.
{noformat}
$ ss -ap | grep 26000
tcp  LISTEN  0  128  0.0.0.0:26000    0.0.0.0:*        users:(("catalogd",pid=18014,fd=409))
tcp  ESTAB   0  0    127.0.0.1:26000  127.0.0.1:38130  users:(("catalogd",pid=18014,fd=390))
tcp  ESTAB   0  0    127.0.0.1:38130  127.0.0.1:26000  users:(("impalad",pid=18064,fd=446)){noformat}
 # Use gdb to attach to catalogd and set a breakpoint where it receives the 
GetPartialCatalogObject request. Then resume the process. Keep entering "c" if 
gdb stops on segmentation faults from the JVM.
{noformat}
sudo gdb -p `pidof catalogd`
(gdb) b impala::CatalogServiceThriftIf::GetPartialCatalogObject
Breakpoint 1 at 0x100f5e8: file 
/home/quanlong/workspace/Impala/be/src/catalog/catalog-server.cc, line 300.
(gdb) c
Continuing.
[New Thread 0x7fd56667c700 (LWP 21467)]
[Thread 0x7fd562871700 (LWP 21459) exited]

Thread 25 "C1 CompilerThre" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fd56d524700 (LWP 18163)]
0x00007fd598d8d836 in ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, 
CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, 
AbstractCompiler*, int, bool, bool, RTMState) ()
   from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
(gdb) c
Continuing.
[Thread 0x7fd56657b700 (LWP 21454) exited]
[New Thread 0x7fd562871700 (LWP 21528)]
[New Thread 0x7fd56657b700 (LWP 21553)]
[Thread 0x7fd56657b700 (LWP 21553) exited]
# Now it's running{noformat}
# Run a query on an unloaded table in impala-shell
{noformat}
impala-shell> desc functional.alltypes;{noformat}
# The gdb session of catalogd will stop at the breakpoint
{noformat}
Thread 66 "catalogd" hit Breakpoint 1, 
impala::CatalogServiceThriftIf::GetPartialCatalogObject (this=0xc607eb0, 
resp=..., req=...) at 
/home/quanlong/workspace/Impala/be/src/catalog/catalog-server.cc:300
300       void GetPartialCatalogObject(TGetPartialCatalogObjectResponse& resp,
(gdb) # Don't quit here {noformat}
# Use iptables to log and drop packets sent to the impalad socket port
{noformat}
sudo iptables -A INPUT -p tcp -m tcp --dport 38130 -j LOG --log-prefix "CATALOG_PKG: "
sudo iptables -A INPUT -p tcp -m tcp --dport 38130 -j DROP
{noformat}
# Quit the gdb session to resume catalogd
{noformat}
(gdb) quit
A debugging session is active.

        Inferior 1 [process 18014] will be detached.

Quit anyway? (y or n) y
Detaching from program: 
/home/quanlong/workspace/Impala/be/build/debug/service/impalad, process 18014
[Inferior 1 (process 18014) detached]{noformat}
# Wait around 15 minutes. catalogd logs the following when it closes the connection
{noformat}
I0118 10:30:27.898712 20847 thrift-util.cc:198] TAcceptQueueServer client died: 
THRIFT_ETIMEDOUT{noformat}
# Now only impalad still has the connection open
{noformat}
$ ss -ap | grep 26000
tcp  LISTEN  0  128  0.0.0.0:26000    0.0.0.0:*        users:(("catalogd",pid=18014,fd=409))
tcp  ESTAB   0  0    127.0.0.1:38130  127.0.0.1:26000  users:(("impalad",pid=18064,fd=446)){noformat}
# We can check the dropped packets with "grep CATALOG_PKG /var/log/syslog"
# Delete the rules in iptables
{noformat}
$ sudo iptables -L INPUT -n --line-numbers
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination         
1    LOG        tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:38130 LOG flags 0 level 4 prefix "CATALOG_PKG: "
2    DROP       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:38130
$ sudo iptables -D INPUT 2
$ sudo iptables -D INPUT 1{noformat}
# The query in impalad still hangs even though the network has recovered
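
The effect of the missing recv timeout can be imitated without iptables. The following is a minimal Python sketch, not Impala code: a hypothetical "silent server" stands in for a catalogd whose reply never reaches the client, and `request_with_recv_timeout` is an illustrative name. With a receive timeout the blocked read fails after a bounded wait; without one it would wait indefinitely, as impalad does in the steps above.

```python
import socket
import threading


def start_silent_server():
    """Listen on an ephemeral localhost port, accept one client, never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    release = threading.Event()

    def serve():
        conn, _ = srv.accept()
        conn.recv(1024)   # read the request ...
        release.wait()    # ... but send nothing back until released
        conn.close()
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1], release


def request_with_recv_timeout(port, timeout_s):
    """Send a request and wait at most timeout_s seconds for the reply."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.settimeout(timeout_s)  # bounds every blocking call on this socket
        sock.sendall(b"request")
        try:
            return sock.recv(1024)
        except socket.timeout:
            return None  # no reply within the limit


port, release = start_silent_server()
result = request_with_recv_timeout(port, 0.5)
release.set()  # let the server thread finish
print("reply:", result)  # prints "reply: None"
```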

> Coordinator should retry GetPartialCatalogObject request and apply a recv 
> timeout
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-12699
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12699
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Priority: Critical
>
> We have seen trivial GetPartialCatalogObject RPCs hang on the coordinator 
> side, e.g. IMPALA-11409. Due to the piggyback mechanism of fetching metadata 
> in local-catalog mode (see IMPALA-7534 or comments in 
> CatalogdMetaProvider#loadWithCaching()), a hanging RPC on shared metadata 
> (e.g. db list or table list of a db) could block other queries.
> We have also seen thrift RPCs hanging in IMPALA-3575. Since 
> GetPartialCatalogObject RPCs are read-only requests, they can be cleanly 
> retried. We should consider using a dedicated catalogd client cache for 
> GetPartialCatalogObject requests and set an appropriate timeout for the 
> socket.
> The current catalogd client cache:
> https://github.com/apache/impala/blob/cdac777c51febc99500b8426c2b3aabc7e9addd7/be/src/runtime/exec-env.cc#L224-L226
> The related flags:
> https://github.com/apache/impala/blob/cdac777c51febc99500b8426c2b3aabc7e9addd7/be/src/runtime/exec-env.cc#L161-L167
> CC [~wzhou]
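
As a rough illustration of the proposal in the description (not Impala's implementation), a retry loop around a read-only request with a per-attempt timeout might look like the Python sketch below. The names `call_with_retry`, `fetch`, `max_attempts` and `timeout_s` are hypothetical, and `flaky_fetch` is a stand-in that simulates two receive timeouts before succeeding.

```python
import socket


def call_with_retry(fetch, max_attempts=3, timeout_s=1.0):
    """Call fetch(timeout_s) until it succeeds or the attempts run out.

    Retrying is only safe because the request is assumed to be read-only,
    so repeating it has no side effects.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch(timeout_s)
        except (socket.timeout, ConnectionError) as e:
            last_error = e  # remember the failure and try again
    raise last_error


# Hypothetical stand-in for the RPC: times out twice, then returns a result.
calls = {"n": 0}


def flaky_fetch(timeout_s):
    calls["n"] += 1
    if calls["n"] < 3:
        raise socket.timeout("recv timed out")
    return {"tables": ["alltypes"]}


result = call_with_retry(flaky_fetch)
print(result, "after", calls["n"], "attempts")
```

In a real client each attempt would presumably need a fresh connection, since the one that timed out may be half-dead like the impalad socket in the reproduction steps.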



