I have a very simple Spark Connect 
<https://spark.apache.org/docs/latest/spark-connect-overview.html> Python 
script that is showing very strange behavior that I am having trouble 
debugging. Spark Connect uses gRPC under the hood to communicate with a 
remote Apache Spark cluster. The cluster I’m pointed at is on a private 
network, so sharing the script is probably not helpful since you won’t be 
able to run it.

The gist of the issue is that this script connects to the cluster and runs 
a simple test query. (Think SELECT 1against a database.)

When I run the script from my home network, it takes 20-30 seconds to run. 
When I tether my workstation to my phone and run the same script, it takes 
1-2 seconds. So the issue is somehow connected to my home network.

I suspected DNS at first, but switching between Cloudflare, Quad9, and 
Google DNS all yield the same result on my home network.

I profiled my script using cProfile and the call that came up as taking all 
the time is this one:
method 'next_event' of 'grpc._cython.cygrpc.SegregatedCall' 

This call is made via this Spark Connect method 
<https://github.com/apache/spark/blob/v3.5.0/python/pyspark/sql/connect/client/core.py#L1351-L1378>
.

So I enabled gRPC tracing as described in the troubleshooting guide 
<https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md> to dig 
further:
GRPC_VERBOSITY=debug GRPC_TRACE=all python -u test.py &> home-network.txt 
GRPC_VERBOSITY=debug GRPC_TRACE=all python -u test.py &> lte-network.txt 

I would share the full output but it’s a lot of noise mixed with some 
secrets and keys that I would have to redact. But I did find what I believe 
to be the key difference between these two traces.

home-network.txt has a bunch of lines that lte-network.txt doesn’t have, 
which look like they are related to a network issue:
I0104 16:17:04.205512000 140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403024, tv_nsec: 405510000, clock_type: 1 }, reserved=0x0) 
D0104 16:17:04.249492000 140704379816832 grpc_ares_wrapper.cc:366] (c-ares 
resolver) request:0x7fb63e1d38f0 readable on c-ares fd: 8 D0104 
16:17:04.249705000 140704379816832 grpc_ares_wrapper.cc:670] (c-ares 
resolver) request:0x7fb63e1d38f0 on_hostbyname_done_locked qtype=A 
host=dbc-REDACTED.cloud.databricks.com ARES_SUCCESS D0104 
16:17:04.249733000 140704379816832 grpc_ares_wrapper.cc:712] (c-ares 
resolver) request:0x7fb63e1d38f0 c-ares resolver gets a AF_INET result: 
addr: 44.REDACTED port: 443 D0104 16:17:04.249763000 140704379816832 
grpc_ares_wrapper.cc:193] (c-ares resolver) request:0x7fb63e1d38f0 Ref 
ev_driver 0x7fb63e1fcd60 D0104 16:17:04.249781000 140704379816832 
grpc_ares_wrapper.cc:449] (c-ares resolver) request:0x7fb63e1d38f0 notify 
read on: c-ares fd: 8 D0104 16:17:04.249794000 140704379816832 
grpc_ares_wrapper.cc:204] (c-ares resolver) request:0x7fb63e1d38f0 Unref 
ev_driver 0x7fb63e1fcd60 D0104 16:17:04.267750000 140704379816832 
grpc_ares_wrapper.cc:366] (c-ares resolver) request:0x7fb63e1d38f0 readable 
on c-ares fd: 8 D0104 16:17:04.267957000 140704379816832 
grpc_ares_wrapper.cc:193] (c-ares resolver) request:0x7fb63e1d38f0 Ref 
ev_driver 0x7fb63e1fcd60 D0104 16:17:04.268002000 140704379816832 
grpc_ares_wrapper.cc:449] (c-ares resolver) request:0x7fb63e1d38f0 notify 
read on: c-ares fd: 8 D0104 16:17:04.268026000 140704379816832 
grpc_ares_wrapper.cc:204] (c-ares resolver) request:0x7fb63e1d38f0 Unref 
ev_driver 0x7fb63e1fcd60 # NOTE: Both traces are more or less identical up 
to this point. # The following part, however, is specific to 
home-network.txt. I0104 16:17:04.407392000 140704379816832 
completion_queue.cc:1069] RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT I0104 
16:17:04.407523000 140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403024, tv_nsec: 607516000, clock_type: 1 }, reserved=0x0) 
I0104 16:17:04.609106000 140704379816832 completion_queue.cc:1069] 
RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT I0104 16:17:04.609209000 
140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403024, tv_nsec: 809195000, clock_type: 1 }, reserved=0x0) 
I0104 16:17:04.809394000 140704379816832 completion_queue.cc:1069] 
RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT I0104 16:17:04.809430000 
140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403025, tv_nsec: 9426000, clock_type: 1 }, reserved=0x0) I0104 
16:17:05.011529000 140704379816832 completion_queue.cc:1069] 
RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT I0104 16:17:05.011597000 
140704379816832 completion_queue.cc:965] 
grpc_completion_queue_next(cq=0x7fb61df17470, deadline=gpr_timespec { 
tv_sec: 1704403025, tv_nsec: 211592000, clock_type: 1 }, reserved=0x0) 
I0104 16:17:05.201301000 123145868943360 timer_manager.cc:207] wait ended: 
was_timed:1 kicked:0 D0104 16:17:05.201404000 123145868943360 
timer_generic.cc:696] TIMER CHECK BEGIN: now=2615 next=9223372036854775807 
tls_min=1609 glob_min=2610 

There are 113 instances of RETURN_EVENT[0x7fb61df17470]: QUEUE_TIMEOUT in 
home-network.txt and none at all in lte-network.txt.

So I have a lead, but I don’t know where to go from here. None of my 
coworkers have this issue.

What’s the meaning of QUEUE_TIMEOUT? How can I debug this further?

-- 
You received this message because you are subscribed to the Google Groups 
"grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to grpc-io+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/grpc-io/2151eaf3-33c8-473b-9517-fd05f0638335n%40googlegroups.com.

Reply via email to