[https://issues.apache.org/jira/browse/IMPALA-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136727#comment-17136727]

Joe McDonnell commented on IMPALA-9842:
---------------------------------------

Hit it on a run with the diagnostic in place:

[https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/11007/]

[https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/11007/artifact/Impala/logs_static/logs/ee_tests/minidumps/impalad/30efb5db-34a7-4a2c-c50611bf-ccf315b2.dmp_dumped/*view*/]

Two threads are still in Exec code. The first thread is blocked in 
BufferedPlanRootSink::FlushFinal():
{noformat}
Thread 441
 0  libpthread-2.23.so + 0xd360
    rax = 0xfffffffffffffe00   rdx = 0x0000000000000001
    rcx = 0x00007f6a52d02360   rbx = 0x0000000000000000
    rsi = 0x0000000000000080   rdi = 0x0000000073ff881c
    rbp = 0x00007f692d8c0b10   rsp = 0x00007f692d8c0aa0
     r8 = 0x0000000073ff8700    r9 = 0x0000000000000000
    r10 = 0x0000000000000000   r11 = 0x0000000000000246
    r12 = 0x0000000073ff8700   r13 = 0x0000000000000001
    r14 = 0x0000000000000001   r15 = 0x00000000d9b95538
    rip = 0x00007f6a52d02360
    Found by: given as instruction pointer in context
 1  impalad!impala::BufferedPlanRootSink::FlushFinal(impala::RuntimeState*) 
[buffered-plan-root-sink.cc : 111 + 0x20]
    rbp = 0x00007f692d8c0c40   rsp = 0x00007f692d8c0b20
    rip = 0x00000000028119a0
    Found by: previous frame's frame pointer
 2  impalad!impala::FragmentInstanceState::ExecInternal() 
[fragment-instance-state.cc : 437 + 0x47]
    rbx = 0x000000000281174c   rbp = 0x00007f692d8c0ee0
    rsp = 0x00007f692d8c0c50   r12 = 0x0000000073ff8700
    rip = 0x0000000002200e79
    Found by: call frame info
 3  impalad!impala::FragmentInstanceState::Exec() [fragment-instance-state.cc : 
107 + 0x16]
    rbx = 0x000000000e5935e0   rbp = 0x00007f692d8c1030
    rsp = 0x00007f692d8c0ef0   r12 = 0x0000000000000001
    rip = 0x00000000021fd0e7
    Found by: call frame info
 4  impalad!impala::QueryState::ExecFInstance(impala::FragmentInstanceState*) 
[query-state.cc : 791 + 0x19]
    rbx = 0x0000000000000001   rbp = 0x00007f692d8c1150
    rsp = 0x00007f692d8c1040   r12 = 0x0000000000000001
    rip = 0x0000000002226a24
    Found by: call frame info
 5  impalad!impala::QueryState::StartFInstances()::{lambda()#1}::operator()() 
const + 0x26
    rbx = 0x00000000000134c6   rbp = 0x00007f692d8c1170
    rsp = 0x00007f692d8c1160   r12 = 0x0000000008830ed0
    r13 = 0x00007f693fcea150   r14 = 0x000000008e21c420
    r15 = 0x00000000914c74d0   rip = 0x0000000002224e26
    Found by: call frame info{noformat}
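The frame at buffered-plan-root-sink.cc:111 suggests the sink is parked on a condition variable, waiting for a consumer that never arrives. A minimal sketch of that producer-side wait pattern (class and member names here are hypothetical stand-ins for illustration, not Impala's actual BufferedPlanRootSink internals):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative sketch only: a FlushFinal()-style producer wait. The names
// below are hypothetical, not Impala's actual BufferedPlanRootSink fields.
class SinkSketch {
 public:
  // Blocks until the consumer drains the queued rows or the query is
  // cancelled. If neither ever happens (e.g. the client already went away),
  // this wait never returns -- matching the hung Thread 441 above.
  void FlushFinal() {
    std::unique_lock<std::mutex> l(lock_);
    consumer_done_cv_.wait(l, [this] { return drained_ || cancelled_; });
  }

  // Consumer side: signals that all queued rows have been fetched.
  void MarkDrained() {
    std::lock_guard<std::mutex> l(lock_);
    drained_ = true;
    consumer_done_cv_.notify_all();
  }

  // Cancellation path: wakes the producer so FlushFinal() can return.
  void Cancel() {
    std::lock_guard<std::mutex> l(lock_);
    cancelled_ = true;
    consumer_done_cv_.notify_all();
  }

  bool cancelled() const { return cancelled_; }

 private:
  std::mutex lock_;
  std::condition_variable consumer_done_cv_;
  bool drained_ = false;
  bool cancelled_ = false;
};
```

If nothing ever calls MarkDrained() or Cancel(), FlushFinal() sits forever in the futex wait visible in frame 0 of the trace.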
The second thread is monitoring it from QueryState::MonitorFInstances():
{noformat}
Thread 440
 0  libpthread-2.23.so + 0xd709
    rax = 0xfffffffffffffdfc   rdx = 0x00000000000003d7
    rcx = 0x00007f6a52d02709   rbx = 0x0000000000000000
    rsi = 0x0000000000000089   rdi = 0x00000000867b3504
    rbp = 0x00007f693fcea5d0   rsp = 0x00007f693fcea530
     r8 = 0x00000000867b3538    r9 = 0x00000000ffffffff
    r10 = 0x00007f693fcea600   r11 = 0x0000000000000202
    r12 = 0x00000000000003d7   r13 = 0x00007f693fcea600
    r14 = 0xffffffffffffff92   r15 = 0x0000000060693200
    rip = 0x00007f6a52d02709
    Found by: given as instruction pointer in context
 1  impalad!impala::ConditionVariable::WaitFor(std::unique_lock<std::mutex>&, 
long) [condition-variable.h : 67 + 0x17]
    rbp = 0x00007f693fcea610   rsp = 0x00007f693fcea5e0
    rip = 0x000000000215c464
    Found by: previous frame's frame pointer
 2  impalad!impala::Promise<bool, (impala::PromiseMode)0>::Get(long, bool*) 
[promise.h : 108 + 0x1d]
    rbp = 0x00007f693fcea6e0   rsp = 0x00007f693fcea620
    rip = 0x000000000215fff8
    Found by: call frame info
 3  impalad!impala::TypedCountingBarrier<bool>::Wait(long, bool*) 
[counting-barrier.h : 67 + 0x17]
    rbx = 0x0000000000000000   rbp = 0x00007f693fcea710
    rsp = 0x00007f693fcea6f0   r12 = 0x00000000949a26e0
    rip = 0x000000000222a981
    Found by: call frame info
 4  impalad!impala::CountingBarrier::Wait(long, bool*) [counting-barrier.h : 
102 + 0x17]
    rbx = 0x0000000000000000   rbp = 0x00007f693fcea740
    rsp = 0x00007f693fcea720   r12 = 0x00000000949a26e0
    rip = 0x0000000002229c41
    Found by: call frame info
 5  impalad!impala::QueryState::WaitForFinishOrTimeout(int) [query-state.cc : 
643 + 0x29]
    rbx = 0x0000000000000000   rbp = 0x00007f693fcea770
    rsp = 0x00007f693fcea750   r12 = 0x00000000949a26e0
    rip = 0x0000000002224df6
    Found by: call frame info
 6  impalad!impala::QueryState::MonitorFInstances() [query-state.cc : 749 + 
0x22]
    rbx = 0x0000000000000000   rbp = 0x00007f693fcea860
    rsp = 0x00007f693fcea780   r12 = 0x00000000949a26e0
    rip = 0x0000000002226274
    Found by: call frame info
 7  impalad!impala::QueryExecMgr::ExecuteQueryHelper(impala::QueryState*) 
[query-exec-mgr.cc : 141 + 0xf]
    rbx = 0x000000008646e000   rbp = 0x00007f693fceb0b0
    rsp = 0x00007f693fcea870   r12 = 0x00000000949a26e0
    rip = 0x0000000002279a3d
    Found by: call frame info{noformat}
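MonitorFInstances() waits on a CountingBarrier that each fragment instance notifies as it finishes; the timed wait reports whether the count reached zero before the deadline. A simplified stand-in for that barrier (the real counting-barrier.h / promise.h interfaces differ; this only illustrates the wait the stack shows):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Simplified sketch of a CountingBarrier: Wait() blocks until Notify() has
// been called 'count' times or the timeout expires. Not Impala's actual
// counting-barrier.h interface.
class CountingBarrierSketch {
 public:
  explicit CountingBarrierSketch(int count) : count_(count) {}

  // Called once per finishing fragment instance.
  void Notify() {
    std::lock_guard<std::mutex> l(lock_);
    if (--count_ == 0) done_cv_.notify_all();
  }

  // Blocks until the count hits zero or timeout_ms elapses; '*timed_out'
  // reports which of the two happened.
  void Wait(long timeout_ms, bool* timed_out) {
    std::unique_lock<std::mutex> l(lock_);
    *timed_out = !done_cv_.wait_for(l, std::chrono::milliseconds(timeout_ms),
                                    [this] { return count_ == 0; });
  }

 private:
  std::mutex lock_;
  std::condition_variable done_cv_;
  int count_;
};
```

Since the FlushFinal() thread never completes its fragment instance, the barrier count never reaches zero, so this thread keeps waiting and num-fragments-in-flight stays non-zero.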

> TestValidateMetrics.test_metrics_are_zero fails with num-fragments-in-flight 
> not reaching zero
> ----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-9842
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9842
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Priority: Blocker
>              Labels: broken-build, flaky
>
> A couple of test runs have failed in TestValidateMetrics.test_metrics_are_zero: 
> {noformat}
> AssertionError: Metric value impala-server.num-fragments-in-flight did not 
> reach value 0 in 60s Dumping impalad debug pages: ...{noformat}
> The memz shows: 
> {noformat}
> Process: Limit=12.00 GB Total=2.93 GB Peak=138.77 GB
>   JVM: max heap size: Total=1.78 GB
>   JVM: non-heap committed: Total=211.75 MB
>   Buffer Pool: Free Buffers: Total=1.90 MB
>   Buffer Pool: Clean Pages: Total=0  
>   Buffer Pool: Unused Reservation: Total=-2.00 MB
>   Control Service Queue: Limit=50.00 MB Total=0 Peak=16.47 MB
>   Data Stream Service Queue: Limit=614.40 MB Total=0 Peak=636.91 MB
>   Data Stream Manager Early RPCs: Total=0 Peak=684.53 KB
>   TCMalloc Overhead: Total=347.15 MB
>   Spill-to-disk temporary compression buffers: Limit=512.00 MB Total=0 
> Peak=110.43 MB
>   RequestPool=root.default: Total=4.02 MB Peak=136.23 GB
>     Query(074a86ef88e16768:1a7bf9f400000000): Limit=300.00 MB 
> Reservation=4.00 MB Reserva$
>       Fragment 074a86ef88e16768:1a7bf9f400000000: Reservation=4.00 MB 
> OtherMemory=16.00 K$
>         HDFS_SCAN_NODE (id=0): Reservation=0 OtherMemory=0 Total=0 
> Peak=830.00 KB
>         PLAN_ROOT_SINK: Reservation=4.00 MB ReservationLimit=100.00 MB 
> OtherMemory=8.00 K$
>   RequestPool=fe-eval-exprs: Total=0 Peak=256.00 MB
>   RequestPool=root.no-limits: Total=0 Peak=188.88 MB
>   Untracked Memory: Total=616.14 MB
> {noformat}
> Query 074a86ef88e16768:1a7bf9f4 is: 
> {noformat}
> I0608 19:09:08.410117 25628 Frontend.java:1498] 
> 074a86ef88e16768:1a7bf9f400000000] Analyzing query: select * from 
> functional.alltypes limit 10 db: default{noformat}
> The interesting query options are debug_action and fetch_rows_timeout_ms. 
> These narrow it down to hs2/test_fetch_timeout.py 
> ([https://github.com/apache/impala/blob/master/tests/hs2/test_fetch_timeout.py#L207-L215]):
> {noformat}
>   11: debug_action (string) = "CRS_BEFORE_COORD_STARTS:SLEEP@5000",
> ...
>   94: fetch_rows_timeout_ms (i64) = 0,{noformat}
>  It is executing only at the coordinator: 
> {noformat}
> I0608 19:09:08.430388 25637 scheduler.cc:549] 
> 074a86ef88e16768:1a7bf9f400000000] Exec at coord is true
> I0608 19:09:08.430755 25637 admission-controller.cc:1296] 
> 074a86ef88e16768:1a7bf9f400000000] Trying to admit 
> id=074a86ef88e16768:1a7bf9f400000000 in pool_name=root.default 
> executor_group_name=default per_host_mem_estimate=20.00 MB 
> dedicated_coord_mem_estimate=120.00 MB max_requests=-1 (configured 
> statically) max_queued=200 (configured statically) max_mem=29.30 GB 
> (configured statically)
> I0608 19:09:08.430793 25637 admission-controller.cc:1308] 
> 074a86ef88e16768:1a7bf9f400000000] Stats: agg_num_running=10, 
> agg_num_queued=0, agg_mem_reserved=3.79 GB,  
> local_host(local_mem_admitted=4.67 GB, num_admitted_running=10, num_queued=0, 
> backend_mem_reserved=2.24 GB)
> I0608 19:09:08.430811 25637 admission-controller.cc:894] 
> 074a86ef88e16768:1a7bf9f400000000] Admitting query 
> id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:08.430881 25637 debug-util.cc:427] 
> 074a86ef88e16768:1a7bf9f400000000] Debug Action: 
> CRS_BEFORE_COORD_STARTS:SLEEP@5000 sleeping for 5000 ms{noformat}
>  The query starts execution: 
> {noformat}
> I0608 19:09:13.431097 25637 coordinator.cc:143] 
> 074a86ef88e16768:1a7bf9f400000000] Exec() 
> query_id=074a86ef88e16768:1a7bf9f400000000 stmt=select * from 
> functional.alltypes limit 10
> I0608 19:09:13.431520 25637 coordinator.cc:463] 
> 074a86ef88e16768:1a7bf9f400000000] starting execution on 1 backends for 
> query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.432674   114 control-service.cc:153] 
> 074a86ef88e16768:1a7bf9f400000000] ExecQueryFInstances(): 
> query_id=074a86ef88e16768:1a7bf9f400000000 coord=f0873f7637c8:22000 
> #instances=1
> ...
> I0608 19:09:13.433480 25637 coordinator.cc:520] 
> 074a86ef88e16768:1a7bf9f400000000] started execution on 1 backends for 
> query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.434790 26254 query-state.cc:756] 
> 074a86ef88e16768:1a7bf9f400000000] Executing instance. 
> instance_id=074a86ef88e16768:1a7bf9f400000000 fragment_idx=0 
> per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=11{noformat}
> The query is unregistered: 
> {noformat}
> I0608 19:09:13.448494 25628 impala-server.cc:1173] UnregisterQuery(): 
> query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.448557 25628 coordinator.cc:644] ExecState: query 
> id=074a86ef88e16768:1a7bf9f400000000 execution cancelled
> I0608 19:09:13.448580 25628 coordinator-backend-state.cc:943] 
> query_id=074a86ef88e16768:1a7bf9f400000000 target backend=172.18.0.4:27000: 
> Sending CancelQueryFInstances rpc
> I0608 19:09:13.448971   114 control-service.cc:223] CancelQueryFInstances(): 
> query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.448995   114 query-exec-mgr.cc:105] QueryState: 
> query_id=074a86ef88e16768:1a7bf9f400000000 refcnt=4
> I0608 19:09:13.449007   114 query-state.cc:782] Cancel: 
> query_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.449023   114 krpc-data-stream-mgr.cc:337] cancelling active 
> streams for fragment_instance_id=074a86ef88e16768:1a7bf9f400000000
> I0608 19:09:13.449226 25628 coordinator.cc:889] CancelBackends() 
> query_id=074a86ef88e16768:1a7bf9f400000000, tried to cancel 1 backends
> I0608 19:09:13.449291 25628 coordinator.cc:1225] Release admission control 
> resources for query_id=074a86ef88e16768:1a7bf9f400000000{noformat}
> After that, periodic status reports keep coming in for 
> 074a86ef88e16768:1a7bf9f400000000, all of which are rejected because the 
> query ID is no longer valid. 
> The original query starts at 19:09:08, while the 
> TestValidateMetrics.test_metrics_are_zero check for remaining fragments 
> happens over an hour later, at 20:14:08.
> Interestingly, this has happened on different runs *with the exact same 
> query*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
