[
https://issues.apache.org/jira/browse/IMPALA-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17111509#comment-17111509
]
Sahil Takiar commented on IMPALA-9755:
--------------------------------------
I started looking into this, and can reproduce it locally. I think this was
caused by IMPALA-6984 (coordinator cancels backends on EOS). IIUC its basically
what Thomas said in this comment about flaky tests:
{quote}The reason is that runtime profiles for queries that are cancelled in
this way (eg. queries with LIMITs) are going to be non-deterministic, since
they may not finish processing and things like counters of rows read may vary,
and we have at least some tests that rely on profiles being deterministic for
successful queries.
{quote}
The query in question has a limit:
{code:java}
select count(*) from tpch_parquet.orders o inner join tpch_parquet.lineitem l
on o.o_orderkey = l.l_orderkey group by o.o_clerk limit 10 {code}
Basically, this test expects all entries for TotalBytesSent to be non-zero.
However, that doesn't always seem to be the case. Here is a snippet of the
runtime profile when this test failed locally:
{code:java}
Instance b845c723832e5033:ee960ede0000000b (host=stakiar-desktop:22002):(Total:
26s679ms, non-child: 832.242ms, % non-child: 3.12%)
Last report received time: 2020-05-19 12:50:18.219
Fragment Instance Lifecycle Event Timeline: 26s720ms
- Prepare Finished: 38.860ms (38.860ms)
- ExecInternal Finished: 26s720ms (26s681ms)
- MemoryUsage (500.000ms): 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB,
35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB,
35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB,
35.22 KB, 35.22 KB
, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22
KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22
KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22 KB, 35.22
KB, 35.22 KB, 35.22 KB, 35.22 KB
, 35.22 KB, 35.22 KB, 35.22 KB
- ThreadUsage (500.000ms): 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
- AverageThreadTokens: 1.00
- ExchangeScanRatio: 0.00
- PeakMemoryUsage: 2.03 MB (2133216)
- PeakReservation: 1.94 MB (2031616)
- PeakUsedReservation: 0
- PerHostPeakMemUsage: 75.41 MB (79076718)
- RowsProduced: 0 (0)
- TotalNetworkReceiveTime: 25s840ms
- TotalNetworkSendTime: 0.000ns
- TotalStorageWaitTime: 0.000ns
- TotalThreadsInvoluntaryContextSwitches: 193 (193)
- TotalThreadsTotalWallClockTime: 26s678ms
- TotalThreadsSysTime: 0.000ns
- TotalThreadsUserTime: 252.402ms
- TotalThreadsVoluntaryContextSwitches: 6 (6)
Fragment Instance Lifecycle Timings:
- OpenTime: 26s678ms
- ExecTreeOpenTime: 25s846ms
- PrepareTime: 689.756us
- ExecTreePrepareTime: 438.094us
KrpcDataStreamSender (dst_id=8):(Total: 87.778us, non-child: 87.778us, %
non-child: 100.00%)
- BytesSent (500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- NetworkThroughput: 0.00 /sec (Number of samples: 0)
- RpcNetworkTime: 0.000ns (Number of samples: 0)
- RpcRecvrTime: 0.000ns (Number of samples: 0)
- EosSent: 0 (0)
- PeakMemoryUsage: 6.09 KB (6240)
- RowsSent: 0 (0)
- RpcFailure: 0 (0)
- RpcRetry: 0 (0)
- SerializeBatchTime: 0.000ns
- TotalBytesSent: 0
- UncompressedRowBatchSize: 0
AGGREGATION_NODE (id=7):(Total: 25s846ms, non-child: 5.028ms, % non-child:
0.02%)
Node Lifecycle Event Timeline: 26s720ms
- Open Started: 870.838ms (870.838ms)
- Open Finished: 26s717ms (25s846ms)
- Closed: 26s720ms (2.893ms)
- PeakMemoryUsage: 1.97 MB (2061440)
- RowsReturned: 0 (0)
- RowsReturnedRate: 0
GroupingAggregator 0:
ExecOption: Codegen Enabled
- BuildTime: 1.652ms
- GetResultsTime: 0.000ns
- HTResizeTime: 573.282us
- LargestPartitionPercent: 0 (0)
- MaxPartitionLevel: 0 (0)
- NumRepartitions: 0 (0)
- PartitionsCreated: 16 (16)
- PeakMemoryUsage: 1.96 MB (2057344)
- PeakReservation: 1.94 MB (2031616)
- PeakUsedReservation: 0
- RowsRepartitioned: 0 (0)
- RowsReturned: 0 (0)
- SpilledPartitions: 0 (0)
Buffer pool:
- AllocTime: 207.359us
- CompressionTime: 0.000ns
- CumulativeAllocationBytes: 1.56 MB (1638400)
- CumulativeAllocations: 25 (25)
- EncryptionTime: 0.000ns
- PeakReservation: 1.94 MB (2031616)
- PeakUnpinnedBytes: 0
- PeakUsedReservation: 1.56 MB (1638400)
- ReadIoBytes: 0
- ReadIoOps: 0 (0)
- ReadIoWaitTime: 0.000ns
- SystemAllocTime: 165.527us
- WriteIoBytes: 0
- WriteIoOps: 0 (0)
- WriteIoWaitTime: 0.000ns
Hash Table:
- HashBuckets: 0 (0)
- HashCollisions: 0 (0)
- Probes: 1.02K (1024)
- Resizes: 16 (16)
- Travel: 18 (18)
EXCHANGE_NODE (id=6):(Total: 25s841ms, non-child: 19.530ms, % non-child:
0.08%)
Node Lifecycle Event Timeline: 26s720ms
- Open Started: 870.843ms (870.843ms)
- Open Finished: 26s633ms (25s762ms)
- First Batch Requested: 26s636ms (2.899ms)
- First Batch Returned: 26s697ms (61.176ms)
- Closed: 26s720ms (22.751ms)
- ConvertRowBatchTime: 1.313ms
- PeakMemoryUsage: 56.00 KB (57344)
- RowsReturned: 1.05K (1053)
- RowsReturnedRate: 40.00 /sec
Buffer pool:
- AllocTime: 26.359us
- CompressionTime: 0.000ns
- CumulativeAllocationBytes: 72.00 KB (73728)
- CumulativeAllocations: 6 (6)
- EncryptionTime: 0.000ns
- PeakReservation: 56.00 KB (57344)
- PeakUnpinnedBytes: 0
- PeakUsedReservation: 56.00 KB (57344)
- ReadIoBytes: 0
- ReadIoOps: 0 (0)
- ReadIoWaitTime: 0.000ns
- SystemAllocTime: 0.000ns
- WriteIoBytes: 0
- WriteIoOps: 0 (0)
- WriteIoWaitTime: 0.000ns
Dequeue:
- BytesDequeued (500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- FirstBatchWaitTime: 25s762ms
- TotalBytesDequeued: 45.25 KB (46332)
- TotalGetBatchTime: 25s840ms
- DataWaitTime: 25s822ms
Enqueue:
- BytesReceived (500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- DeferredQueueSize (500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
- DispatchTime: (Avg: -77029.000ns ; Min: -97284.000ns ; Max:
-64015.000ns ; Number of samples: 3)
- DeserializeRowBatchTime: 458.512us
- TotalBatchesEnqueued: 3 (3)
- TotalBatchesReceived: 3 (3)
- TotalBytesReceived: 16.45 KB (16845)
- TotalEarlySenders: 0 (0)
- TotalEosReceived: 2 (2)
- TotalHasDeferredRPCsTime: 0.000ns
- TotalRPCsDeferred: 0 (0)
{code}
The KrpcDataStreamSender has TotalBytesSent = 0 because it sends no rows. I
suspect because it got cancelled before it could send any data.
The coordinator logs for this query:
{code:java}
I0519 12:50:18.210763 25363 coordinator.cc:644] ExecState: query
id=b845c723832e5033:ee960ede00000000 execution completed
I0519 12:50:18.210934 25363 coordinator-backend-state.cc:934]
query_id=b845c723832e5033:ee960ede00000000 target backend=127.0.0.1:27002:
Sending CancelQueryFInstances rpc
I0519 12:50:18.213335 25378 krpc-data-stream-mgr.cc:306]
b845c723832e5033:ee960ede0000000a] DeregisterRecvr():
fragment_instance_id=b845c723832e5033:ee960ede0000000a, node=6
I0519 12:50:18.213862 29230 coordinator.cc:917] Backend completed:
host=stakiar-desktop:22001 remaining=3
query_id=b845c723832e5033:ee960ede00000000
I0519 12:50:18.217818 25363 coordinator-backend-state.cc:934]
query_id=b845c723832e5033:ee960ede00000000 target backend=127.0.0.1:27001: Not
cancelling because the backend is already done:
I0519 12:50:18.217846 25363 coordinator-backend-state.cc:934]
query_id=b845c723832e5033:ee960ede00000000 target backend=127.0.0.1:27000:
Sending CancelQueryFInstances rpc
I0519 12:50:18.217949 29230 coordinator-backend-state.cc:398]
query_id=b845c723832e5033:ee960ede00000000: first in-progress backend:
stakiar-desktop:22002
I0519 12:50:18.221185 29235 coordinator.cc:917] Backend completed:
host=stakiar-desktop:22002 remaining=2
query_id=b845c723832e5033:ee960ede00000000
I0519 12:50:18.232980 29230 control-service.cc:223] CancelQueryFInstances():
query_id=b845c723832e5033:ee960ede00000000
I0519 12:50:18.233013 29230 query-exec-mgr.cc:105] QueryState:
query_id=b845c723832e5033:ee960ede00000000 refcnt=5
I0519 12:50:18.233026 29230 query-state.cc:773] Cancel:
query_id=b845c723832e5033:ee960ede00000000
I0519 12:50:18.233052 29230 krpc-data-stream-mgr.cc:337] cancelling active
streams for fragment_instance_id=b845c723832e5033:ee960ede00000000
I0519 12:50:18.233069 29230 krpc-data-stream-mgr.cc:337] cancelling active
streams for fragment_instance_id=b845c723832e5033:ee960ede0000000a
I0519 12:50:18.233101 29230 krpc-data-stream-mgr.cc:337] cancelling active
streams for fragment_instance_id=b845c723832e5033:ee960ede00000007
I0519 12:50:18.233114 29230 krpc-data-stream-mgr.cc:337] cancelling active
streams for fragment_instance_id=b845c723832e5033:ee960ede00000002
I0519 12:50:18.233125 29230 krpc-data-stream-mgr.cc:337] cancelling active
streams for fragment_instance_id=b845c723832e5033:ee960ede00000005
I0519 12:50:18.233408 25363 coordinator.cc:884] CancelBackends()
query_id=b845c723832e5033:ee960ede00000000, tried to cancel 2 backends
I0519 12:50:18.233425 25363 coordinator.cc:789] Coordinator waiting for
backends to finish, 2 remaining. query_id=b845c723832e5033:ee960ede00000000
{code}
The executor logs for this fragment weren't very helpful.
Since IMPALA-6984 was recently merged, it could explain why this test started
becoming more flaky recently.
I think the fix for the test should be pretty straightforward, I'll work on a
patch.
> Flaky test: test_global_exchange_counters
> -----------------------------------------
>
> Key: IMPALA-9755
> URL: https://issues.apache.org/jira/browse/IMPALA-9755
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Reporter: Tim Armstrong
> Assignee: Sahil Takiar
> Priority: Blocker
> Labels: flaky
>
> {noformat}
> query_test.test_observability.TestObservability.test_global_exchange_counters
> (from pytest)
> Failing for the past 1 build (Since Failed#10637 )
> Took 22 sec.
> add description
> Error Message
> query_test/test_observability.py:504: in test_global_exchange_counters
> assert m, "Cannot match pattern for key %s in line '%s'" % (key, line) E
> AssertionError: Cannot match pattern for key TotalBytesSent in line '
> - TotalBytesSent: 0' E assert None
> Stacktrace
> query_test/test_observability.py:504: in test_global_exchange_counters
> assert m, "Cannot match pattern for key %s in line '%s'" % (key, line)
> E AssertionError: Cannot match pattern for key TotalBytesSent in line '
> - TotalBytesSent: 0'
> E assert None
> {noformat}
> https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/10637/testReport/junit/query_test.test_observability/TestObservability/test_global_exchange_counters/
> Filing in case it reoccurs.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]