[ 
https://issues.apache.org/jira/browse/FLINK-32751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751105#comment-17751105
 ] 

Matthias Pohl commented on FLINK-32751:
---------------------------------------

The actual test failure happened in 
{{SortDistinctAggregateITCase#testMultiDistinctAggOnDifferentColumn}}, which 
inherits the test from {{DistinctAggregateITCaseBase}}.

FYI: This can be determined from the surefire output, which reports that 
{{HashDistinctAggregateITCase}} completed but never reports a result for 
{{SortDistinctAggregateITCase}}.
{code}
[...]
Aug 04 02:12:29 02:12:29.073 [INFO] Running org.apache.flink.table.planner.runtime.batch.sql.agg.SortDistinctAggregateITCase
[...]
Aug 04 02:19:04 02:19:04.720 [INFO] Running org.apache.flink.table.planner.runtime.batch.sql.agg.HashDistinctAggregateITCase
Aug 04 02:20:38 02:20:38.255 [INFO] Tests run: 23, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 93.527 s - in org.apache.flink.table.planner.runtime.batch.sql.agg.HashDistinctAggregateITCase
[...]
{code}
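
For anyone who wants to automate this check on a saved build log, here is a minimal sketch. It is not part of the Flink or surefire tooling; the function name {{unfinished_test_classes}} is made up for illustration, and it assumes the log uses the "Running <class>" / "Tests run: ... - in <class>" line shapes shown above. It lists every class that has a "Running" line but no matching summary line.

```python
import re

# Assumed line shapes, taken from the surefire excerpt above:
#   "[INFO] Running <class>"          -> test class started
#   "Tests run: ... - in <class>"     -> test class finished
# \s+ also tolerates log lines that were wrapped by a mail client.
RUNNING = re.compile(r"\[INFO\] Running\s+([\w.$]+)")
FINISHED = re.compile(r"Tests run:[^\[]*?-\s+in\s+([\w.$]+)")


def unfinished_test_classes(log_text):
    """Return test classes that started but never printed a summary line."""
    started = RUNNING.findall(log_text)
    finished = set(FINISHED.findall(log_text))
    return [name for name in started if name not in finished]


SAMPLE = """\
Aug 04 02:12:29 02:12:29.073 [INFO] Running org.apache.flink.table.planner.runtime.batch.sql.agg.SortDistinctAggregateITCase
Aug 04 02:19:04 02:19:04.720 [INFO] Running org.apache.flink.table.planner.runtime.batch.sql.agg.HashDistinctAggregateITCase
Aug 04 02:20:38 02:20:38.255 [INFO] Tests run: 23, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 93.527 s - in org.apache.flink.table.planner.runtime.batch.sql.agg.HashDistinctAggregateITCase
"""

if __name__ == "__main__":
    for name in unfinished_test_classes(SAMPLE):
        print(name)
```

On the excerpt above this reports only {{SortDistinctAggregateITCase}}. With parallel forks, several classes can be unfinished at the time of a hang, so the result is a candidate list, not a verdict.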

The issue we're seeing seems to be independent of the actual test, though.

The timeout happens when the {{CollectDynamicSink}} tries to request more data 
through the Dispatcher, which forwards the request. Unfortunately, we don't have 
any logs from the Dispatcher side of that request. Therefore, we cannot 
reliably say where the request got stuck.

[~Sergey Nuyanzin] rightly pointed out that several past Jira issues 
(FLINK-20254, FLINK-22129, FLINK-22181, FLINK-22100) had a similar stacktrace. 
These Jiras were handled as duplicates of FLINK-21996, which was a bug in the 
RPC layer where messages were swallowed. In the end, it's strange that the 
{{MiniCluster}} shutdown didn't cause the request to be completed in some way.

> DistinctAggregateITCaseBase.testMultiDistinctAggOnDifferentColumn got stuck 
> on AZP
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-32751
>                 URL: https://issues.apache.org/jira/browse/FLINK-32751
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / API
>    Affects Versions: 1.18.0
>            Reporter: Sergey Nuyanzin
>            Priority: Critical
>              Labels: test-stability
>
> This build hangs 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=51955&view=logs&j=ce3801ad-3bd5-5f06-d165-34d37e757d90&t=5e4d9387-1dcc-5885-a901-90469b7e6d2f&l=14399
> {noformat}
> Aug 04 03:03:47 "ForkJoinPool-1-worker-51" #28 daemon prio=5 os_prio=0 
> cpu=49342.66ms elapsed=3079.49s tid=0x00007f67ccdd0000 nid=0x5234 waiting on 
> condition  [0x00007f6791a19000]
> Aug 04 03:03:47    java.lang.Thread.State: WAITING (parking)
> Aug 04 03:03:47       at 
> jdk.internal.misc.Unsafe.park(java.base@11.0.19/Native Method)
> Aug 04 03:03:47       - parking to wait for  <0x00000000ad3b1fb8> (a 
> java.util.concurrent.CompletableFuture$Signaller)
> Aug 04 03:03:47       at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.19/LockSupport.java:194)
> Aug 04 03:03:47       at 
> java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.19/CompletableFuture.java:1796)
> Aug 04 03:03:47       at 
> java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.19/ForkJoinPool.java:3118)
> Aug 04 03:03:47       at 
> java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.19/CompletableFuture.java:1823)
> Aug 04 03:03:47       at 
> java.util.concurrent.CompletableFuture.get(java.base@11.0.19/CompletableFuture.java:1998)
> Aug 04 03:03:47       at 
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.sendRequest(CollectResultFetcher.java:171)
> Aug 04 03:03:47       at 
> org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:129)
> Aug 04 03:03:47       at 
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106)
> Aug 04 03:03:47       at 
> org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
> Aug 04 03:03:47       at 
> org.apache.flink.table.planner.connectors.CollectDynamicSink$CloseableRowIteratorWrapper.hasNext(CollectDynamicSink.java:222)
> Aug 04 03:03:47       at 
> java.util.Iterator.forEachRemaining(java.base@11.0.19/Iterator.java:132)
> Aug 04 03:03:47       at 
> org.apache.flink.util.CollectionUtil.iteratorToList(CollectionUtil.java:122)
> Aug 04 03:03:47       at 
> org.apache.flink.table.planner.runtime.utils.BatchTestBase.executeQuery(BatchTestBase.scala:309)
> Aug 04 03:03:47       at 
> org.apache.flink.table.planner.runtime.utils.BatchTestBase.check(BatchTestBase.scala:145)
> Aug 04 03:03:47       at 
> org.apache.flink.table.planner.runtime.utils.BatchTestBase.checkResult(BatchTestBase.scala:109)
> Aug 04 03:03:47       at 
> org.apache.flink.table.planner.runtime.batch.sql.agg.DistinctAggregateITCaseBase.testMultiDistinctAggOnDifferentColumn(DistinctAggregateITCaseBase.scala:97)
> ~~
> {noformat}
> This is very likely an old issue. A similar case was reported for 1.11.0 
> (FLINK-16923) and closed for lack of occurrences. Another similar one, 
> FLINK-22100, was marked as a duplicate of FLINK-21996.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
