[ 
https://issues.apache.org/jira/browse/FLINK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Wysakowicz reassigned FLINK-24300:
----------------------------------------

    Assignee: Dawid Wysakowicz

> MultipleInputOperator is running much more slowly in TPCDS
> ----------------------------------------------------------
>
>                 Key: FLINK-24300
>                 URL: https://issues.apache.org/jira/browse/FLINK-24300
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: Zhilong Hong
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>             Fix For: 1.14.0, 1.15.0
>
>         Attachments: 64570e4c56955713ca599fd1d7ae7be891a314c6.png, 
> detail-of-the-job.png, e3010c16947ed8da2ecb7d89a3aa08dacecc524a.png, 
> jstack-2.txt, jstack.txt
>
>
> When running TPCDS with release 1.14, we found that the job with 
> {{MultipleInputOperator}} runs much more slowly than before. A binary search 
> among the commits suggests that the issue may have been introduced by 
> FLINK-23408. 
> At the commit 64570e4c56955713ca599fd1d7ae7be891a314c6, the job in TPCDS runs 
> normally, as the image below illustrates:
> !64570e4c56955713ca599fd1d7ae7be891a314c6.png|width=600!
> At the commit e3010c16947ed8da2ecb7d89a3aa08dacecc524a, the job q2.sql gets 
> stuck for a long time (more than half an hour), as the image below 
> illustrates:
> !e3010c16947ed8da2ecb7d89a3aa08dacecc524a.png|width=600!
> The details of the job are shown below:
> !detail-of-the-job.png|width=600!
> The job uses a {{MultipleInputOperator}} with one normal input and two 
> chained FileSources. It has finished reading the normal input and started to 
> read the chained sources. Each chained source has one source data fetcher.
> We captured a jstack of the stuck tasks and attached the files. From 
> [^jstack.txt] we can see that the main thread is blocked waiting for a lock 
> that is held by a source data fetcher. The source data fetcher is still 
> running, but its stack stays in {{CompletableFuture.cleanStack}}.
> This issue happens in a batch job. However, judging from where it gets 
> blocked, it likely affects streaming jobs as well.
> For reference, the TPCDS code we are running is located at 
> [https://github.com/ververica/flink-sql-benchmark/tree/dev].
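
As background for the jstack observation in the description above, here is a minimal sketch of how callbacks accumulate on a {{CompletableFuture}} that has not completed yet. The description does not say why the fetcher thread stays in {{CompletableFuture.cleanStack}}, so treat this as an assumption-laden illustration rather than a diagnosis: it only shows that every callback registered on a pending future remains on that future's internal dependent stack until the future completes, which is one situation in which stack cleanup has more nodes to walk. The class and method names ({{DependentStackSketch}}, {{dependentsAfterRegistering}}, {{dependentsAfterCompletion}}) are hypothetical and do not come from Flink.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration, not Flink code.
public class DependentStackSketch {

    // Registers n no-op callbacks on a future that is never completed and
    // returns the number of dependents the future reports afterwards.
    static int dependentsAfterRegistering(int n) {
        CompletableFuture<Void> longLived = new CompletableFuture<>();
        for (int i = 0; i < n; i++) {
            // Each call adds one node to the pending future's dependent stack.
            longLived.thenRun(() -> { });
        }
        return longLived.getNumberOfDependents();
    }

    // Same registration, but completes the future before counting.
    // Completion fires the callbacks and drains the dependent stack.
    static int dependentsAfterCompletion(int n) {
        CompletableFuture<Void> longLived = new CompletableFuture<>();
        for (int i = 0; i < n; i++) {
            longLived.thenRun(() -> { });
        }
        longLived.complete(null);
        return longLived.getNumberOfDependents();
    }

    public static void main(String[] args) {
        System.out.println("pending:   " + dependentsAfterRegistering(10_000));
        System.out.println("completed: " + dependentsAfterCompletion(10_000));
    }
}
```

{{getNumberOfDependents()}} is documented as an estimate intended for monitoring, so the printed numbers are only meaningful in a single-threaded run like this one. Whether the job in this report actually follows such a pattern would have to be confirmed from [^jstack.txt] and the code touched by FLINK-23408.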



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
