[
https://issues.apache.org/jira/browse/FLINK-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163339#comment-14163339
]
Ufuk Celebi commented on FLINK-1141:
------------------------------------
I don't know whether this applies to a self-join, but in general it is
deadlock-prone to send an intermediate result to more than one consumer in a
pipelined fashion. If that is what happens in the self-join as well, it will be
resolved once we switch to blocking shuffles (finish the intermediate data set,
then start sending) for all intermediate data sets that have more than one
consumer. This will happen very soon (it would be a good test case, actually ;)).
If it is important to you to work with the larger data set right now, we can go
for a quick fix. Otherwise, I would ask you to wait a week.
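
To illustrate the general problem, here is a toy model in plain Java. It is not
Flink code: the class FanOutSketch, the method selfJoin and all parameters are
made up for this sketch. It assumes bounded buffers between producer and
consumer, and it assumes that the join reads one input completely before it
touches the other. That second assumption is only a guess at why the consumer
blocks in this case. A stall is detected with a timeout instead of hanging
forever.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: plain Java threads and bounded queues, no Flink classes.
class FanOutSketch {
    private static final int END = -1; // end-of-stream marker

    // Pushes each value to every given channel before moving on to the next value.
    // Gives up and sets the stalled flag if a channel stays full for stallMillis.
    private static Thread sender(List<Integer> values, List<BlockingQueue<Integer>> channels,
                                 long stallMillis, AtomicBoolean stalled) {
        return new Thread(() -> {
            try {
                for (int v : values) {
                    for (BlockingQueue<Integer> ch : channels) {
                        if (!ch.offer(v, stallMillis, TimeUnit.MILLISECONDS)) {
                            stalled.set(true);
                            return;
                        }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Returns the number of join matches, or -1 if a sender stalled.
    static int selfJoin(int records, int capacity, boolean blocking, long stallMillis) {
        BlockingQueue<Integer> build = new ArrayBlockingQueue<>(capacity);
        BlockingQueue<Integer> probe = new ArrayBlockingQueue<>(capacity);
        AtomicInteger matches = new AtomicInteger();
        AtomicBoolean stalled = new AtomicBoolean();

        // Assumed join behaviour: drain the build side fully, then read the probe side.
        Thread join = new Thread(() -> {
            try {
                Set<Integer> table = new HashSet<>();
                for (int v = build.take(); v != END; v = build.take()) {
                    table.add(v);
                }
                for (int v = probe.take(); v != END; v = probe.take()) {
                    if (table.contains(v)) {
                        matches.incrementAndGet();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // cancelled after a stall
            }
        });
        join.setDaemon(true);
        join.start();

        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < records; i++) {
            values.add(i);
        }
        values.add(END);

        List<Thread> senders = new ArrayList<>();
        if (blocking) {
            // Finished result, served to each consumer by its own independent sender.
            senders.add(sender(values, Arrays.asList(build), stallMillis, stalled));
            senders.add(sender(values, Arrays.asList(probe), stallMillis, stalled));
        } else {
            // Pipelined: one sender feeds both consumers record by record.
            senders.add(sender(values, Arrays.asList(build, probe), stallMillis, stalled));
        }
        try {
            for (Thread t : senders) t.start();
            for (Thread t : senders) t.join();
            if (stalled.get()) {
                join.interrupt();
                return -1;
            }
            join.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return matches.get();
    }
}
```

In this model, selfJoin(1000, 16, false, 300) stalls and returns -1: the probe
buffer fills up while the join is still waiting for the end of the build side,
which the single sender can no longer deliver. selfJoin(1000, 16, true, 300)
completes with 1000 matches, and so does a pipelined run whose input fits into
the buffers, e.g. selfJoin(10, 16, false, 300), which returns 10.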
> Selfjoin fails after DataSet exceeds certain size
> -------------------------------------------------
>
> Key: FLINK-1141
> URL: https://issues.apache.org/jira/browse/FLINK-1141
> Project: Flink
> Issue Type: Bug
> Components: Local Runtime
> Affects Versions: 0.6.1-incubating
> Environment: LocalExecutionEnvironment (dop=4)
> Reporter: Robert Waury
> Priority: Minor
> Attachments: LargeSelfJoin.java
>
>
> As soon as a DataSet exceeds a certain size (1,000,000 tuples in my example), a
> self-join with a FlatJoinFunction no longer works. After about a second, the
> Join, DataSource, and DataSink threads are all in a waiting state and perform
> no work (no output files are created), and the job never finishes.
> If I cut the input size in half, it works fine.
> My current workaround is to create the DataSet twice and join the two
> identical DataSets.