[
https://issues.apache.org/jira/browse/FLINK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226640#comment-17226640
]
Roman Khachatryan commented on FLINK-19964:
-------------------------------------------
I assumed at first that it's caused by my recent addition of waiting for
EndOfChannelState event.
But git bisect gave a rather old commit:
{code:java}
e17dbab24f4f71c5472d27267e938791686e45c3 is the first bad commit
commit e17dbab24f4f71c5472d27267e938791686e45c3
Author: Arvid Heise <[email protected]>
Date: Fri Sep 25 14:39:17 2020 +0200
[FLINK-16972][network] LocalBufferPool eagerly fetches global segments to
ensure proper availability.
Before this commit, availability of LocalBufferPool depended on the
availability of a shared NetworkBufferPool. However, if multiple
LocalBufferPools are simultaneously available only because the
NetworkBufferPool becomes available with one segment, only one of the
LocalBufferPools is truly available (the one that actually acquires this
segment).
The solution in this commit is to define availability only through the
guaranteed ability to provide a memory segment to the consumer. If a
LocalBufferPool runs out of local segments it will become unavailable until it
receives a segment from the NetworkBufferPool. To minimize unavailability,
LocalBufferPool first tries to eagerly fetch new segments before declaring
unavailability; if that fails, the local pool subscribes to the availability
of the network pool to restore availability as soon as possible.
Additionally, LocalBufferPool would switch to unavailable only after it
could not serve a requested memory segment. For requestBufferBuilderBlocking
that is too late as it entered the blocking loop already.
Finally, LocalBufferPool now permanently holds at least one buffer. To
reflect that, the number of required segments needs to be at least one, which
matches all usages in production code. A few tests needed to be adjusted to
properly capture the new requirement.
:040000 040000 1331ab5652c4bfbdbed02576f4e57a87ccaa1170
4f767acd0262ba07eefbdee6b8bd717ef1957765 M flink-runtime
{code}
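The availability rule described in the commit message can be sketched as a simplified model (illustrative only; `SharedPool` and `LocalPool` are hypothetical stand-ins, not Flink's actual NetworkBufferPool/LocalBufferPool implementation): a local pool reports available only if it already holds a segment it can guarantee to the consumer, and it eagerly refills from the shared pool before declaring itself unavailable.
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the shared NetworkBufferPool.
class SharedPool {
    private int freeSegments;

    SharedPool(int freeSegments) { this.freeSegments = freeSegments; }

    // Hands out one segment, or reports exhaustion.
    synchronized boolean tryTake() {
        if (freeSegments == 0) return false;
        freeSegments--;
        return true;
    }
}

// Hypothetical stand-in for a LocalBufferPool.
class LocalPool {
    private final SharedPool shared;
    private final Deque<Object> segments = new ArrayDeque<>();

    LocalPool(SharedPool shared) { this.shared = shared; }

    // Availability is defined only by locally held segments, so two local
    // pools can never both report "available" on the strength of a single
    // shared segment.
    boolean isAvailable() { return !segments.isEmpty(); }

    Object request() {
        if (segments.isEmpty() && shared.tryTake()) {
            segments.push(new Object()); // eager fetch before giving up
        }
        return segments.poll(); // null => caller must wait for availability
    }
}

public class Demo {
    public static void main(String[] args) {
        SharedPool shared = new SharedPool(1); // one free global segment
        LocalPool a = new LocalPool(shared);
        LocalPool b = new LocalPool(shared);

        System.out.println(a.request() != null);  // a eagerly grabbed it: true
        System.out.println(a.isAvailable());      // a holds no spare: false
        System.out.println(b.request() == null);  // shared pool empty: true
    }
}
{code}
With the old semantics, both {{a}} and {{b}} would have looked available as long as the shared pool held one segment, even though only one of them could actually obtain it.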
Reverting it on master solves the problem.
The failure itself doesn't happen every time; adding logging or running in
debug mode also prevents it.
Given that, and that it's quite old, I'd lower the priority. WDYT [~rmetzger],
[~pnowojski]?
> Gelly ITCase stuck on Azure in HITSITCase.testPrintWithRMatGraph
> ----------------------------------------------------------------
>
> Key: FLINK-19964
> URL: https://issues.apache.org/jira/browse/FLINK-19964
> Project: Flink
> Issue Type: Bug
> Components: Library / Graph Processing (Gelly), Runtime / Network,
> Tests
> Affects Versions: 1.12.0
> Reporter: Chesnay Schepler
> Assignee: Roman Khachatryan
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.12.0
>
>
> The HITSITCase has gotten stuck on Azure. Chances are that something in the
> scheduling or network has broken it.
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8919&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5
--
This message was sent by Atlassian Jira
(v8.3.4#803005)