Github user uce commented on the pull request:
https://github.com/apache/flink/pull/471#issuecomment-78479167
The root "cause" of all asynchronous operations is that TCP connections are
shared among multiple logical channels, which are handled by a fixed number of
network I/O threads. In case of synchronous I/O operations, we would
essentially block progress on all channels sharing that connection/thread.
> When do you issue the read requests to the reader (from disk)? Is that
dependent on when the TCP channel is writable?
Yes, the network I/O thread has subpartitions queued for transfer and only
queries them for data when the TCP channel is writable.
> When the read request is issued, before the response comes, if the
subpartition de-registered from netty and the re-registered one a buffer has
returned from disk?
Exactly. If there is no buffer available, the read request is issued and
the next available subpartition is tried. If none of the subpartitions has data
available, the network I/O thread works on another TCP channel (this is done by
Netty, which multiplexes all TCP channels over a fixed amount of network I/O
threads).
> Given many spilled partitions, which one is read from next? How is the
buffer assignment realized? There is a lot of trickyness in there, because disk
I/O performs well with longer sequential reads, but that may occupy many
buffers that are missing for other reads into writable TCP channels.
Initially this depends on the order of partition requests. After that on
the order of data availability.
Regarding the buffers: trickyness, indeed. The current state with the
buffers is kind of an intermediate solution as we will issue zero-transfer
reads in the future (requires minimal changes), where we essentially only
trigger reads to gather offsets. The reads are then only affected by TCP
channel writability. Currently, the reads are batched in sizes of two buffers
(64k).
----
Regarding @tillrohrmann's changes: what was this exactly? Then I can verify
that the changes are not undone.
In general (minus the question regarding Till's changes) I think this PR is
good to merge. The tests are stable and passing. There will be definitely a
need to do refactorings and performance evaluations, but I think that is to be
expected with such a big change.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---