Hi,
Thank you very much for driving this FLIP to improve usability.

I understand that a key goal of this FLIP is to bring the memory
requirements of shuffle into a more reasonable range. With this adaptive
adjustment, memory efficiency can be improved while preserving
performance, thereby improving the user experience.

I have no problem with this goal, but I have a concern about the means of
implementation: should we introduce a _new_ non-orthogonal option
(`taskmanager.memory.network.required-buffer-per-gate.max`)? That is, the
option would affect both streaming and batch shuffle behavior at the same
time.

From the description in the FLIP, we can see that we do not want this value
to be the same in streaming and batch scenarios. Yet we still let the user
configure this parameter, and once it is configured, streaming and batch
shuffle will use the same value. In theory, a configuration could meet the
requirements of batch shuffle while hurting the performance of streaming
shuffle (for example, lowering the value to reduce memory overhead in batch
scenarios would also lower it for streaming). In other words, do we really
want to add a new option that exposes this risk?
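To make the coupling concrete, here is a minimal Python model (not Flink code) of how one shared option interacts with mode-specific defaults. The function name and both default values are hypothetical placeholders chosen for illustration; they are not taken from the FLIP or from Flink.

```python
from typing import Optional

# Hypothetical placeholder defaults: the FLIP intends different values for
# streaming and batch, but these numbers are illustrative only.
DEFAULT_MAX_BUFFERS_PER_GATE = {"streaming": 2**31 - 1, "batch": 1000}


def shared_option_limit(mode: str, user_value: Optional[int]) -> int:
    """Resolve the per-gate buffer limit when ONE option covers both modes.

    If the user sets the option, that single value wins for streaming and
    batch alike, so the mode-specific defaults are no longer distinguished.
    """
    if user_value is not None:
        return user_value  # same value regardless of mode
    return DEFAULT_MAX_BUFFERS_PER_GATE[mode]
```

With this model, setting the option to 8 to save memory in batch jobs also caps streaming gates at 8, which is exactly the risk described above.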

Personally, I think there might be two ways:
    1. Modify the current implementation of streaming shuffle so that its
       performance does not regress. This way, the option would not couple
       streaming shuffle and batch shuffle, which also avoids confusing
       users. But I am not sure how to do it. :-)
    2. Introduce a pure batch read option, similar to the one introduced on
       the batch write side.
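For comparison, here is a minimal Python model (again not Flink code) of way 2, assuming a batch-only read option: the user-supplied value is consulted only for batch reads, so streaming keeps its own default. The function name and default values are hypothetical placeholders, not from the FLIP.

```python
from typing import Optional

# Hypothetical placeholder defaults; illustrative only, not from the FLIP.
DEFAULT_MAX_BUFFERS_PER_GATE = {"streaming": 2**31 - 1, "batch": 1000}


def batch_only_option_limit(mode: str, batch_read_value: Optional[int]) -> int:
    """Resolve the per-gate buffer limit with a batch-only read option.

    The user value applies only to batch jobs, so tuning batch shuffle
    memory cannot change streaming shuffle behavior.
    """
    if mode == "batch" and batch_read_value is not None:
        return batch_read_value
    return DEFAULT_MAX_BUFFERS_PER_GATE[mode]
```

Here lowering the batch value to 8 leaves the streaming limit at its default, so the two shuffle modes stay orthogonal.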

BTW: It's better not to expose more implementation-related concepts to
users. For example, "gate" is tied to the internal implementation.
Relatively speaking, `shuffle.read`/`shuffle.client.read` may be more
general, and it also avoids coupling the option name to the topology
structure and scheduling units.

Best,
Guowei


On Fri, Dec 23, 2022 at 2:57 PM Lijie Wang <wangdachui9...@gmail.com> wrote:

> Hi,
>
> Thanks for driving this FLIP, +1 for the proposed changes.
>
> Limiting the maximum shuffle read memory is very useful when using the
> adaptive batch scheduler. Currently, the adaptive batch scheduler may place
> a large number of input channels on a single TM, so we generally recommend
> that users configure "taskmanager.network.memory.buffers-per-channel: 0" to
> decrease the possibility of an "Insufficient number of network buffers"
> error. After this FLIP, users will no longer need to configure
> "taskmanager.network.memory.buffers-per-channel".
>
> So +1 from my side.
>
> Best,
> Lijie
>
> On Tue, Dec 20, 2022 at 10:04 Xintong Song <tonysong...@gmail.com> wrote:
>
> > Thanks for the proposal, Yuxin.
> >
> > +1 for the proposed changes. I think these are indeed helpful usability
> > improvements.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Mon, Dec 19, 2022 at 3:36 PM Yuxin Tan <tanyuxinw...@gmail.com> wrote:
> >
> > > Hi, devs,
> > >
> > > I'd like to start a discussion about FLIP-266: Simplify network memory
> > > configurations for TaskManager[1].
> > >
> > > When using Flink, users may encounter the following issues that affect
> > > usability:
> > > 1. The job may fail with an "Insufficient number of network buffers"
> > > exception.
> > > 2. Adjusting the Flink network memory size is complex.
> > > When encountering these issues, users can solve some of them by adding or
> > > adjusting parameters. However, multiple memory config options must be
> > > changed, and adjusting them requires understanding the detailed internal
> > > implementation, which is impractical for most users.
> > >
> > > To simplify network memory configurations for TaskManager and improve
> > > Flink usability, this FLIP proposes some optimizations for these issues.
> > >
> > > Looking forward to your feedback.
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-266%3A+Simplify+network+memory+configurations+for+TaskManager
> > >
> > > Best regards,
> > > Yuxin
> > >
> >
>
