Hi all,

Thanks very much for all of the feedback. It seems that we have reached a
consensus. I will start a vote soon.

Best,
Yingjie

Yun Gao <yungao...@aliyun.com> wrote on Wed, Jan 5, 2022 at 16:08:

> Many thanks @Yingjie for completing the experiments!
>
> Also +1 for changing the default config values. From the experiments,
> changing the default config values would largely improve the out-of-the-box
> experience of Flink batch, so from my side it seems worth changing even if
> it causes some compatibility issues in some scenarios. In addition, if we
> finally have to break compatibility, we might as well do it early to
> avoid affecting more users.
>
> Best,
> Yun
>
> ------------------------------------------------------------------
> From:刘建刚 <liujiangangp...@gmail.com>
> Send Time:2022 Jan. 4 (Tue.) 20:43
> To:user-zh <user...@flink.apache.org>
> Cc:dev <dev@flink.apache.org>; user <u...@flink.apache.org>
> Subject:Re: [DISCUSS] Change some default config values of blocking shuffle
>
> Thanks for the experiment. +1 for the changes.
>
> Yingjie Cao <kevin.ying...@gmail.com> wrote on Tue, Jan 4, 2022 at 17:35:
>
> > Hi all,
> >
> > After running some tests with the proposed default values (
> > taskmanager.network.sort-shuffle.min-parallelism: 1,
> > taskmanager.network.sort-shuffle.min-buffers: 512,
> > taskmanager.memory.framework.off-heap.batch-shuffle.size: 64m,
> > taskmanager.network.blocking-shuffle.compression.enabled: true), I'd like
> > to share some test results.
> >
> > 1. TPC-DS performance and stability test (I ran the TPC-DS benchmark with
> > 512 default parallelism and several different settings, multiple times):
> > 1) Stability:
> > Compared to the current default values, the proposed default values
> > improve the TPC-DS query stability a lot. With the current defaults, many
> > queries suffer from blocking-shuffle-related failures. With the proposed
> > default values, only three queries fail because of the "Insufficient
> > number of network buffers:" error. With 512 parallelism, the current
> > default configuration hits the same issue. Part of the reason is that the
> > network memory consumed by an InputGate is proportional to the parallelism
> > and can reach 32M by default, and many tasks have several InputGates, but
> > we only have 128M of network memory per TaskManager by default.
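> > As a rough back-of-envelope check (assuming the default 32K buffer size,
> > 2 exclusive buffers per channel and 8 floating buffers per gate, which I
> > believe are the defaults), one InputGate reading a 512-parallelism
> > upstream needs roughly:
> >
> >     512 channels * 2 exclusive buffers + 8 floating buffers = 1032 buffers
> >     1032 buffers * 32K per buffer ~= 32M of network memory
> >
> > so a task with only a few such InputGates already exhausts the 128M
> > default network memory of a TaskManager.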
> > 2) Performance:
> > Compared to the current default values, the proposed default values
> > improve the TPC-DS query performance a lot. Except for queries with a
> > small amount of shuffle data, which finish in a very short time anyway,
> > the proposed default values bring a 2-10x performance gain. Regarding the
> > default value of taskmanager.network.sort-shuffle.min-parallelism
> > suggested by Yun, I tested both 1 and 128, and 1 gives better performance,
> > as expected.
> >
> > 2. Flink pre-commit stability test:
> > I have run all Flink tests with the proposed default values more than 20
> > times. The only instability is the "Insufficient number of network
> > buffers:" error in several batch test cases. This error occurs because
> > some tests have very limited network buffers and the proposed default
> > config values may increase the network buffer consumption for these
> > cases. After increasing the total network memory size for these test
> > cases, the issue is solved.
> >
> > Summary:
> > 1. The proposed default values improve both the performance and stability
> > of Flink batch shuffle a lot.
> > 2. Some batch jobs may fail with the "Insufficient number of network
> > buffers:" error, because this default value change slightly increases the
> > network buffer consumption for jobs with parallelism lower than 512 (for
> > jobs with parallelism higher than 512, the network buffer consumption is
> > reduced).
> > 3. Setting taskmanager.network.sort-shuffle.min-parallelism to 1 gives
> > better performance than setting it to 128; both settings may incur the
> > "Insufficient number of network buffers:" error.
> > 4. After changing the default values and fixing several test cases, all
> > Flink tests (except for the known unstable cases) run stably.
> >
> > Personally, I am +1 to make the change. Though the change may cause some
> > batch jobs to fail with the "Insufficient number of network buffers:"
> > error, the probability is small enough (only 3 out of about 100 TPC-DS
> > queries fail, and these queries also fail with the current default
> > configuration, because it is the InputGate which takes most of the network
> > buffers and causes the error). Compared to this small regression, the
> > performance and stability gains are big. Any feedback, especially from
> > Flink batch users, is highly appreciated.
> >
> > BTW, aside from the above tests, I also tried to tune some more config
> > options to make the TPC-DS test faster. I copied these tuned config
> > options from our daily TPC-DS settings. The results show that the tuned
> > configuration can improve the TPC-DS performance by about 30%. Though
> > these settings may not be the best, they really help compared to the
> > default values. I attached some settings in this mail, as I guess some
> > Flink batch users may be interested in them. Based on my limited
> > knowledge, I guess that increasing the total TaskManager memory size and
> > the network memory size is important for performance, because more memory
> > (managed and network) makes operators and shuffle faster.
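> >
> > For illustration only (these are not the exact settings I attached, and
> > the numbers below are hypothetical and need to be sized to your cluster),
> > increasing the TaskManager and network memory might look like:
> >
> >     taskmanager.memory.process.size: 8g        # hypothetical total TaskManager size
> >     taskmanager.memory.network.fraction: 0.2   # give shuffle a larger share of memory
> >     taskmanager.memory.network.max: 2g         # raise the cap on network memory
> >
> > The exact values depend on the available machine memory and the job scale.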
> >
> > Best,
> > Yingjie
> >
> >
> >
> > Yingjie Cao <kevin.ying...@gmail.com> wrote on Wed, Dec 15, 2021 at 12:19:
> >
> >> Hi Till,
> >>
> >> Thanks for the suggestion. I think it makes a lot of sense to also
> >> extend the documentation for the sort shuffle to include a tuning guide.
> >>
> >> Best,
> >> Yingjie
> >>
> >>> Till Rohrmann <trohrm...@apache.org> wrote on Tue, Dec 14, 2021 at 18:57:
> >>
> >>> As part of this FLIP, does it make sense to also extend the
> >>> documentation for the sort shuffle [1] to include a tuning guide? I am
> >>> thinking of a more in-depth description of what things you might observe
> >>> and how to influence them with the configuration options.
> >>>
> >>> [1]
> >>> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/batch/blocking_shuffle/#sort-shuffle
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Tue, Dec 14, 2021 at 8:43 AM Jingsong Li <jingsongl...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Yingjie,
> >>>>
> >>>> Thanks for your explanation. I have no more questions. +1
> >>>>
> >>>> On Tue, Dec 14, 2021 at 3:31 PM Yingjie Cao <kevin.ying...@gmail.com>
> >>>> wrote:
> >>>> >
> >>>> > Hi Jingsong,
> >>>> >
> >>>> > Thanks for your feedback.
> >>>> >
> >>>> > >>> My question is, what is the maximum parallelism a job can have
> >>>> > >>> with the default configuration? (Does this break the
> >>>> > >>> out-of-the-box experience?)
> >>>> >
> >>>> > Yes, you are right, these two options are related to network memory
> >>>> > and framework off-heap memory. Generally, these changes will not break
> >>>> > the out-of-the-box experience, but in some extreme cases, for example,
> >>>> > when there are too many ResultPartitions per task, users may need to
> >>>> > increase the network memory to avoid the "insufficient network buffer"
> >>>> > error. For the framework off-heap memory, I believe users do not need
> >>>> > to change the default value.
> >>>> >
> >>>> > In fact, I have a basic goal when changing these config values: when
> >>>> > running TPC-DS at medium parallelism with the default values, all
> >>>> > queries must pass without any error and, at the same time, the
> >>>> > performance should improve. I think if we achieve this goal, most
> >>>> > common use cases are covered.
> >>>> >
> >>>> > Currently, for the default configuration, the exclusive buffers
> >>>> > required at the input gate side still depend on the parallelism
> >>>> > (though since 1.14 we can decouple the network buffer consumption from
> >>>> > the parallelism by setting a config value, this has a slight
> >>>> > performance impact on streaming jobs), which means that very large
> >>>> > parallelism cannot be supported by the default configuration. Roughly,
> >>>> > I would say the default values can support jobs of several hundred
> >>>> > parallelism.
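> >>>> >
> >>>> > A rough sketch of where such an estimate could come from (assuming
> >>>> > the default 32K buffers, 2 exclusive buffers per channel and 8
> >>>> > floating buffers per gate, which I believe are the defaults):
> >>>> >
> >>>> >     128M default network memory / 32K per buffer = 4096 buffers per TM
> >>>> >     one InputGate at parallelism P needs about 2 * P + 8 buffers
> >>>> >
> >>>> > so with a couple of InputGates and ResultPartitions per task, the
> >>>> > buffers run out somewhere in the high hundreds of parallelism.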
> >>>> >
> >>>> > >>> I do feel that this correspondence is a bit difficult to control
> >>>> > >>> at the moment, and it would be best if a rough table could be
> >>>> > >>> provided.
> >>>> >
> >>>> > I think this is a good suggestion; we can provide such a table in the
> >>>> > documentation.
> >>>> >
> >>>> > Best,
> >>>> > Yingjie
> >>>> >
> >>>> > Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Dec 14, 2021 at 14:39:
> >>>> >>
> >>>> >> Hi Yingjie,
> >>>> >>
> >>>> >> +1 for this FLIP. I'm pretty sure this will greatly improve the ease
> >>>> >> of use of batch jobs.
> >>>> >>
> >>>> >> Looks like "taskmanager.memory.framework.off-heap.batch-shuffle.size"
> >>>> >> and "taskmanager.network.sort-shuffle.min-buffers" are related to
> >>>> >> network memory and framework.off-heap.size.
> >>>> >>
> >>>> >> My question is, what is the maximum parallelism a job can have with
> >>>> >> the default configuration? (Does this break the out-of-the-box
> >>>> >> experience?)
> >>>> >>
> >>>> >> How much network memory and framework.off-heap.size are required for
> >>>> >> how much parallelism in the default configuration?
> >>>> >>
> >>>> >> I do feel that this correspondence is a bit difficult to control at
> >>>> >> the moment, and it would be best if a rough table could be provided.
> >>>> >>
> >>>> >> Best,
> >>>> >> Jingsong
> >>>> >>
> >>>> >> On Tue, Dec 14, 2021 at 2:16 PM Yingjie Cao <kevin.ying...@gmail.com>
> >>>> >> wrote:
> >>>> >> >
> >>>> >> > Hi Jiangang,
> >>>> >> >
> >>>> >> > Thanks for your suggestion.
> >>>> >> >
> >>>> >> > >>> The config can affect the memory usage. Will the related
> >>>> >> > >>> memory configs be changed?
> >>>> >> >
> >>>> >> > I think we will not change the default network memory settings. My
> >>>> >> > best expectation is that the default values can work for most cases
> >>>> >> > (though maybe not optimally) and, for other cases, users may need to
> >>>> >> > tune the memory settings.
> >>>> >> >
> >>>> >> > >>> Can you share the TPC-DS results for different configs?
> >>>> >> > >>> Although we are changing the default values, different users
> >>>> >> > >>> may still need to tune them, and in that case this experience
> >>>> >> > >>> can help a lot.
> >>>> >> >
> >>>> >> > I did not keep all of the previous TPC-DS results, but from what I
> >>>> >> > have, I can tell that on HDD, always using sort-shuffle is a good
> >>>> >> > choice. For small jobs, using sort-shuffle may not bring much
> >>>> >> > performance gain; this may be because all shuffle data can be cached
> >>>> >> > in memory (page cache), which is the case if the cluster has enough
> >>>> >> > resources. However, if the whole cluster is under heavy load or you
> >>>> >> > are running large-scale jobs, the performance of those small jobs
> >>>> >> > can also be affected. For large-scale jobs, the configurations
> >>>> >> > suggested for tuning are taskmanager.network.sort-shuffle.min-buffers
> >>>> >> > and taskmanager.memory.framework.off-heap.batch-shuffle.size; you can
> >>>> >> > increase these values for large-scale batch jobs.
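> >>>> >> >
> >>>> >> > As a rough illustration only (the concrete numbers are hypothetical
> >>>> >> > and depend on the job and cluster), a large-scale batch job might
> >>>> >> > set something like:
> >>>> >> >
> >>>> >> >     taskmanager.network.sort-shuffle.min-buffers: 2048
> >>>> >> >     taskmanager.memory.framework.off-heap.batch-shuffle.size: 256m
> >>>> >> >
> >>>> >> > i.e. a larger in-memory sort buffer and a larger read buffer for the
> >>>> >> > sort-shuffle.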
> >>>> >> >
> >>>> >> > BTW, I am still running TPC-DS tests these days and I can share
> >>>> >> > these results soon.
> >>>> >> >
> >>>> >> > Best,
> >>>> >> > Yingjie
> >>>> >> >
> >>>> >> > 刘建刚 <liujiangangp...@gmail.com> wrote on Fri, Dec 10, 2021 at 18:30:
> >>>> >> >>
> >>>> >> >> Glad to see this suggestion. In our tests, just as in yours, we
> >>>> >> >> found that the changed configs do not improve the performance of
> >>>> >> >> small jobs much. I have some suggestions:
> >>>> >> >>
> >>>> >> >> The config can affect the memory usage. Will the related memory
> >>>> >> >> configs be changed?
> >>>> >> >> Can you share the TPC-DS results for different configs? Although
> >>>> >> >> we are changing the default values, different users may still need
> >>>> >> >> to tune them, and in that case this experience can help a lot.
> >>>> >> >>
> >>>> >> >> Best,
> >>>> >> >> Liu Jiangang
> >>>> >> >>
> >>>> >> >>> Yun Gao <yungao...@aliyun.com.invalid> wrote on Fri, Dec 10, 2021 at 17:20:
> >>>> >> >>>
> >>>> >> >>> Hi Yingjie,
> >>>> >> >>>
> >>>> >> >>> Many thanks for drafting the FLIP and initiating the discussion!
> >>>> >> >>>
> >>>> >> >>> May I double-check one point about
> >>>> >> >>> taskmanager.network.sort-shuffle.min-parallelism: since other
> >>>> >> >>> frameworks like Spark use sort-based shuffle for all cases, is
> >>>> >> >>> our situation still different from theirs?
> >>>> >> >>>
> >>>> >> >>> Best,
> >>>> >> >>> Yun
> >>>> >> >>>
> >>>> >> >>>
> >>>> >> >>>
> >>>> >> >>>
> >>>> >> >>>
> >>>> ------------------------------------------------------------------
> >>>> >> >>> From:Yingjie Cao <kevin.ying...@gmail.com>
> >>>> >> >>> Send Time:2021 Dec. 10 (Fri.) 16:17
> >>>> >> >>> To:dev <dev@flink.apache.org>; user <u...@flink.apache.org>;
> >>>> user-zh <user...@flink.apache.org>
> >>>> >> >>> Subject:Re: [DISCUSS] Change some default config values of
> >>>> blocking shuffle
> >>>> >> >>>
> >>>> >> >>> Hi dev & users:
> >>>> >> >>>
> >>>> >> >>> I have created a FLIP [1] for it; feedback is highly appreciated.
> >>>> >> >>>
> >>>> >> >>> Best,
> >>>> >> >>> Yingjie
> >>>> >> >>>
> >>>> >> >>> [1]
> >>>> >> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-199%3A+Change+some+default+config+values+of+blocking+shuffle+for+better+usability
> >>>> >> >>> Yingjie Cao <kevin.ying...@gmail.com> wrote on Fri, Dec 3, 2021 at 17:02:
> >>>> >> >>>
> >>>> >> >>> Hi dev & users,
> >>>> >> >>>
> >>>> >> >>> We propose to change some default values of blocking shuffle to
> >>>> >> >>> improve the out-of-the-box user experience (streaming is not
> >>>> >> >>> affected). The default values we want to change are as follows:
> >>>> >> >>>
> >>>> >> >>> 1. Data compression
> >>>> >> >>> (taskmanager.network.blocking-shuffle.compression.enabled):
> >>>> >> >>> Currently, the default value is 'false'. Usually, data compression
> >>>> >> >>> reduces both disk and network IO, which is good for performance,
> >>>> >> >>> and at the same time it saves storage space. We propose to change
> >>>> >> >>> the default value to 'true'.
> >>>> >> >>>
> >>>> >> >>> 2. Default shuffle implementation
> >>>> >> >>> (taskmanager.network.sort-shuffle.min-parallelism): Currently, the
> >>>> >> >>> default value is 'Integer.MAX', which means that by default Flink
> >>>> >> >>> jobs always use hash-shuffle. In fact, for high parallelism,
> >>>> >> >>> sort-shuffle is better for both stability and performance. So we
> >>>> >> >>> propose to reduce the default value to a suitably smaller one, for
> >>>> >> >>> example, 128. (We tested 128, 256, 512 and 1024 with TPC-DS and 128
> >>>> >> >>> was the best.)
> >>>> >> >>>
> >>>> >> >>> 3. Read buffer of sort-shuffle
> >>>> >> >>> (taskmanager.memory.framework.off-heap.batch-shuffle.size):
> >>>> >> >>> Currently, the default value is '32M'. Previously, when choosing
> >>>> >> >>> the default value, both '32M' and '64M' were OK in our tests and
> >>>> >> >>> we chose the smaller one to be cautious. However, it was recently
> >>>> >> >>> reported on the mailing list that the default value is not enough,
> >>>> >> >>> which caused a buffer request timeout issue. We have already
> >>>> >> >>> created a ticket to improve the behavior. At the same time, we
> >>>> >> >>> propose to increase this default value to '64M', which can also
> >>>> >> >>> help.
> >>>> >> >>>
> >>>> >> >>> 4. Sort buffer size of sort-shuffle
> >>>> >> >>> (taskmanager.network.sort-shuffle.min-buffers): Currently, the
> >>>> >> >>> default value is '64', which means 64 network buffers (32k per
> >>>> >> >>> buffer by default). This default value is quite modest and the
> >>>> >> >>> performance can suffer. We propose to increase this value to a
> >>>> >> >>> larger one, for example, 512 (the default TM and network buffer
> >>>> >> >>> configuration can serve more than 10 result partitions
> >>>> >> >>> concurrently with this value).
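> >>>> >> >>>
> >>>> >> >>> Put together, and just as a sketch of what is proposed here (the
> >>>> >> >>> final values are what this discussion should settle), the changes
> >>>> >> >>> would correspond to the following flink-conf.yaml entries:
> >>>> >> >>>
> >>>> >> >>>     taskmanager.network.blocking-shuffle.compression.enabled: true
> >>>> >> >>>     taskmanager.network.sort-shuffle.min-parallelism: 128
> >>>> >> >>>     taskmanager.memory.framework.off-heap.batch-shuffle.size: 64m
> >>>> >> >>>     taskmanager.network.sort-shuffle.min-buffers: 512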
> >>>> >> >>>
> >>>> >> >>> We have already tested these default values together with the
> >>>> >> >>> TPC-DS benchmark in a cluster, and both the performance and the
> >>>> >> >>> stability improved a lot. These changes can help improve the
> >>>> >> >>> out-of-the-box experience of blocking shuffle. What do you think
> >>>> >> >>> about these changes? Are there any concerns? If there are no
> >>>> >> >>> objections, I will make these changes soon.
> >>>> >> >>>
> >>>> >> >>> Best,
> >>>> >> >>> Yingjie
> >>>> >> >>>
> >>>> >>
> >>>> >>
> >>>> >> --
> >>>> >> Best, Jingsong Lee
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Best, Jingsong Lee
> >>>>
> >>>
>
>
>
