Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-25 Thread Zhu Zhu
We will then keep the decision not to support customized restart
strategies in Flink 1.10.

Thanks, Steven, for the input!

Thanks,
Zhu Zhu

On Thu, Sep 26, 2019 at 12:13 AM Steven Wu wrote:

> Zhu Zhu, that is correct.

Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-25 Thread Steven Wu
Zhu Zhu, that is correct.

On Tue, Sep 24, 2019 at 8:04 PM Zhu Zhu  wrote:

> Hi Steven,
>
> To conclude: since we will have a meter metric[1] for restarts, a
> customized restart strategy is not needed in your case.
> Is that right?
>
> [1] https://issues.apache.org/jira/browse/FLINK-14164
>
> Thanks,
> Zhu Zhu

Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-24 Thread Zhu Zhu
Hi Steven,

To conclude: since we will have a meter metric[1] for restarts, a
customized restart strategy is not needed in your case.
Is that right?

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

On Wed, Sep 25, 2019 at 2:30 AM Steven Wu wrote:

> Zhu Zhu,
>
> Sorry, I was using different terminology. Yes, the Flink meter is what I
> was talking about regarding "fullRestarts" for threshold-based alerting.

Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-24 Thread Steven Wu
Zhu Zhu,

Sorry, I was using different terminology. Yes, the Flink meter is what I
was talking about regarding "fullRestarts" for threshold-based alerting.

On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu  wrote:

> Steven,
>
> As I understand it, a Flink counter only stores its accumulated count and
> reports that value. Are you using an external counter directly?
> Maybe Flink Meter/MeterView is what you need? It stores the count and
> calculates the rate, and it reports its "count" as well as its "rate" to
> external metric services.
>
> The counter "task_failures" only works if the individual failover strategy
> is enabled. However, it is not a public interface and its use is not
> recommended, as fine-grained recovery (region failover) now supersedes it.
> I've opened a ticket[1] to add a metric that shows failovers and respects
> fine-grained recovery.
>
> [1] https://issues.apache.org/jira/browse/FLINK-14164
>
> Thanks,
> Zhu Zhu

Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-23 Thread Zhu Zhu
Steven,

As I understand it, a Flink counter only stores its accumulated count and
reports that value. Are you using an external counter directly?
Maybe Flink Meter/MeterView is what you need? It stores the count and
calculates the rate. And it will report its "count" as well as "rate" to
external metric services.
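
A minimal Java sketch of the count-plus-rate idea described above. It is not Flink's MeterView implementation: the class name SimpleMeterSketch, the once-per-second tick() call, and the 60-second time span are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a meter: keeps a running count and derives a per-second rate. */
final class SimpleMeterSketch {
    private final int timeSpanSeconds;
    // One snapshot of the running count per tick; the oldest snapshot marks the window start.
    private final Deque<Long> snapshots = new ArrayDeque<>();
    private long count;

    SimpleMeterSketch(int timeSpanSeconds) {
        this.timeSpanSeconds = timeSpanSeconds;
        snapshots.addLast(0L);
    }

    /** Records one event, for example one restart. */
    void markEvent() {
        count++;
    }

    /** Assumed to be called once per second by some scheduler. */
    void tick() {
        snapshots.addLast(count);
        while (snapshots.size() > timeSpanSeconds + 1) {
            snapshots.removeFirst();
        }
    }

    /** The accumulated count never decreases. */
    long getCount() {
        return count;
    }

    /** Events per second over the time span; falls back to zero once events stop. */
    double getRate() {
        return (double) (snapshots.peekLast() - snapshots.peekFirst()) / timeSpanSeconds;
    }

    public static void main(String[] args) {
        SimpleMeterSketch restarts = new SimpleMeterSketch(60);
        restarts.markEvent();
        restarts.markEvent();
        restarts.tick();
        System.out.println("count=" + restarts.getCount() + " rate=" + restarts.getRate());
    }
}
```

The point of the sketch is that the rate drops back to zero when no new events arrive, while the count stays where it is, which is what makes the rate usable in a threshold alert.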

The counter "task_failures" only works if the individual failover strategy
is enabled. However, it is not a public interface and its use is not
recommended, as fine-grained recovery (region failover) now supersedes it.
I've opened a ticket[1] to add a metric that shows failovers and respects
fine-grained recovery.

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

On Tue, Sep 24, 2019 at 6:41 AM Steven Wu wrote:

>
> When we set up an alert like "fullRestarts > 1" for some rolling window, we
> want to use a counter. If it is a Gauge, "fullRestarts" will never go below 1
> after the first full restart, so the alert condition will always be true
> after the first job restart. If we can apply a derivative to the Gauge value,
> I guess the alert can probably work. I can explore whether that is an option.
>
> Yeah. Understood that "fullRestart" won't increment when fine grained
> recovery happened. I think "task_failures" counter already exists in Flink.

Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-23 Thread Steven Wu
When we set up an alert like "fullRestarts > 1" for some rolling window, we
want to use a counter. If it is a Gauge, "fullRestarts" will never go below 1
after the first full restart, so the alert condition will always be true
after the first job restart. If we can apply a derivative to the Gauge value,
I guess the alert can probably work. I can explore whether that is an option.

Yeah. Understood that "fullRestart" won't increment when fine grained
recovery happened. I think "task_failures" counter already exists in Flink.
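
A minimal Java sketch of the alerting concern raised in this message, assuming made-up per-minute readings of a cumulative restart value. The class RestartAlertSketch and both method names are hypothetical; this only illustrates the difference between thresholding the raw cumulative value and thresholding its increase over a rolling window, and is not how any particular alerting system evaluates rules.

```java
import java.util.Arrays;
import java.util.List;

final class RestartAlertSketch {

    /** Rule on the raw cumulative value: once the threshold is passed, it stays true. */
    static boolean alertOnRawValue(List<Long> samples, long threshold) {
        return samples.get(samples.size() - 1) > threshold;
    }

    /** Rule on the increase across the last `window` samples (a discrete derivative). */
    static boolean alertOnWindowIncrease(List<Long> samples, int window, long threshold) {
        int last = samples.size() - 1;
        int first = Math.max(0, last - window);
        return samples.get(last) - samples.get(first) > threshold;
    }

    public static void main(String[] args) {
        // Made-up cumulative readings, one per minute: two restarts early on, then quiet.
        List<Long> readings = Arrays.asList(0L, 1L, 2L, 2L, 2L, 2L, 2L, 2L);
        System.out.println("raw value rule:       " + alertOnRawValue(readings, 1));
        System.out.println("window increase rule: " + alertOnWindowIncrease(readings, 3, 1));
    }
}
```

With these readings the raw-value rule keeps firing long after the restarts stopped, while the windowed rule clears once the window no longer contains an increase.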



On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu  wrote:

> Steven,
>
> Thanks for the information. If we can determine that this is a common
> issue, we can solve it in Flink core.
> To get to that state, I have two questions which need your help:
> 1. Why is a gauge not good for alerting? The metric "fullRestart" is a
> Gauge. Does the metric reporter you use report Counter and
> Gauge to external services in different ways? Or can anything else
> differ due to the metric type?
> 2. Is the "number of restarts" what you actually need, rather than
> the "fullRestart" count? If so, I believe we will have such a counter
> metric in 1.10, since the previous "fullRestart" metric value is not the
> number of restarts when fine-grained recovery (a feature added in 1.9.0)
> is enabled.
> "fullRestart" reveals how many times the entire job graph has been
> restarted. If fine-grained recovery is enabled, the graph would not be
> restarted when task failures happen, and the "fullRestart" value will
> not increment in such cases.
>
> I'd appreciate it if you could help with these questions so that we can
> make better decisions for Flink.
>
> Thanks,
> Zhu Zhu

Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-21 Thread Steven Wu
Zhu Zhu,

Flink's fullRestart metric is a Gauge, which is not well suited to alerting.
We publish an equivalent Counter metric for alerting purposes.

Thanks,
Steven

On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu  wrote:

> Thanks Steven for the feedback!
> Could you share more information about the metrics you add in your
> customized restart strategy?
>
> Thanks,
> Zhu Zhu
>
> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu wrote:
>
>> We do use config like "restart-strategy:
>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>> beyond the ones Flink provides.
>>
>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu  wrote:
>>
>>> Thanks everyone for the input.
>>>
>>> The RestartStrategy customization is not recognized as a public
>>> interface, as it is not explicitly documented.
>>> As the feedback from this survey indicates that it is not used, I'll
>>> conclude that we do not need to support customized RestartStrategy for
>>> the new scheduler in Flink 1.10.
>>>
>>> Other usages are still supported, including all the strategies and
>>> ways of configuring them described in
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>> .
>>>
>>> Feel free to share in this thread if you have any concerns about it.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu wrote:
>>>
>>>> Thanks Oytun for the reply!
>>>>
>>>> Sorry for not having stated it clearly. When saying "customized
>>>> RestartStrategy", we mean that users implement an
>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy* by
>>>> themselves and use it by configuring something like "restart-strategy:
>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>
>>>> The usage of restart strategies you mentioned will keep working with
>>>> the new scheduler.
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez wrote:
>>>>
>>>>> Hi Zhu,
>>>>>
>>>>> We are using custom restart strategy like this:
>>>>>
>>>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
>>>>> Time.minutes(10)));
>>>>>
>>>>>
>>>>> ---
>>>>> Oytun Tez
>>>>>
>>>>> *M O T A W O R D*
>>>>> The World's Fastest Human Translation Platform.
>>>>> oy...@motaword.com — www.motaword.com
>>>>>
>>>>>
>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>
>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>> interacts with restart strategies in a different way. We have to
>>>>>> re-design the interfaces for the new restart strategies (so-called
>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy
>>>>>> implementations will not work any more with the new scheduler.
>>>>>>
>>>>>> We want to know whether we should keep a way to customize
>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>> RestartStrategy implementations can be migrated.
>>>>>>
>>>>>> I'd appreciate it if you could share your status if you are
>>>>>> using a customized RestartStrategy. That will be valuable for us in
>>>>>> making decisions.
>>>>>>
>>>>>> [1]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>


Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-19 Thread Zhu Zhu
Thanks Steven for the feedback!
Could you share more information about the metrics you add in your
customized restart strategy?

Thanks,
Zhu Zhu

On Fri, Sep 20, 2019 at 7:11 AM Steven Wu wrote:

> We do use config like "restart-strategy:
> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
> beyond the ones Flink provides.
>
> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu  wrote:
>
>> Thanks everyone for the input.
>>
>> The RestartStrategy customization is not recognized as a public interface,
>> as it is not explicitly documented.
>> As the feedback from this survey indicates that it is not used, I'll
>> conclude that we do not need to support customized RestartStrategy for the
>> new scheduler in Flink 1.10.
>>
>> Other usages are still supported, including all the strategies and
>> ways of configuring them described in
>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>> .
>>
>> Feel free to share in this thread if you have any concerns about it.
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu wrote:
>>
>>> Thanks Oytun for the reply!
>>>
>>> Sorry for not have stated it clearly. When saying "customized
>>> RestartStrategy", we mean that users implement an
>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy* by
>>> themselves and use it by configuring like "restart-strategy:
>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>
>>> The usage of restart strategies you mentioned will keep working with the
>>> new scheduler.
>>>
>>> Thanks,
>>> Zhu Zhu
>>>
>>> Oytun Tez  于2019年9月12日周四 下午10:05写道:
>>>
 Hi Zhu,

 We are using custom restart strategy like this:

 environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
 Time.minutes(10)));


 ---
 Oytun Tez

 *M O T A W O R D*
 The World's Fastest Human Translation Platform.
 oy...@motaword.com — www.motaword.com


 On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu  wrote:

> Hi everyone,
>
> I wanted to reach out to you and ask how many of you are using a
> customized RestartStrategy[1] in production jobs.
>
> We are currently developing the new Flink scheduler[2] which interacts
> with restart strategies in a different way. We have to re-design the
> interfaces for the new restart strategies (so called
> RestartBackoffTimeStrategy). Existing customized RestartStrategy will not
> work any more with the new scheduler.
>
> We want to know whether we should keep the way
> to customized RestartBackoffTimeStrategy so that existing customized
> RestartStrategy can be migrated.
>
> I'd appreciate if you can share the status if you are using customized
> RestartStrategy. That will be valuable for use to make decisions.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
> [2] https://issues.apache.org/jira/browse/FLINK-10429
>
> Thanks,
> Zhu Zhu
>



Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-19 Thread Steven Wu
We do use a config like "restart-strategy:
org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
beyond the ones Flink provides.



Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-19 Thread Zhu Zhu
Thanks everyone for the input.

The RestartStrategy customization is not recognized as a public interface,
as it is not explicitly documented.
Since the feedback from this survey indicates that it is not used, I'll
conclude that we do not need to support customized RestartStrategy for the
new scheduler in Flink 1.10.

Other usages are still supported, including all the strategies and
configuration methods described at:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies

Feel free to share in this thread if you have any concerns about it.

Thanks,
Zhu Zhu



Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-12 Thread Oytun Tez
Hi Zhu,

We are using a custom restart strategy like this:

environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
Time.minutes(10)));
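
For context, the call above uses one of Flink's built-in strategies rather than a user-implemented RestartStrategy class. As far as I read the `RestartStrategies.failureRateRestart` signature, the three arguments are the maximum number of failures per interval, the interval over which failures are counted, and the delay between restart attempts. The plain-Java sketch below models only the counting rule (at most 2 failures per 1-minute window). `FailureRateSketch` and `recordFailureAndCheck` are hypothetical names made up for illustration, not Flink classes, and the restart delay is not modeled.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of a failure-rate restart rule. NOT Flink's implementation. */
final class FailureRateSketch {
    private final int maxFailuresPerInterval;
    private final Duration failureInterval;
    // Timestamps (in milliseconds) of failures still inside the sliding window.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, Duration failureInterval) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.failureInterval = failureInterval;
    }

    /** Records a failure at nowMillis and reports whether a restart is still allowed. */
    boolean recordFailureAndCheck(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        // Drop failures that have fallen out of the sliding window.
        long windowStart = nowMillis - failureInterval.toMillis();
        while (!failureTimestamps.isEmpty() && failureTimestamps.peekFirst() <= windowStart) {
            failureTimestamps.removeFirst();
        }
        // Restart is allowed while the window holds no more than the configured maximum.
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // Mirrors the first two arguments of the call above: 2 failures per 1 minute.
        FailureRateSketch policy = new FailureRateSketch(2, Duration.ofMinutes(1));
        long[] failureTimesMillis = {0L, 10_000L, 20_000L, 90_000L};
        for (long t : failureTimesMillis) {
            System.out.println("failure at " + t + " ms, restart allowed: "
                    + policy.recordFailureAndCheck(t));
        }
    }
}
```

In this model the third failure within the same minute is the one that stops further restarts, while a failure that arrives after the window has emptied is allowed to restart again.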


---
Oytun Tez

*M O T A W O R D*
The World's Fastest Human Translation Platform.
oy...@motaword.com — www.motaword.com


On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu  wrote:

> Hi everyone,
>
> I wanted to reach out to you and ask how many of you are using a
> customized RestartStrategy[1] in production jobs.
>
> We are currently developing the new Flink scheduler[2], which interacts
> with restart strategies in a different way. We have to re-design the
> interfaces for the new restart strategies (so-called
> RestartBackoffTimeStrategy). Existing customized RestartStrategy
> implementations will no longer work with the new scheduler.
>
> We want to know whether we should keep a way to customize
> RestartBackoffTimeStrategy so that existing customized
> RestartStrategy implementations can be migrated.
>
> I'd appreciate it if you could share your status if you are using a
> customized RestartStrategy. That will be valuable for us in making decisions.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
> [2] https://issues.apache.org/jira/browse/FLINK-10429
>
> Thanks,
> Zhu Zhu
>


Re: [SURVEY] How many people are using customized RestartStrategy(s)

2019-09-12 Thread Zhu Zhu
Thanks Oytun for the reply!

Sorry for not having stated it clearly. By "customized
RestartStrategy", we mean that users implement an
*org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
themselves and use it via a configuration like "restart-strategy:
org.foobar.MyRestartStrategyFactoryFactory".

The usage of restart strategies you mentioned will keep working with the
new scheduler.

Thanks,
Zhu Zhu
