[ 
https://issues.apache.org/jira/browse/FLINK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337141#comment-17337141
 ] 

Till Rohrmann commented on FLINK-10644:
---------------------------------------

Thanks a lot for sharing the details of the speculative execution feature 
[~wangwj]. It is really cool that you got it working, and it is nice to see 
that it improves the overall batch execution performance.

Which version of Flink is closest to the Blink version you built this 
feature on? I am curious to learn how far the code bases have diverged. 
You said that there is some multithreading in the {{ExecutionGraph}}. We 
actually removed all concurrency from this component a couple of versions 
ago.

How does speculative execution interact with other sinks? Does it only work 
for file-based sinks?

How does the blacklisting mechanism work? Does it also work for the K8s and 
Mesos integrations, or only for the Yarn integration?

To what extent is the change encapsulated by the {{SchedulerNG}} interface? 
If it is more or less self-contained, then one could think about adding the 
speculative scheduler as a new scheduler option.

I think this is a cool feature, and the next step could be to better 
understand how vanilla Flink differs from the Blink version you used. 
Moreover, if we decide to contribute the feature back, then we need a FLIP 
and a vote on it. As for contributing it back, we have to see what the 
current plans of the community are. We are about to start the 1.14 release 
and many community members have already decided what they want to work on.

> Batch Job: Speculative execution
> --------------------------------
>
>                 Key: FLINK-10644
>                 URL: https://issues.apache.org/jira/browse/FLINK-10644
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: JIN SUN
>            Assignee: BoWang
>            Priority: Major
>              Labels: stale-assigned
>
> Stragglers/outliers are tasks that run slower than most of the tasks in a 
> batch job. This impacts job latency, as the straggler will usually be on 
> the critical path of the job and become the bottleneck.
> Tasks may be slow for various reasons, including hardware degradation, 
> software misconfiguration, or noisy neighbors. It is hard for the JM to 
> predict the runtime.
> To reduce the overhead of stragglers, other systems such as Hadoop/Tez and 
> Spark have *_speculative execution_*. Speculative execution is a 
> health-check procedure that checks for tasks to be speculated, i.e. tasks 
> running slower in an ExecutionJobVertex than the median of all successfully 
> completed tasks in that EJV. Such slow tasks are re-submitted to another TM. 
> The slow task is not stopped; a new copy runs in parallel, and the others 
> are killed as soon as one of them completes.
> This JIRA is an umbrella to apply this kind of idea in Flink. Details will 
> be appended later.
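
The straggler check described in the issue above could be sketched roughly 
as follows. This is only an illustration under stated assumptions, not Flink 
or Blink code: the class {{StragglerDetector}}, its methods, and the 
{{slownessFactor}} parameter are hypothetical names. The issue only says 
"slower than the median", so the multiplier is an added assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Flink. */
final class StragglerDetector {

    /** Median runtime (ms) of the tasks that already finished successfully. */
    static double median(long[] finishedRuntimesMs) {
        long[] sorted = finishedRuntimesMs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /**
     * Returns the indices of running tasks whose elapsed time exceeds
     * median(finished) * slownessFactor. These would be the candidates for
     * launching a speculative copy on another TM.
     */
    static List<Integer> findStragglers(
            long[] finishedRuntimesMs, long[] runningElapsedMs, double slownessFactor) {
        List<Integer> stragglers = new ArrayList<>();
        if (finishedRuntimesMs.length == 0) {
            // Without any finished task there is no baseline to compare against.
            return stragglers;
        }
        double thresholdMs = median(finishedRuntimesMs) * slownessFactor;
        for (int i = 0; i < runningElapsedMs.length; i++) {
            if (runningElapsedMs[i] > thresholdMs) {
                stragglers.add(i);
            }
        }
        return stragglers;
    }
}
```

A scheduler would call such a check periodically per ExecutionJobVertex, 
start a second attempt for each returned index, and cancel the remaining 
attempts once one of them finishes.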



