Hi,

Flink starts four tasks and then lazily assigns input splits to these tasks
with a locality preference, so each task may consume more than one split.
This is different from Hadoop MapReduce or Spark, which schedule a new task
for each input split.
In your case, the four tasks would be scheduled on four of the 40 machines
and most of the splits would be read remotely.

Best, Fabian


2016-04-26 16:59 GMT+02:00 CPC <acha...@gmail.com>:

> Hi,
>
> I looked at some scheduler documentation but could not find an answer to
> my question. My question is: suppose I have a big file on a 40-node Hadoop
> cluster, and since it is a big file, every node has at least one chunk of
> the file. If I write a Flink job to filter the file, and the job has a
> parallelism of 4 (less than 40), how does data locality work? Do some
> tasks read some chunks from remote nodes? Or does the scheduler keep the
> maximum parallelism at 4 but schedule tasks on every node?
>
> Regards
>
