Bikas Saha created TEZ-2200:
-------------------------------
Summary: Progressive input split creation and grouping
Key: TEZ-2200
URL: https://issues.apache.org/jira/browse/TEZ-2200
Project: Apache Tez
Issue Type: Task
Reporter: Bikas Saha
Assignee: Bikas Saha
Progressive split creation for the initial inputs would be a useful feature in
scenarios such as the following:
1) large inputs that produce lots of splits
2) multi-wave mappers, where stats from the first wave of mappers may be used
to optimize the next wave of mappers
3) starting some mappers on partial data optimistically while waiting for
additional split filters to become available from other vertices, then applying
those split filters to the remaining data in the hope that we may have to read
less. /cc [~gopalv]
4) maintaining locality of splits as the job progresses
5) others???
Progressive split creation would involve creating an initial set of splits (say,
with a good spatial distribution) to start the data read. Then, based on stats
from the initial reads, the next set of splits could use different grouping
heuristics. E.g. if the splits are taking too long to process, reduce the
split size, and vice versa. New splits could be created for new mappers, or for
existing mappers based on their location. E.g. if a mapper is already running
on node A, send it more splits that reside on node A. Other heuristics are
possible.
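To make the two heuristics above concrete, here is a rough sketch, not a
proposed implementation. Every name in it (ProgressiveSplitSketch, Split,
nextGroupBytes, takeLocalGroup) is hypothetical and none of it is an existing
Tez API. It assumes the only stat available is the average task time of the
previous wave, and that each raw split has a single preferred host.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; none of these names are Tez APIs. */
public class ProgressiveSplitSketch {

    /** Hypothetical raw split: a size in bytes and the host that stores it. */
    static final class Split {
        final long bytes;
        final String host;

        Split(long bytes, String host) {
            this.bytes = bytes;
            this.host = host;
        }
    }

    /**
     * Scales the group size by targetMillis / observedMillis and clamps the
     * result to [minBytes, maxBytes]. A slow wave shrinks the next groups,
     * a fast wave grows them.
     */
    static long nextGroupBytes(long currentBytes, long observedMillis,
                               long targetMillis, long minBytes, long maxBytes) {
        if (observedMillis <= 0) {
            return currentBytes; // no stats yet, keep the current size
        }
        double scaled = (double) currentBytes * targetMillis / observedMillis;
        return Math.max(minBytes, Math.min(maxBytes, (long) scaled));
    }

    /**
     * Removes from 'pending' and returns splits stored on 'host', stopping
     * once 'budgetBytes' has been reached (the last split may overshoot it).
     * Splits on other hosts are left in 'pending'.
     */
    static List<Split> takeLocalGroup(String host, List<Split> pending,
                                      long budgetBytes) {
        List<Split> group = new ArrayList<>();
        long used = 0;
        int i = 0;
        while (i < pending.size() && used < budgetBytes) {
            Split s = pending.get(i);
            if (s.host.equals(host)) {
                group.add(pending.remove(i)); // next element shifts into slot i
                used += s.bytes;
            } else {
                i++;
            }
        }
        return group;
    }

    public static void main(String[] args) {
        // First wave took twice the target time, so the group size is halved.
        long next = nextGroupBytes(128L << 20, 120_000, 60_000, 16L << 20, 1L << 30);
        List<Split> pending = new ArrayList<>(Arrays.asList(
                new Split(32L << 20, "nodeA"), new Split(32L << 20, "nodeB"),
                new Split(32L << 20, "nodeA"), new Split(32L << 20, "nodeA")));
        List<Split> group = takeLocalGroup("nodeA", pending, next);
        System.out.println(next + " bytes per group, " + group.size()
                + " local splits taken, " + pending.size() + " left pending");
    }
}
```

A real version would presumably need richer stats (bytes read, records
processed, stragglers) and a fallback to rack-local or remote splits when a
node has no local data left; both are left out here to keep the sketch small.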