Bikas Saha updated TEZ-2200:
----------------------------
Description:
There are scenarios like the following wherein progressive split creation for
the initial inputs would be a useful feature.
1) large inputs that produce lots of splits
2) multi-wave mappers, where stats from the first wave of mappers may be used
to optimize the next wave of mappers
3) starting some mappers on partial data optimistically while waiting for
additional split filters to be available from other vertices. Then applying
those split filters on the remaining data in the hope that we may have to read
less. /cc [~gopalv]
4) maintaining locality of splits as the job progresses
5) handling node failures by re-grouping splits from a failed node onto other
local nodes
6) others???
Progressive split creation would involve creating an initial set of splits (say
with a good spatial distribution) to start the data read. Then, based on stats
from the initial reads, the next set of splits could use different heuristics
for grouping, e.g. if the splits are taking too long to process then reduce
their size, and vice versa. New splits could be created for new mappers or for
existing mappers based on their location, e.g. if a mapper is already running
on node A then send it more splits that are on node A. Other heuristics are
possible.
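
One way to picture the "reduce the size and vice versa" heuristic is a small
sizing helper that scales the next wave's split group size by how far the
previous wave's task times were from a target. This is only a sketch under
stated assumptions: the class name, the target-duration parameter, and the
min/max bounds are hypothetical and are not part of Tez.

```java
import java.util.List;

/**
 * Hypothetical helper (not a Tez API): picks the split group size for the
 * next wave from the task durations observed in the previous wave.
 */
public class AdaptiveSplitSizer {
    private final long minBytes;        // assumed lower bound on group size
    private final long maxBytes;        // assumed upper bound on group size
    private final double targetSeconds; // assumed desired per-task duration

    public AdaptiveSplitSizer(long minBytes, long maxBytes, double targetSeconds) {
        this.minBytes = minBytes;
        this.maxBytes = maxBytes;
        this.targetSeconds = targetSeconds;
    }

    /**
     * Scales the current size by target/mean duration: tasks that ran too
     * long get smaller splits, tasks that finished quickly get larger ones.
     * The result is clamped to [minBytes, maxBytes].
     */
    public long nextGroupSize(long currentBytes, List<Double> observedSeconds) {
        if (observedSeconds.isEmpty()) {
            return currentBytes; // no stats yet, keep the initial size
        }
        double sum = 0;
        for (double s : observedSeconds) {
            sum += s;
        }
        double mean = sum / observedSeconds.size();
        if (mean <= 0) {
            return currentBytes; // ignore unusable measurements
        }
        long next = (long) (currentBytes * (targetSeconds / mean));
        return Math.max(minBytes, Math.min(maxBytes, next));
    }
}
```

With a 30 second target, a wave of 256 MB groups that averaged 60 seconds
would be followed by 128 MB groups, while a wave that averaged 15 seconds
would be followed by 512 MB groups, subject to the bounds.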
was:
There are scenarios like the following wherein progressive split creation for
the initial inputs would be a useful feature.
1) large inputs that produce lots of splits
2) multi-wave mappers, where stats from the first wave of mappers may be used
to optimize the next wave of mappers
3) starting some mappers on partial data optimistically while waiting for
additional split filters to be available from other vertices. Then applying
those split filters on the remaining data in the hope that we may have to read
less. /cc [~gopalv]
4) maintaining locality of splits as the job progresses
5) others???
Progressive split creation would involve creating an initial set of splits (say
with a good spatial distribution) to start the data read. Then, based on stats
from the initial reads, the next set of splits could use different heuristics
for grouping, e.g. if the splits are taking too long to process then reduce
their size, and vice versa. New splits could be created for new mappers or for
existing mappers based on their location, e.g. if a mapper is already running
on node A then send it more splits that are on node A. Other heuristics are
possible.
> Progressive input split creation and grouping
> ---------------------------------------------
>
> Key: TEZ-2200
> URL: https://issues.apache.org/jira/browse/TEZ-2200
> Project: Apache Tez
> Issue Type: Task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
>
> There are scenarios like the following wherein progressive split creation for
> the initial inputs would be a useful feature.
> 1) large inputs that produce lots of splits
> 2) multi-wave mappers, where stats from the first wave of mappers may be
> used to optimize the next wave of mappers
> 3) starting some mappers on partial data optimistically while waiting for
> additional split filters to be available from other vertices. Then applying
> those split filters on the remaining data in the hope that we may have to
> read less. /cc [~gopalv]
> 4) maintaining locality of splits as the job progresses
> 5) handling node failures by re-grouping splits from a failed node onto other
> local nodes
> 6) others???
> Progressive split creation would involve creating an initial set of splits
> (say with a good spatial distribution) to start the data read. Then, based on
> stats from the initial reads, the next set of splits could use different
> heuristics for grouping, e.g. if the splits are taking too long to process
> then reduce their size, and vice versa. New splits could be created for new
> mappers or for existing mappers based on their location, e.g. if a mapper is
> already running on node A then send it more splits that are on node A. Other
> heuristics are possible.
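
The locality and node-failure points (items 4 and 5) could be sketched as a
per-host queue of pending splits: a mapper already running on node A asks for
more splits local to A, and when a node fails its pending splits are either
still reachable through another replica host or handed back for re-grouping.
The class, the `Split` descriptor, and the method names below are hypothetical
illustrations, not Tez APIs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch (not a Tez API) of locality-aware split hand-out. */
public class LocalSplitQueue {
    /** Hypothetical split descriptor: an id plus the hosts holding a replica. */
    public static final class Split {
        public final String id;
        public final List<String> hosts;

        public Split(String id, List<String> hosts) {
            this.id = id;
            this.hosts = hosts;
        }
    }

    private final Map<String, Deque<Split>> pendingByHost = new HashMap<>();
    private final Set<String> assigned = new HashSet<>();

    /** Registers a split under every host that holds one of its replicas. */
    public void add(Split split) {
        for (String host : split.hosts) {
            pendingByHost.computeIfAbsent(host, k -> new ArrayDeque<>()).add(split);
        }
    }

    /** Returns the next unassigned split local to the host, or null if none. */
    public Split pollLocal(String host) {
        Deque<Split> queue = pendingByHost.get(host);
        if (queue == null) {
            return null;
        }
        Split split;
        while ((split = queue.poll()) != null) {
            if (assigned.add(split.id)) {
                return split; // first time this split is handed out
            }
            // otherwise it was already taken through another replica host
        }
        return null;
    }

    /**
     * Drops the failed host and returns the pending splits that no longer
     * have any known replica host, so the caller can re-group them elsewhere.
     */
    public List<Split> nodeFailed(String host) {
        List<Split> orphans = new ArrayList<>();
        Deque<Split> queue = pendingByHost.remove(host);
        if (queue == null) {
            return orphans;
        }
        for (Split split : queue) {
            if (assigned.contains(split.id)) {
                continue;
            }
            boolean reachable = false;
            for (String other : split.hosts) {
                if (pendingByHost.containsKey(other)) {
                    reachable = true;
                    break;
                }
            }
            if (!reachable) {
                orphans.add(split);
            }
        }
        return orphans;
    }
}
```

This sketch only tracks pending splits; splits that were in flight on the
failed node would need separate bookkeeping before they could be re-queued.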