[ https://issues.apache.org/jira/browse/TEZ-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-2200:
----------------------------
    Description: 
There are scenarios like the following wherein progressive split creation for 
the initial inputs would be a useful feature.
1) large inputs that produce lots of splits
2) multi-wave mappers where stats from the first wave of mappers may be used 
to optimize the next wave of mappers
3) starting some mappers on partial data optimistically while waiting for 
additional split filters to be available from other vertices. Then applying 
those split filters on the remaining data in the hope that we may have to read 
less. /cc [~gopalv]
4) maintaining locality of splits as the job progresses
5) handling node failures and re-grouping splits from a failed node onto other 
local nodes
6) others???

Progressive split creation would involve creating an initial set of splits (say 
with a good spatial distribution) to start the data read. Then, based on stats 
from the initial reads, the next set of splits could use different grouping 
heuristics. E.g. if the splits are taking too long to process, reduce the split 
size, and vice versa. New splits could be created for new mappers, or for 
existing mappers based on their location. E.g. if a mapper is already running 
on node A, send it more splits that are on node A. Other heuristics are 
possible.
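
As a rough illustration of the two heuristics above (resizing groups from 
observed task times, and preferring splits local to a running mapper), here is 
a minimal sketch. It is not Tez code: Split, next_group_size, group_by_node and 
pick_group_for are hypothetical names, and the proportional scaling rule and 
the 16 MB / 1 GB bounds are assumptions chosen only for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical split descriptor; not a Tez class."""
    path: str
    length: int  # bytes
    node: str    # node that holds the data


def next_group_size(current_size, observed_secs, target_secs,
                    min_size=16 << 20, max_size=1 << 30):
    """Scale the group size toward a target task duration (assumed rule).

    Tasks slower than the target shrink the next groups; faster tasks grow them.
    """
    if observed_secs <= 0:
        return current_size  # no usable stats yet, keep the current size
    scaled = int(current_size * target_secs / observed_secs)
    return max(min_size, min(max_size, scaled))


def group_by_node(splits, group_size):
    """Greedily pack the splits on each node into groups of about group_size bytes."""
    by_node = defaultdict(list)
    for s in splits:
        by_node[s.node].append(s)
    groups = defaultdict(list)
    for node, node_splits in by_node.items():
        current, current_bytes = [], 0
        for s in node_splits:
            # Close the current group once adding this split would exceed the budget.
            if current and current_bytes + s.length > group_size:
                groups[node].append(current)
                current, current_bytes = [], 0
            current.append(s)
            current_bytes += s.length
        if current:
            groups[node].append(current)
    return dict(groups)


def pick_group_for(mapper_node, groups):
    """Prefer a group local to the mapper's node, else take any remaining group."""
    if groups.get(mapper_node):
        return groups[mapper_node].pop(0)
    for node_groups in groups.values():
        if node_groups:
            return node_groups.pop(0)
    return None
```

In this sketch the first wave would be grouped with some default size, and each 
later wave would call next_group_size with the observed task durations before 
regrouping the remaining splits. A real implementation would also need to 
handle rack locality and node failures, which are left out here.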


  was:
There are scenarios like the following wherein progressive split creation for 
the initial inputs would be a useful feature.
1) large inputs that produce lots of splits
2) multi-wave mappers where stats from the first wave of mappers may be used 
to optimize the next wave of mappers
3) starting some mappers on partial data optimistically while waiting for 
additional split filters to be available from other vertices. Then applying 
those split filters on the remaining data in the hope that we may have to read 
less. /cc [~gopalv]
4) maintaining locality of splits as the job progresses
5) others???

Progressive split creation would involve creating an initial set of splits (say 
with a good spatial distribution) to start the data read. Then based on stats 
from the initial reads, the next set of splits could have different heuristics 
for grouping. E.g. if the splits are taking too long to process then reduce the 
size and vice versa. New splits could be created for new mappers or created for 
existing mappers based on their location. E.g. if a mapper is already running 
on node A then send it more splits that are on node A. Potentially more 
heuristics.



> Progressive input split creation and grouping
> ---------------------------------------------
>
>                 Key: TEZ-2200
>                 URL: https://issues.apache.org/jira/browse/TEZ-2200
>             Project: Apache Tez
>          Issue Type: Task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>
> There are scenarios like the following wherein progressive split creation for 
> the initial inputs would be a useful feature.
> 1) large inputs that produce lots of splits
> 2) multi-wave mappers where stats from the first wave of mappers may be 
> used to optimize the next wave of mappers
> 3) starting some mappers on partial data optimistically while waiting for 
> additional split filters to be available from other vertices. Then applying 
> those split filters on the remaining data in the hope that we may have to 
> read less. /cc [~gopalv]
> 4) maintaining locality of splits as the job progresses
> 5) handling node failures and re-grouping splits from a failed node onto 
> other local nodes
> 6) others???
>
> Progressive split creation would involve creating an initial set of splits 
> (say with a good spatial distribution) to start the data read. Then, based on 
> stats from the initial reads, the next set of splits could use different 
> grouping heuristics. E.g. if the splits are taking too long to process, 
> reduce the split size, and vice versa. New splits could be created for new 
> mappers, or for existing mappers based on their location. E.g. if a mapper is 
> already running on node A, send it more splits that are on node A. Other 
> heuristics are possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
