Bikas Saha created TEZ-2200:
-------------------------------

             Summary: Progressive input split creation and grouping
                 Key: TEZ-2200
                 URL: https://issues.apache.org/jira/browse/TEZ-2200
             Project: Apache Tez
          Issue Type: Task
            Reporter: Bikas Saha
            Assignee: Bikas Saha


There are scenarios like the following wherein progressive split creation for 
the initial inputs would be a useful feature.
1) large inputs that produce lots of splits
2) multi-wave mappers where in stats from the first wave of mappers may be used 
to optimize the next wave of mappers
3) starting some mappers on partial data optimistically while waiting for 
additional split filters to be available from other vertices. Then applying 
those split filters on the remaining data in the hope that we may have to read 
less. /cc [~gopalv]
4) maintaining locality of splits as the job progresses
5) others???

Progressive split creation would involve creating an initial set of splits (say 
with a good spatial distribution) to start the data read. Then based on stats 
from the initial reads, the next set of splits could have different heuristics 
for grouping. E.g. if the splits are taking too long to process then reduce the 
size and vice versa. New splits could be created for new mappers or created for 
existing mappers based on their location. E.g. if a mapper is already running 
on node A then send it more splits that are on node A. Potentially more 
heuristics.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to