[ 
https://issues.apache.org/jira/browse/TAJO-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105028#comment-14105028
 ] 

Hyunsik Choi commented on TAJO-983:
-----------------------------------

Hi Mai, 

The main reason why a task reads directly reads intermediate data without 
copying is to reduce extra storage occupation and unnecessary I/O overheads. In 
practice, intermediate data can be usually from few bytes up to hundreds of 
mega bytes. Some of them in general are already stored in local disk. For large 
intermediate, I believe that non copy approach is worthy.

Also, a task which consumes intermediate data uses MergeScanner which reads 
multiple file fragments iteratively. So, we don't need to merge input files 
into one file.

In addition, please take a look at TAJO-992 for per-node shuffler. Before 
TAJO-992, in hash shuffle, each task makes a number of small files. So, Fetcher 
merged small chunks stored in a number of small partitions into one stream and 
pulls the stream. After TAJO-992, each node makes single intermediate data file 
for each partition. A task fetches corresponding partition files from a number 
of workers and read directly them without merging them. So, the cost of reading 
a number of small files became cheap.

Thanks,
Hyunsik

> Worker should directly read Intermediate data stored in localhost rather than 
> fetching
> --------------------------------------------------------------------------------------
>
>                 Key: TAJO-983
>                 URL: https://issues.apache.org/jira/browse/TAJO-983
>             Project: Tajo
>          Issue Type: Bug
>          Components: data shuffle
>            Reporter: Hyunsik Choi
>            Assignee: Mai Hai Thanh
>         Attachments: TAJO-983.140820.0.patch.txt
>
>
> Currently, worker always fetches all intermediate via Fetcher and than store 
> them in local file system even though some intermediate data already are  
> stored in local file system. It is inefficient and causes unnecessary I/O and 
> extra storage occupation. We should improve it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to