[
https://issues.apache.org/jira/browse/TAJO-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105028#comment-14105028
]
Hyunsik Choi commented on TAJO-983:
-----------------------------------
Hi Mai,
The main reason why a task reads directly reads intermediate data without
copying is to reduce extra storage occupation and unnecessary I/O overheads. In
practice, intermediate data can be usually from few bytes up to hundreds of
mega bytes. Some of them in general are already stored in local disk. For large
intermediate, I believe that non copy approach is worthy.
Also, a task which consumes intermediate data uses MergeScanner which reads
multiple file fragments iteratively. So, we don't need to merge input files
into one file.
In addition, please take a look at TAJO-992 for per-node shuffler. Before
TAJO-992, in hash shuffle, each task makes a number of small files. So, Fetcher
merged small chunks stored in a number of small partitions into one stream and
pulls the stream. After TAJO-992, each node makes single intermediate data file
for each partition. A task fetches corresponding partition files from a number
of workers and read directly them without merging them. So, the cost of reading
a number of small files became cheap.
Thanks,
Hyunsik
> Worker should directly read Intermediate data stored in localhost rather than
> fetching
> --------------------------------------------------------------------------------------
>
> Key: TAJO-983
> URL: https://issues.apache.org/jira/browse/TAJO-983
> Project: Tajo
> Issue Type: Bug
> Components: data shuffle
> Reporter: Hyunsik Choi
> Assignee: Mai Hai Thanh
> Attachments: TAJO-983.140820.0.patch.txt
>
>
> Currently, worker always fetches all intermediate via Fetcher and than store
> them in local file system even though some intermediate data already are
> stored in local file system. It is inefficient and causes unnecessary I/O and
> extra storage occupation. We should improve it.
--
This message was sent by Atlassian JIRA
(v6.2#6252)