[
https://issues.apache.org/jira/browse/TAJO-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104953#comment-14104953
]
Mai Hai Thanh commented on TAJO-983:
------------------------------------
Thanks, [~hyunsik]!
I investigated your interesting approach. However, I think the two approaches
have the same effect. When there are multiple file chunks (which is almost
always the case unless the data is very small), we have to merge them into one
file, which can be either the file returned by Fetcher::get() or a
FileFragment. Merging multiple chunks cannot be done without copying data. When
there is only one file chunk to fetch and it does not represent a complete
file, we also have to copy the data. When there is only one file chunk to fetch
and it represents a complete file (which happens only with very small data), we
could in theory skip the copy and use the file directly. Even so, later
processing code would have to treat this file as a special case, because it is
not stored in the conventional default folder and does not follow the
conventional file name format. Besides, because the chunk size is limited to
only 8 KB by default, copying the data is cheap. So, to keep the code clean and
easy to maintain, and because the cost is low and the case is rare, I prefer to
keep the current approach.
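
To make the three cases concrete, here is a rough sketch of the decision. It is
only an illustration and is not taken from the patch: ChunkFetchPlan, Chunk,
Action and decide() are hypothetical names, not Tajo classes, and the sketch
assumes a chunk is described by an offset and a length within a local file.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Tajo.
final class ChunkFetchPlan {

    enum Action { MERGE_BY_COPY, COPY_PARTIAL, USE_FILE_DIRECTLY }

    // A byte range [offset, offset + length) inside a local file of fileLength bytes.
    static final class Chunk {
        final long offset;
        final long length;
        final long fileLength;

        Chunk(long offset, long length, long fileLength) {
            this.offset = offset;
            this.length = length;
            this.fileLength = fileLength;
        }

        // True only when the chunk spans the file from its first to its last byte.
        boolean coversWholeFile() {
            return offset == 0 && length == fileLength;
        }
    }

    static Action decide(List<Chunk> chunks) {
        if (chunks.isEmpty()) {
            throw new IllegalArgumentException("no chunks to fetch");
        }
        if (chunks.size() > 1) {
            // Several chunks must end up in one file, so copying is unavoidable.
            return Action.MERGE_BY_COPY;
        }
        // A single chunk can be used in place only if it is a complete file.
        return chunks.get(0).coversWholeFile()
                ? Action.USE_FILE_DIRECTLY
                : Action.COPY_PARTIAL;
    }

    public static void main(String[] args) {
        System.out.println(decide(Arrays.asList(
                new Chunk(0, 8192, 20000), new Chunk(8192, 8192, 20000))));
        System.out.println(decide(Arrays.asList(new Chunk(8192, 8192, 20000))));
        System.out.println(decide(Arrays.asList(new Chunk(0, 4096, 4096))));
    }
}
```

Only the USE_FILE_DIRECTLY branch would save a copy, and it is the branch that
needs special handling in later processing code.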
> Worker should directly read Intermediate data stored in localhost rather than
> fetching
> --------------------------------------------------------------------------------------
>
> Key: TAJO-983
> URL: https://issues.apache.org/jira/browse/TAJO-983
> Project: Tajo
> Issue Type: Bug
> Components: data shuffle
> Reporter: Hyunsik Choi
> Assignee: Mai Hai Thanh
> Attachments: TAJO-983.140820.0.patch.txt
>
>
> Currently, a worker always fetches all intermediate data via Fetcher and then
> stores it in the local file system, even though some intermediate data is
> already stored in the local file system. This is inefficient and causes
> unnecessary I/O and extra storage use. We should improve it.