[
https://issues.apache.org/jira/browse/TAJO-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105202#comment-14105202
]
Hyunsik Choi commented on TAJO-983:
-----------------------------------
Yes, you are right. Most of the intermediate data is stored on remote hosts,
and only some of it is stored on the local disk. Theoretically, if the data of
a query is evenly shuffled into N partitions (intermediate data) in a cluster
of N nodes, 1/N of the data will be stored on the local disk.
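To make the 1/N estimate concrete, here is a minimal sketch. The class name,
method names, and the figures in main (20 nodes, 40 GiB of intermediate data)
are assumptions for illustration only; they are not Tajo code or measured
numbers.

```java
// Illustrative only: estimates how much shuffled intermediate data is
// already local, assuming a perfectly even shuffle across N nodes.
final class ShuffleLocality {

    // Fraction of intermediate data that stays on the producing node.
    static double localFraction(int nodes) {
        if (nodes <= 0) {
            throw new IllegalArgumentException("nodes must be positive");
        }
        return 1.0 / nodes;
    }

    // Bytes that would not need to be fetched if local data were read directly.
    static long localBytes(long totalIntermediateBytes, int nodes) {
        if (nodes <= 0) {
            throw new IllegalArgumentException("nodes must be positive");
        }
        return totalIntermediateBytes / nodes;
    }

    public static void main(String[] args) {
        int nodes = 20;            // assumed cluster size
        long total = 40L << 30;    // assumed 40 GiB of intermediate data
        System.out.println("local fraction: " + localFraction(nodes));
        System.out.println("local bytes:    " + localBytes(total, nodes));
    }
}
```

Real shuffles are rarely perfectly even, so this should be read as an upper-level
estimate rather than a prediction.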
If tens of cluster nodes handle several terabytes, we can usually see tens of
gigabytes of aggregated intermediate data across the cluster nodes; some ETL
jobs handle a volume of intermediate data similar to that of the input data.
Even though only some of it is stored on local disks, it would be better to
avoid unnecessary reads and writes.
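The idea can be sketched roughly as follows: before fetching, check whether
the host in the fetch URI is the worker's own host, and read the file directly
if so. This is only a sketch under assumptions; IntermediateSource, Mode, and
choose are hypothetical names, and the example URIs are made up. It does not
describe the attached patch or Tajo's actual Fetcher API.

```java
import java.net.URI;

// Hypothetical helper: decides whether a piece of intermediate data can be
// read from the local disk instead of being fetched over the network.
final class IntermediateSource {

    enum Mode { LOCAL_READ, REMOTE_FETCH }

    // fetchUri:  where the intermediate data is served from (assumed format)
    // localHost: the host name this worker is known by in the cluster
    static Mode choose(URI fetchUri, String localHost) {
        String host = fetchUri.getHost();
        // Fall back to a remote fetch if the host cannot be determined.
        if (host == null || localHost == null) {
            return Mode.REMOTE_FETCH;
        }
        boolean local = host.equalsIgnoreCase(localHost)
                || host.equalsIgnoreCase("localhost")
                || host.equals("127.0.0.1");
        return local ? Mode.LOCAL_READ : Mode.REMOTE_FETCH;
    }

    public static void main(String[] args) {
        String me = "node1"; // assumed name of this worker's host
        System.out.println(choose(URI.create("http://node1:28080/?qid=1"), me));
        System.out.println(choose(URI.create("http://node2:28080/?qid=1"), me));
    }
}
```

Comparing host names is the simplest possible check; a real implementation
would also have to handle host aliases and multiple network interfaces.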
Best regards,
Hyunsik
> Worker should directly read Intermediate data stored in localhost rather than
> fetching
> --------------------------------------------------------------------------------------
>
> Key: TAJO-983
> URL: https://issues.apache.org/jira/browse/TAJO-983
> Project: Tajo
> Issue Type: Bug
> Components: data shuffle
> Reporter: Hyunsik Choi
> Assignee: Mai Hai Thanh
> Attachments: TAJO-983.140820.0.patch.txt
>
>
> Currently, a worker always fetches all intermediate data via Fetcher and then
> stores it in the local file system, even though some of the intermediate data
> is already stored in the local file system. This is inefficient and causes
> unnecessary I/O and extra storage occupation. We should improve it.
--
This message was sent by Atlassian JIRA
(v6.2#6252)