[ 
https://issues.apache.org/jira/browse/TAJO-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105202#comment-14105202
 ] 

Hyunsik Choi commented on TAJO-983:
-----------------------------------

Yes, you are right. Most of the intermediate data is stored on remote hosts, 
and only some of it is stored on the local disk. Theoretically, if a query's 
data is evenly shuffled into N partitions (intermediate data) in a cluster of 
N nodes, 1/N of the data will be stored on each node's local disk.
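
As a rough illustration of the 1/N estimate, here is a minimal Python sketch. 
It assumes one partition per node and a simple modulo partitioner; the function 
name and the row counts are hypothetical and are not taken from Tajo.

```python
from collections import Counter


def local_fraction(num_rows: int, num_nodes: int, local_node: int = 0) -> float:
    """Fraction of evenly shuffled rows that land on `local_node`.

    Assumes one partition per node and a modulo partitioner. This is a
    simplification for illustration, not Tajo's actual shuffle code.
    """
    # Assign each row key to a partition (and hence a node) by modulo.
    counts = Counter(key % num_nodes for key in range(num_rows))
    return counts[local_node] / num_rows


if __name__ == "__main__":
    # Print the local share for a few hypothetical cluster sizes.
    for n in (4, 10, 50):
        print(n, local_fraction(1_000_000, n))
```

With an even shuffle the local share shrinks as the cluster grows, so the 
benefit per node is bounded by roughly 1/N of the intermediate data.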

If tens of cluster nodes handle several terabytes, we usually see tens of 
gigabytes of aggregated intermediate data across the cluster nodes; some ETL 
jobs handle a volume of intermediate data similar to that of the input data. 
Even though only part of it is stored on local disks, it would be better to 
avoid the unnecessary reads and writes.
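
To make the idea concrete, a minimal sketch in Python of how a worker could 
separate intermediate data it already holds from data it must fetch. The 
`Partition` type, the `plan_reads` function, and the host names and paths are 
all hypothetical; this is not the code in the attached patch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Partition:
    host: str  # node that holds the intermediate file (hypothetical field)
    path: str  # location of the file on that node's disk (hypothetical field)


def plan_reads(
    parts: List[Partition], local_host: str
) -> Tuple[List[Partition], List[Partition]]:
    """Split partitions into (read directly from local disk, fetch remotely)."""
    local = [p for p in parts if p.host == local_host]
    remote = [p for p in parts if p.host != local_host]
    return local, remote


if __name__ == "__main__":
    parts = [
        Partition("node1", "/tmp/q1/p0"),
        Partition("node2", "/tmp/q1/p1"),
        Partition("node1", "/tmp/q1/p2"),
    ]
    local, remote = plan_reads(parts, "node1")
    # Only the remote partitions would need a network fetch and a second write.
    print(len(local), "local,", len(remote), "remote")
```

Partitions in the local list can be opened in place, which skips both the 
network transfer and the extra copy written to the local file system.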

Best regards,
Hyunsik

> Worker should directly read Intermediate data stored in localhost rather than 
> fetching
> --------------------------------------------------------------------------------------
>
>                 Key: TAJO-983
>                 URL: https://issues.apache.org/jira/browse/TAJO-983
>             Project: Tajo
>          Issue Type: Bug
>          Components: data shuffle
>            Reporter: Hyunsik Choi
>            Assignee: Mai Hai Thanh
>         Attachments: TAJO-983.140820.0.patch.txt
>
>
> Currently, a worker always fetches all intermediate data via Fetcher and then 
> stores it in the local file system, even though some of the intermediate data 
> is already stored in the local file system. This is inefficient and causes 
> unnecessary I/O and extra storage usage. We should improve it.


