[ 
https://issues.apache.org/jira/browse/TAJO-982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260913#comment-14260913
 ] 

Jihoon Son edited comment on TAJO-982 at 12/30/14 9:07 AM:
-----------------------------------------------------------

Hi guys, I have two ideas for this issue.
* When writing intermediate data for shuffle, we can merge small files into
larger ones. I think this is not feasible, because the merge would have to
take task assignment into account, which forces task assignment to be static.
* As described in this issue, we can improve Fetchers to get multiple files in
a single request. This approach raises a further question about the
transmission protocol, for which I'm considering two options:
** Keep HTTP as in the current implementation, but improve the Fetchers and
PullServers to handle an HTTP request for multiple files. For example, a
Fetcher can request a virtual HTTP address that identifies multiple files. A
PullServer that receives such a request can extract the real file names from
the virtual address, dynamically merge those files into a single file, and
send it.
** Use an alternative transmission protocol that natively supports
transmitting multiple files in a single request.

I think the last one is the best approach, but I don't yet have much
background on it.
What do you think of these approaches?
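
To make the virtual-address idea a bit more concrete, here is a minimal sketch
of how a Fetcher could pack several file names into one HTTP address and how a
PullServer could unpack them. Everything in it is an assumption made for
illustration: the class MultiFetchUri, the "files" query parameter, and the
comma-separated layout are hypothetical and are not part of Tajo's existing
Fetcher or PullServer code. It needs Java 10 or later for the Charset
overloads of URLEncoder and URLDecoder.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tajo: packs several shuffle output names
// into one request URI and unpacks them again on the server side.
final class MultiFetchUri {
  // Assumed query parameter name; a real protocol may choose something else.
  private static final String PARAM = "files";

  // Fetcher side: build one "virtual" address that names several files.
  static URI encode(String host, int port, List<String> fileNames) {
    List<String> escaped = new ArrayList<>();
    for (String name : fileNames) {
      // Encoding each name escapes ',' so it can safely act as the separator.
      escaped.add(URLEncoder.encode(name, StandardCharsets.UTF_8));
    }
    return URI.create(
        "http://" + host + ":" + port + "/?" + PARAM + "=" + String.join(",", escaped));
  }

  // PullServer side: recover the real file names from the virtual address.
  static List<String> decode(URI uri) {
    String query = uri.getRawQuery();
    String prefix = PARAM + "=";
    if (query == null || !query.startsWith(prefix)) {
      throw new IllegalArgumentException("not a multi-file request: " + uri);
    }
    List<String> names = new ArrayList<>();
    String value = query.substring(prefix.length());
    if (value.isEmpty()) {
      return names;
    }
    for (String part : value.split(",")) {
      names.add(URLDecoder.decode(part, StandardCharsets.UTF_8));
    }
    return names;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("part-0", "dir/part 1");
    URI uri = encode("localhost", 8080, names);
    System.out.println(uri);
    System.out.println(decode(uri));
  }
}
```

The sketch covers only the addressing step. After decoding, the PullServer
would still need to stream the named files back to back as one response, and
the Fetcher would need some way to tell where each file ends.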



> Improve Fetcher to get multiple shuffle outputs through a request
> -----------------------------------------------------------------
>
>                 Key: TAJO-982
>                 URL: https://issues.apache.org/jira/browse/TAJO-982
>             Project: Tajo
>          Issue Type: Improvement
>          Components: data shuffle
>            Reporter: Hyunsik Choi
>            Assignee: Jihoon Son
>             Fix For: 0.10
>
>
> Currently, a Fetcher can request at most one shuffle output at a time. This
> implementation can cause performance degradation even when the intermediate
> data is actually small.
> For example, if the input data set of the first stage is big and the
> intermediate data is very small, QueryMaster will choose only a few nodes
> for the next execution block. Since each node keeps a limited number of
> threads for fetching, it will take a long time for the nodes of the next
> stage to fetch all the intermediate data.
> If a Fetcher could get multiple shuffle outputs through a single request, it
> would solve the slowness that occurs in such cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
