Cool, thx Hyunsik

On Wednesday, February 19, 2014, Hyunsik Choi <[email protected]> wrote:
> Henry,
>
> Thank you for your comment :)
>
> In my opinion, this explanation is too specific, and comments in the
> source code are more accessible for this kind of explanation. So, for now,
> I'll add this explanation to the source code as comments. Later, I'll try
> to create some design documentation.
>
> Thanks,
> Hyunsik Choi
>
>
> On Wed, Feb 19, 2014 at 10:44 AM, Henry Saputra <[email protected]> wrote:
>
> > Hyunsik,
> >
> > I am +1 that this is worthy of a wiki page or a separate page to explain
> > the technical flow. Tracing the flow in the code is confusing.
> >
> > Maybe something similar to the YARN doc [1] (the more details the
> > better ^_^)
> >
> >
> > - Henry
> >
> > [1] http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html
> >
> >
> > On Thu, Feb 13, 2014 at 2:18 AM, Hyunsik Choi <[email protected]> wrote:
> > > Hi Min,
> > >
> > > Above all, I'm very sorry for the lack of documentation in the source
> > > code. So far, we have developed Tajo with insufficient documentation,
> > > pursuing only a quick-and-dirty manner. We should add more
> > > documentation to the source code.
> > >
> > > I'm going to explain how Tajo uses disk volume locality. Before this
> > > explanation, I would like to explain the node locality that you may
> > > already know. Similar to MR, Tajo also uses three-level locality for
> > > each task: for each task, the scheduler tries a local node, the closest
> > > rack, and a random node, in that order. In Tajo, the scheduler
> > > additionally tries the local disk volume prior to the local node.
> > >
> > > The important thing is that we don't need to be aware of the actual
> > > disk volume IDs on each local node; we just assign disk volumes to the
> > > TaskRunners in a node in a round-robin manner. That is sufficient to
> > > improve load balancing by considering disk volumes.
> > > Initially, TaskRunners are not mapped to disk volumes in each worker.
> > > The mapping occurs dynamically in the scheduler. For example, suppose
> > > there are 6 local tasks for some node N1 which has 3 disk volumes (v1,
> > > v2, and v3), and three TaskRunners (T1, T2, and T3) will be running on
> > > node N1.
> > >
> > > When tasks are added to the scheduler, the scheduler gets the disk
> > > volume id from each task. As you know, each volume id is just an
> > > integer, a logical identifier that distinguishes different disk
> > > volumes. Then, the scheduler builds a map between the disk volume ids
> > > (obtained from BlockStorageLocation) of each node and lists of tasks
> > > (DefaultTaskScheduler::addQueryUnitAttemptId). In other words, each
> > > entry in the map consists of one disk volume id and the list of tasks
> > > corresponding to that disk volume.
> > >
> > > When the first task is requested by a TaskRunner T1 in node N1, the
> > > scheduler just assigns the first disk volume v1 to T1, and then it
> > > schedules one task which belongs to disk volume v1. Later, when a task
> > > is requested by a different TaskRunner T2 on node N1, the scheduler
> > > assigns the second disk volume v2 to T2, and then it schedules a task
> > > which belongs to disk volume v2. When a task request comes from T1
> > > again, the scheduler schedules one task from disk volume v1 to T1,
> > > because T1 is already mapped to v1.
> > >
> > > Like MR, Tajo uses dynamic scheduling, and it works very well in
> > > environments where nodes have disks of differing performance. If you
> > > have additional questions, please feel free to ask.
> > >
> > > Also, I'll create a Jira issue to add this explanation to
> > > DefaultTaskScheduler.
> > >
> > > - hyunsik
> > >
> > >
> > > On Thu, Feb 13, 2014 at 5:29 PM, Min Zhou <[email protected]> wrote:
> > >
> > >> Hi Jihoon,
> > >>
> > >> Thank you for your answer.
> > >> However, it seems you didn't answer how Tajo uses disk information
> > >> to balance the IO overhead.
> > >>
> > >> And I still can't understand the details; they are quite complex to
> > >> me, especially the class TaskBlockLocation:
> > >>
> > >> public static class TaskBlockLocation {
> > >>   // This is a mapping from diskId to a list of pending tasks, right?
> > >>   private HashMap<Integer, LinkedList<QueryUnitAttemptId>
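The scheduling policy Hyunsik describes above (a map from disk volume id to pending tasks, plus lazy round-robin binding of TaskRunners to volumes on their first request) can be sketched in Java. This is a simplified illustration, not Tajo's actual implementation: the class `VolumeAwareScheduler`, its method names, and the use of plain `String` task/runner ids are all assumptions made for the example; only the overall policy follows the thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the volume-aware policy described in the thread.
// All names here are hypothetical; in Tajo the logic lives in
// DefaultTaskScheduler, keyed by ids from BlockStorageLocation.
public class VolumeAwareScheduler {
    // diskVolumeId -> pending tasks whose input block lives on that volume
    private final Map<Integer, LinkedList<String>> tasksByVolume = new HashMap<>();
    // runnerId -> the volume the runner was bound to on its first request
    private final Map<String, Integer> runnerToVolume = new HashMap<>();
    // volumes in the order they are handed out, round robin
    private final List<Integer> volumeOrder = new ArrayList<>();
    private int nextVolumeIdx = 0;

    // Called as tasks are added; volumeId comes from the block's storage location.
    public void addTask(int volumeId, String taskId) {
        tasksByVolume.computeIfAbsent(volumeId, v -> {
            volumeOrder.add(v);
            return new LinkedList<>();
        }).add(taskId);
    }

    // Called when a TaskRunner asks for work. The first request from a runner
    // binds it to the next volume in round-robin order; later requests are
    // served from that same volume's queue.
    public String assignTask(String runnerId) {
        Integer volume = runnerToVolume.get(runnerId);
        if (volume == null) {
            if (volumeOrder.isEmpty()) return null;
            volume = volumeOrder.get(nextVolumeIdx % volumeOrder.size());
            nextVolumeIdx++;
            runnerToVolume.put(runnerId, volume);
        }
        LinkedList<String> queue = tasksByVolume.get(volume);
        return (queue == null || queue.isEmpty()) ? null : queue.poll();
    }
}
```

In the thread's example (node N1 with volumes v1..v3 and runners T1..T3), T1's first request binds it to v1, T2's to v2, and any later request from T1 draws again from v1's queue. A real scheduler would also need a fallback when a runner's bound volume runs out of tasks (e.g. stealing from another volume, then rack-local and random scheduling); the sketch simply returns null in that case.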
