[jira] Commented: (HADOOP-4565) MultiFileInputSplit can use data locality information to create splits

Joydeep Sen Sarma (JIRA) Wed, 26 Nov 2008 01:23:38 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650904#action_12650904
 ]


Joydeep Sen Sarma commented on HADOOP-4565:
-------------------------------------------

a few comments:

- can u explain whether the blocks are split by racks or nodes? (the data 
structure says nodeToBlock etc. - but all the comments refer to 'racks')

if we are combining splits by nodes - then wouldn't it make sense to also sort 
the nodes by racks first (and perhaps only then by number of blocks)? (so that 
we can combine blocks that cannot be combined within a given node with other 
blocks in the same rack?)

- in getMoreSplits() - i didn't understand:

   +      if (minSize == 0 || curSplitSize > minSize || !iter.hasNext()) {

  it seems iter.hasNext() is being used to try to detect the end of the loop - 
but iter,hasNext() can be false even in the middle of the loop - right? (the 
way i am reading it - iter is being used to 'seek' to the current rack(/node) 
in the nodeToBlocks array based on sorting by number of blocks in rack)

- somewhat confused by how overflow blocks (curSplitSize < minSize) are being 
handled. looks like with current scheme - if there is even one overflow block 
from  current rack - then it will be combined with blocks available from the 
next rack. this seems to have some issues:
  * the racks list is going to have both the racks - but the blocks are 
probably overwhelmingly dominated by the next rack
  * the racks list is not cleared after the overflow blocks are dealt with in 
the first split created on the next rack. so the next of splits will all have 
the previous rack in the racks list unnecessarily (I presume this will lead to 
incorrect inference about the locality of splits)

instead - we could have collected all the overflow blocks from each rack and 
just combined them separately into splits at the end. would be simpler to 
understand/code and not change the overall amount of locality much i imagine


> MultiFileInputSplit can use data locality information to create splits
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-4565
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4565
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: CombineMultiFile.patch, CombineMultiFile2.patch, 
> CombineMultiFile3.patch
>
>
> The MultiFileInputFormat takes a set of paths and creates splits based on 
> file sizes. Each splits contains a few files an each split are roughly equal 
> in size. It would be efficient if we can extend this InputFormat to create 
> splits such each all the blocks in one split and either node-local or 
> rack-local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4565) MultiFileInputSplit can use data locality information to create splits

Reply via email to