[jira] Commented: (HADOOP-4565) MultiFileInputSplit can use data locality information to create splits

Joydeep Sen Sarma (JIRA) Fri, 16 Jan 2009 21:00:26 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664802#action_12664802
 ]


Joydeep Sen Sarma commented on HADOOP-4565:
-------------------------------------------

The file size comment is interesting. Part of the reason we are interested 
in this patch is that we have cases with lots of small files, so this 
actually highlights the need to preserve node locality. (Also, Matei 
reported a pretty significant difference in node/rack locality on our test 
cluster. That is also troubling me; we can follow that up separately.)

Also, the code is already structured in a way that makes node locality very 
easy, I think. (I would hazard that as we walk the blocks in a rack and 
organize them into splits, all we need to do is order them by hostname and we 
would be done.) Thoughts?
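
A minimal sketch of the ordering idea above, not code from the patch. Block, 
Split, group and maxSplitSize are hypothetical stand-ins invented for this 
illustration. It assumes each block of a rack is listed with one hostname and 
that splits are filled greedily up to a size limit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NodeLocalGrouping {
    /** Hypothetical stand-in: one block of a file and one host that stores it. */
    public static final class Block {
        public final String path;
        public final long length;
        public final String host;

        public Block(String path, long length, String host) {
            this.path = path;
            this.length = length;
            this.host = host;
        }
    }

    /** Hypothetical stand-in for a combined split. */
    public static final class Split {
        public final List<Block> blocks = new ArrayList<>();
        public long size;
    }

    /**
     * Walks the blocks of one rack in hostname order and cuts a new split
     * whenever adding the next block would exceed maxSplitSize.
     */
    public static List<Split> group(List<Block> rackBlocks, long maxSplitSize) {
        List<Block> sorted = new ArrayList<>(rackBlocks);
        // Stable sort: blocks stored on the same host become adjacent.
        sorted.sort(Comparator.comparing((Block b) -> b.host));

        List<Split> splits = new ArrayList<>();
        Split current = new Split();
        for (Block b : sorted) {
            if (!current.blocks.isEmpty() && current.size + b.length > maxSplitSize) {
                splits.add(current);
                current = new Split();
            }
            current.blocks.add(b);
            current.size += b.length;
        }
        if (!current.blocks.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

In this sketch a split can still straddle two hosts when one host's blocks do 
not fill it; such a split is then only rack-local, since all blocks come from 
the same rack.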

If we can preserve node locality, then there is never any reason to use the 
base input formats: we can always use the CombineFileInputFormat and let it 
organize splits optimally. Otherwise, the lack of node locality is something 
we are always going to struggle with. 


nit: CombineFileRecordReader - rrConstructor. 

+  static final Class [] constructorSignature = new Class [] {InputSplit.class, Configuration.class, Reporter.class, Integer.class};

can be tightened to:

+  static final Class [] constructorSignature = new Class [] {CombineFileSplit.class, Configuration.class, Reporter.class, Integer.class};

looks good otherwise.
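
To illustrate the signature nit above: a sketch, assuming the signature array 
is handed to Class.getDeclaredConstructor, which matches declared parameter 
types exactly rather than by assignability. The nested InputSplit, 
CombineFileSplit, Configuration, Reporter and ChunkReader types are 
hypothetical stand-ins defined here, not the Hadoop classes.

```java
import java.lang.reflect.Constructor;

public class ConstructorSignatureSketch {
    // Hypothetical stand-ins so the sketch compiles without Hadoop.
    public interface InputSplit {}
    public static class CombineFileSplit implements InputSplit {}
    public static class Configuration {}
    public interface Reporter {}

    /** Hypothetical per-file reader whose constructor takes the concrete split type. */
    public static class ChunkReader {
        public final Integer index;

        public ChunkReader(CombineFileSplit split, Configuration conf,
                           Reporter reporter, Integer index) {
            this.index = index;
        }
    }

    public static final Class<?>[] LOOSE =
        {InputSplit.class, Configuration.class, Reporter.class, Integer.class};
    public static final Class<?>[] TIGHT =
        {CombineFileSplit.class, Configuration.class, Reporter.class, Integer.class};

    /** Returns true if readerClass declares a constructor with exactly this signature. */
    public static boolean hasConstructor(Class<?> readerClass, Class<?>[] signature) {
        try {
            readerClass.getDeclaredConstructor(signature);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Looks up the constructor through the tight signature and invokes it. */
    public static ChunkReader create(int index) {
        try {
            Constructor<ChunkReader> ctor = ChunkReader.class.getDeclaredConstructor(TIGHT);
            return ctor.newInstance(new CombineFileSplit(), new Configuration(), null, index);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the stand-ins above, the lookup succeeds only for the signature that 
names CombineFileSplit, so the array documents exactly which constructor a 
reader class has to declare.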



> MultiFileInputSplit can use data locality information to create splits
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-4565
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4565
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: CombineMultiFile.patch, CombineMultiFile2.patch, 
> CombineMultiFile3.patch, CombineMultiFile4.patch, CombineMultiFile5.patch, 
> CombineMultiFile7.patch
>
>
> The MultiFileInputFormat takes a set of paths and creates splits based on 
> file sizes. Each split contains a few files, and the splits are roughly 
> equal in size. It would be more efficient if we could extend this 
> InputFormat to create splits such that all the blocks in one split are 
> either node-local or rack-local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
