[
https://issues.apache.org/jira/browse/HADOOP-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664802#action_12664802
]
Joydeep Sen Sarma commented on HADOOP-4565:
-------------------------------------------
the file size comment is interesting. part of the reason we are interested in
this patch is the case where there are lots of small files, so this actually
highlights the need to preserve node locality. (also, matei reported a pretty
significant difference in node/rack locality on our test cluster. that's also
troubling me, but we can follow that up separately.)
also - the code is already structured in a way that makes node locality very
easy i think (i would hazard that as we walk the blocks in a rack and organize
them into splits - all we need to do is order them by hostname and we would be
done). thoughts?
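a rough sketch of what i mean by ordering by hostname. this is not the patch's code: the Block class, the buildSplits name and the single-host-per-block simplification are all made up for illustration (real blocks have several replicas, and the real split types are Hadoop classes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NodeLocalSplitSketch {

    /** Hypothetical stand-in for a block; not a Hadoop class. */
    public static final class Block {
        public final String path;
        public final String host;   // assumption: one replica host per block
        public final long length;

        public Block(String path, String host, long length) {
            this.path = path;
            this.host = host;
            this.length = length;
        }
    }

    /**
     * Packs the blocks of one rack into splits of at most maxSplitSize bytes,
     * walking the blocks host by host so that each split tends to stay on a
     * single node. A split that spills across hosts is still rack-local.
     */
    public static List<List<Block>> buildSplits(List<Block> rackBlocks, long maxSplitSize) {
        // Group by hostname; TreeMap iterates hosts in sorted order.
        Map<String, List<Block>> byHost = new TreeMap<>();
        for (Block b : rackBlocks) {
            byHost.computeIfAbsent(b.host, h -> new ArrayList<>()).add(b);
        }

        List<List<Block>> splits = new ArrayList<>();
        List<Block> current = new ArrayList<>();
        long currentSize = 0;
        for (List<Block> hostBlocks : byHost.values()) {
            for (Block b : hostBlocks) {
                // Close the current split before it would exceed the size limit.
                if (!current.isEmpty() && currentSize + b.length > maxSplitSize) {
                    splits.add(current);
                    current = new ArrayList<>();
                    currentSize = 0;
                }
                current.add(b);
                currentSize += b.length;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

the only change from a plain rack-level walk is the grouping step, so leftover blocks at a host boundary can still share a split with the next host and remain rack-local.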
if we can preserve node locality - then there is never any reason to use the
base inputformats. we can always use the combineinputformat and let it organize
splits optimally. otherwise not having node locality is something we are always
going to struggle with.
nit: CombineFileRecordReader - rrConstructor.
+ static final Class [] constructorSignature = new Class []
+     {InputSplit.class, Configuration.class, Reporter.class, Integer.class};
can be tightened to:
+ static final Class [] constructorSignature = new Class []
+     {CombineFileSplit.class, Configuration.class, Reporter.class, Integer.class};
looks good otherwise.
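for context on the nit, a small self-contained sketch of how a signature array like this drives a reflective constructor lookup. the types here are hypothetical stand-ins, not the Hadoop classes, and the Configuration and Reporter parameters are left out to keep it short. java's getDeclaredConstructor matches the declared parameter types exactly, so the signature array states precisely what each per-file reader's constructor has to declare.

```java
import java.lang.reflect.Constructor;

public class ConstructorSignatureSketch {

    // Hypothetical stand-ins for the types named in the comment above.
    public static class InputSplit {}
    public static class CombineFileSplit extends InputSplit {}

    /** Hypothetical per-file reader whose constructor takes the narrower split type. */
    public static class ChunkReader {
        public final CombineFileSplit split;
        public final Integer index;

        public ChunkReader(CombineFileSplit split, Integer index) {
            this.split = split;
            this.index = index;
        }
    }

    // Simplified signatures: the real one also lists Configuration and Reporter.
    public static final Class<?>[] TIGHT = { CombineFileSplit.class, Integer.class };
    public static final Class<?>[] LOOSE = { InputSplit.class, Integer.class };

    /** Returns true if cls declares a constructor with exactly this parameter list. */
    public static boolean hasConstructor(Class<?> cls, Class<?>[] signature) {
        try {
            cls.getDeclaredConstructor(signature);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Instantiates the reader for the index-th file of the split via reflection. */
    public static ChunkReader open(CombineFileSplit split, int index) {
        try {
            Constructor<ChunkReader> ctor = ChunkReader.class.getDeclaredConstructor(TIGHT);
            return ctor.newInstance(split, Integer.valueOf(index));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reader constructor not usable", e);
        }
    }
}
```

in this sketch hasConstructor(ChunkReader.class, TIGHT) succeeds while the LOOSE lookup fails, because the lookup does not widen CombineFileSplit to InputSplit.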
> MultiFileInputSplit can use data locality information to create splits
> ----------------------------------------------------------------------
>
> Key: HADOOP-4565
> URL: https://issues.apache.org/jira/browse/HADOOP-4565
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
> Attachments: CombineMultiFile.patch, CombineMultiFile2.patch,
> CombineMultiFile3.patch, CombineMultiFile4.patch, CombineMultiFile5.patch,
> CombineMultiFile7.patch
>
>
> The MultiFileInputFormat takes a set of paths and creates splits based on
> file sizes. Each split contains a few files, and the splits are roughly equal
> in size. It would be efficient if we could extend this InputFormat to create
> splits such that all the blocks in one split are either node-local or
> rack-local.