[
https://issues.apache.org/jira/browse/HADOOP-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666834#action_12666834
]
Joydeep Sen Sarma commented on HADOOP-4565:
-------------------------------------------
Looks pretty good to me.
One question about the use of min and max. The logic for splitting at the rack
locality level makes sense: if the rack total is < max but > min, then there's a
split.
But the logic for splitting by node doesn't have this check; it always looks
for > max.
I think what might make sense is to have three thresholds:
- minNodeSplit
- minRackSplit
- maxSplit
meaning that anything more than minNodeSplit causes a split at node level (before
combining with the next node), anything more than minRackSplit causes a split at
rack level (before combining across racks), and we never go beyond maxSplit.
These three are likely to be strictly ordered, minNodeSplit < minRackSplit <
maxSplit (a smaller split is bad, but is worth it to maximize locality, and node
locality is worth more than rack locality).
I don't know how we would come up with these numbers though!
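
To make the three-threshold idea concrete, here is a minimal sketch in Java. It is not taken from any of the attached patches: the class name, the drain helper, and the nested-list input shape (racks -> nodes -> block sizes) are all hypothetical, and it assumes no single block is larger than maxSplit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the proposed thresholds; not the patch's code.
final class ThreeThresholdSketch {

    // Packs blocks greedily into splits of at most maxSplit bytes and appends
    // the split sizes to out. A trailing remainder is emitted only if it is
    // larger than minToEmit; otherwise its blocks are returned so the caller
    // can combine them at the next (coarser) locality level.
    static List<Long> drain(List<Long> blocks, long minToEmit, long maxSplit, List<Long> out) {
        List<Long> current = new ArrayList<>();
        long size = 0;
        for (long block : blocks) {
            if (size > 0 && size + block > maxSplit) {
                out.add(size);          // never go beyond maxSplit
                current.clear();
                size = 0;
            }
            current.add(block);
            size += block;
        }
        if (size > minToEmit) {
            out.add(size);
            current.clear();
        }
        return current;
    }

    // racks -> nodes -> block sizes; returns the sizes of the resulting splits.
    static List<Long> splitSizes(List<List<List<Long>>> racks,
                                 long minNodeSplit, long minRackSplit, long maxSplit) {
        List<Long> out = new ArrayList<>();
        List<Long> crossRack = new ArrayList<>();
        for (List<List<Long>> rack : racks) {
            List<Long> rackLeftover = new ArrayList<>();
            for (List<Long> nodeBlocks : rack) {
                // Node level: split before combining with the next node.
                rackLeftover.addAll(drain(nodeBlocks, minNodeSplit, maxSplit, out));
            }
            // Rack level: split before combining across racks.
            crossRack.addAll(drain(rackLeftover, minRackSplit, maxSplit, out));
        }
        // Whatever is left is combined across racks, however small.
        drain(crossRack, 0, maxSplit, out);
        return out;
    }

    public static void main(String[] args) {
        List<List<List<Long>>> racks = Arrays.asList(
            Arrays.asList(Arrays.asList(64L, 64L, 40L), Arrays.asList(10L), Arrays.asList(30L)),
            Arrays.asList(Arrays.asList(20L)));
        System.out.println(splitSizes(racks, 32, 64, 128)); // prints [128, 40, 60]
    }
}
```

In the example, the first node yields a full 128 split plus a node-local 40 (above minNodeSplit), while the small leftovers from the other nodes stay below both minimums and end up combined across racks.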
One small simplification - I think maxSize = 0 is essentially being treated as
infinity. Setting maxSize to MAXINT if it is equal to 0 will simplify the logic
in some of the iterators.
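
A rough sketch of that simplification, in Java. The names are hypothetical, and it assumes sizes are held in a long (so Long.MAX_VALUE stands in for MAXINT); the patch may store them differently.

```java
// Hypothetical helper; not part of the attached patches.
final class MaxSizeSketch {

    // Treat 0 as "no limit" once, up front, so later code needs no special case.
    static long normalizeMaxSize(long maxSize) {
        return maxSize == 0 ? Long.MAX_VALUE : maxSize;
    }

    // Overflow-safe form of "current + next > maxSize", assuming
    // 0 <= current <= maxSize and next >= 0.
    static boolean wouldExceed(long current, long next, long maxSize) {
        return next > maxSize - current;
    }
}
```

With the limit normalized, each iterator can use a single comparison instead of checking for maxSize == 0 at every step. The subtraction form matters because adding to a value near Long.MAX_VALUE would overflow.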
> MultiFileInputSplit can use data locality information to create splits
> ----------------------------------------------------------------------
>
> Key: HADOOP-4565
> URL: https://issues.apache.org/jira/browse/HADOOP-4565
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
> Attachments: CombineMultiFile.patch, CombineMultiFile2.patch,
> CombineMultiFile3.patch, CombineMultiFile4.patch, CombineMultiFile5.patch,
> CombineMultiFile7.patch, CombineMultiFile8.patch
>
>
> The MultiFileInputFormat takes a set of paths and creates splits based on
> file sizes. Each split contains a few files, and the splits are roughly equal
> in size. It would be efficient if we could extend this InputFormat to create
> splits such that all the blocks in one split are either node-local or
> rack-local.