[ https://issues.apache.org/jira/browse/HADOOP-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651097#action_12651097 ]
Joydeep Sen Sarma commented on HADOOP-4565:
-------------------------------------------

a few other comments:

- do we think this patch totally supersedes multifileinputformat/multifilesplit?
- if not - should CombineFileSplit extend MultiFileSplit? (the argument being that in that case CombineFileRecordReader can work for both MultiFileSplit and CombineFileSplit). In general - this is not going to be the last implementation of a multifilesplit/format - so it would be good to have the surrounding classes (recordreaders etc.) built in a way that lets more implementations of a multifilesplit be easily accommodated.
- CombineFileInputFormat does not implement getRecordReader (it throws an exception) - shouldn't it just be an abstract class then?
- one of the bigger problems with MultiFileInputFormat was the lack of concrete implementations. I think it makes sense to provide a full implementation of combinefileinputformat, at least for text files (and perhaps sequencefiles), that lay users can use without writing code.
- as an aside - i don't understand now why sorting racks/nodes by number of blocks matters at all. for each rack/node - one would coalesce blocks into splits. what overflows goes into a miscellaneous bucket. this protocol does not depend on walking through the racks/nodes in a particular order. what seems more important is that overflow blocks are first combined by rack (but i am confused about the whole rack vs. node thing)

> MultiFileInputSplit can use data locality information to create splits
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-4565
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4565
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: CombineMultiFile.patch, CombineMultiFile2.patch, CombineMultiFile3.patch
>
>
> The MultiFileInputFormat takes a set of paths and creates splits based on file sizes. Each split contains a few files, and the splits are roughly equal in size. It would be efficient if we could extend this InputFormat to create splits such that all the blocks in one split are either node-local or rack-local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
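
To make the coalescing scheme from the last comment concrete, here is a minimal sketch in plain Java. It is not the code in the attached patches and it uses no Hadoop classes: `CombineSketch`, `Block`, `combine`, `pack` and the `splitSize` threshold are all hypothetical names chosen for illustration. It also assumes a single locality level (rack only). Blocks are packed per rack, whatever does not fill a split on its own rack goes into a shared overflow list, and the overflow is then packed across racks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombineSketch {

    /** Hypothetical block descriptor; not a Hadoop class. */
    static final class Block {
        final String id;
        final String rack;
        final long length;

        Block(String id, String rack, long length) {
            this.id = id;
            this.rack = rack;
            this.length = length;
        }
    }

    /** Returns the block ids of each split; splitSize is an assumed threshold. */
    static List<List<String>> combine(List<Block> blocks, long splitSize) {
        // Group blocks by rack, preserving first-seen order.
        Map<String, List<Block>> byRack = new LinkedHashMap<>();
        for (Block b : blocks) {
            byRack.computeIfAbsent(b.rack, r -> new ArrayList<>()).add(b);
        }

        List<List<String>> splits = new ArrayList<>();
        List<Block> overflow = new ArrayList<>();

        // Pass 1: rack-local splits; the unfilled remainder of each rack overflows.
        for (List<Block> rackBlocks : byRack.values()) {
            overflow.addAll(pack(rackBlocks, splitSize, splits));
        }

        // Pass 2: overflow blocks from all racks are combined together.
        List<Block> rest = pack(overflow, splitSize, splits);
        if (!rest.isEmpty()) {
            splits.add(ids(rest)); // final, possibly undersized, split
        }
        return splits;
    }

    /** Appends full splits to out and returns the blocks that did not fill one. */
    private static List<Block> pack(List<Block> blocks, long splitSize,
                                    List<List<String>> out) {
        List<Block> current = new ArrayList<>();
        long size = 0;
        for (Block b : blocks) {
            current.add(b);
            size += b.length;
            if (size >= splitSize) {
                out.add(ids(current));
                current = new ArrayList<>();
                size = 0;
            }
        }
        return current;
    }

    private static List<String> ids(List<Block> blocks) {
        List<String> result = new ArrayList<>();
        for (Block b : blocks) {
            result.add(b.id);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
            new Block("a1", "rackA", 64),
            new Block("a2", "rackA", 64),
            new Block("a3", "rackA", 64),
            new Block("b1", "rackB", 64));
        // prints [[a1, a2], [a3, b1]]
        System.out.println(combine(blocks, 128));
    }
}
```

In this sketch the set of rack-local splits is the same whichever rack is visited first; only the order in which leftovers enter the overflow list changes. Adding a node-level pass ahead of the rack-level one would follow the same pattern, with node leftovers feeding the rack pass.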