[ 
https://issues.apache.org/jira/browse/HADOOP-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651097#action_12651097
 ] 

Joydeep Sen Sarma commented on HADOOP-4565:
-------------------------------------------

a few other comments:

- do we think this patch totally supersedes 
MultiFileInputFormat/MultiFileSplit? 
- if not - should CombineFileSplit extend MultiFileSplit? (the argument being 
that in that case CombineFileRecordReader can work for both MultiFileSplit and 
CombineFileSplit).

  In general - this is not going to be the last implementation of a 
multi-file split/format - so it would be good to have the surrounding classes 
(record readers etc.) built in a way that easily accommodates more 
implementations of a multi-file split.

- CombineFileInputFormat does not implement getRecordReader (throws an 
exception) - shouldn't it just be an abstract class then?

- one of the bigger problems with MultiFileInputFormat was the lack of concrete 
implementations. i think it makes sense to provide at least a full 
implementation of CombineFileInputFormat for text files (and perhaps sequence 
files) that lay users can use without writing code.

- as an aside - i don't understand why sorting racks/nodes by number of 
blocks matters at all. for each rack/node - one would coalesce blocks into 
splits. whatever overflows goes into a miscellaneous bucket. this protocol does 
not depend on walking through the racks/nodes in a particular order. what seems 
more important is that overflow blocks are first combined by rack (but i am 
confused about the whole rack vs. node thing)
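
to make the protocol described above concrete, here is a minimal sketch in Java. it is not taken from the patch: the class CoalesceSketch, the Block type and the pack/coalesce methods are all hypothetical names, and it assumes each block is listed under exactly one rack (it ignores replicas, which would make a block show up under several racks or nodes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the patch.
class CoalesceSketch {

    // Stand-in for an HDFS block: just an id and a length in bytes.
    static final class Block {
        final String id;
        final long length;

        Block(String id, long length) {
            this.id = id;
            this.length = length;
        }
    }

    // Greedily packs blocks into groups of at least splitSize bytes.
    // Full groups are appended to 'splits'; the trailing partial group
    // (possibly empty) is returned to the caller.
    static List<Block> pack(List<Block> blocks, long splitSize,
                            List<List<Block>> splits) {
        List<Block> current = new ArrayList<>();
        long size = 0;
        for (Block b : blocks) {
            current.add(b);
            size += b.length;
            if (size >= splitSize) {
                splits.add(current);
                current = new ArrayList<>();
                size = 0;
            }
        }
        return current;
    }

    // Coalesces blocks rack by rack; what does not fill a split on its own
    // rack overflows into a miscellaneous bucket that is packed at the end.
    static List<List<Block>> coalesce(Map<String, List<Block>> blocksByRack,
                                      long splitSize) {
        List<List<Block>> splits = new ArrayList<>();
        List<Block> misc = new ArrayList<>();
        for (List<Block> rackBlocks : blocksByRack.values()) {
            misc.addAll(pack(rackBlocks, splitSize, splits));
        }
        List<Block> tail = pack(misc, splitSize, splits);
        if (!tail.isEmpty()) {
            splits.add(tail); // last, possibly undersized, split
        }
        return splits;
    }
}
```

under the one-rack-per-block assumption, each rack's splits are computed independently of every other rack, so the rack-local splits come out the same whichever order the racks are visited in; only the grouping inside the miscellaneous bucket can change.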

> MultiFileInputSplit can use data locality information to create splits
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-4565
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4565
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: CombineMultiFile.patch, CombineMultiFile2.patch, 
> CombineMultiFile3.patch
>
>
> The MultiFileInputFormat takes a set of paths and creates splits based on 
> file sizes. Each split contains a few files and the splits are roughly equal 
> in size. It would be efficient if we can extend this InputFormat to create 
> splits such that all the blocks in one split are either node-local or 
> rack-local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
