The way an input reader works with a file split is that the reader is responsible for finding the first record-start boundary in the input split, and for stopping at the first record-end boundary at or after the end of the input split.
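A minimal sketch of that pattern, against the old mapred API; the SegmentRecordReader name and the two abstract helpers are placeholders I made up here, since the boundary detection itself depends entirely on how your records are laid out:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

/**
 * Skip forward to the first record boundary at or after the split start, then
 * keep returning records until the first boundary at or after the split end.
 * The two abstract helpers are the only format-specific pieces.
 */
public abstract class SegmentRecordReader
    implements RecordReader<LongWritable, BytesWritable> {

  private final FSDataInputStream in;
  private final long start;  // byte offset where this split begins
  private final long end;    // no new record is started at or past this offset
  private long pos;          // current position in the file

  public SegmentRecordReader(JobConf job, FileSplit split) throws IOException {
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(job);
    in = fs.open(file);
    start = split.getStart();
    end = start + split.getLength();
    // The previous split's reader owns any record that straddles our start,
    // so skip forward to the first boundary at or after it.
    pos = (start == 0) ? 0 : findNextBoundary(in, start);
    in.seek(pos);
  }

  /** Return the offset of the first record boundary at or after 'offset'. */
  protected abstract long findNextBoundary(FSDataInputStream in, long offset)
      throws IOException;

  /** Read one whole record starting at the current position; return the offset just past it. */
  protected abstract long readOneRecord(FSDataInputStream in, BytesWritable value)
      throws IOException;

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    if (pos >= end) {
      return false;                 // the reader of the next split picks up from here
    }
    key.set(pos);
    pos = readOneRecord(in, value); // may read past 'end'; that record is still ours
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return pos; }
  public void close() throws IOException { in.close(); }
  public float getProgress() {
    return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
  }
}

The reader of split N may read past its own end to finish the last record it started; the reader of split N+1 skips the partial record at its start, so every record is processed exactly once.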
In your case, if your image data is structured in the file so that the segment starts and ends can be found in the input data stream, providing a reader that can handle splits becomes as simple as that detection. A rough sketch of the quadrant-style ImageSplit discussed in the quoted thread is at the bottom of this mail.

On Tue, Dec 22, 2009 at 8:27 AM, Cao Kang <[email protected]> wrote:

> Hi Amareshwari,
> Thanks for your replies. They are really good suggestions.
> But I probably have one more remaining question. About HDFS: it splits the
> input file into 64 MB blocks sequentially, by input file bytes, right? But
> that works against the idea of splitting the image into sub-images by its
> four corners. Is there a way to configure HDFS to make it compatible with
> the image split?
> Many thanks!
>
> Cao
>
> On Mon, Dec 21, 2009 at 11:43 PM, Amareshwari Sri Ramadasu
> <[email protected]> wrote:
>
> > Hi Cao,
> >
> > My answers are inline.
> >
> > On 12/21/09 8:42 PM, "Cao Kang" <[email protected]> wrote:
> >
> > Hi Amareshwari,
> > Thanks for your reply.
> > But another question is: where and how should I define the split
> > boundaries? Should I define them in the FileSplit constructor?
> >
> > I don't think you can extend FileSplit directly. I think you should write
> > your own split, say ImageSplit, in which you can represent your image
> > fully. For example, FileSplit represents the split using an offset and a
> > length; you may need all four co-ordinates of your image.
> >
> > Furthermore, as far as I have seen, all the examples use a LongWritable
> > to represent the offset of the split in the input file. What if the
> > split is not sequential?
> >
> > Yes. FileSplit is used for representing text data.
> >
> > For example, in the image split, the sub-image byte arrays are not
> > sequential in the input image. The byte split looks like this:
> >
> > |---------------|---------------|
> > |               |               |
> > |       1       |       2       |
> > |               |               |
> > |---------------|---------------|
> > |               |               |
> > |       3       |       4       |
> > |               |               |
> > |---------------|---------------|
> >
> > Each sub-image split will consist of such an array. Where and how should
> > this be defined in the InputFormat? Many thanks.
> >
> > In your InputFormat, you should define the getSplits() method, which
> > returns your ImageSplits.
> >
> > Thanks
> > Amareshwari
> >
> > On Mon, Dec 21, 2009 at 6:37 AM, Amareshwari Sri Ramadasu
> > <[email protected]> wrote:
> >
> > > You should implement your split to represent the split information.
> > > Then you should implement getSplits() in your InputFormat to get the
> > > splits from your input, which divides the whole input into chunks.
> > > Here, each split will be given to a map task.
> > > You should also define a RecordReader, which reads records from the
> > > split. The map task processes one record at a time.
> > >
> > > See
> > > http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Job+Input
> > >
> > > Thanks
> > > Amareshwari
> > >
> > > On 12/21/09 2:22 AM, "Cao Kang" <[email protected]> wrote:
> > >
> > > Hi,
> > > I have spent several days on a customized file input format in Hadoop.
> > > Basically, we need to split one giant square-shaped image (.tif) into
> > > four square-shaped smaller images. Where does the real split happen?
> > > Should I implement it in the "getSplits" function or in the "next"
> > > function? It's quite confusing.
> > > Does anyone know, or can anyone provide some examples of it? Thanks.
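As promised above, here is a rough, untested sketch of the ImageSplit idea from the quoted thread, again against the old mapred API. The split fields, the fixed 2x2 layout, and the image.* configuration property names are my own inventions for illustration only:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** Describes one rectangular sub-image by pixel coordinates instead of byte offset/length. */
public class ImageSplit implements InputSplit {
  private String file;              // path of the whole .tif
  private int x, y, width, height;  // pixel rectangle of this quadrant

  public ImageSplit() {}            // no-arg constructor needed for deserialization
  public ImageSplit(String file, int x, int y, int width, int height) {
    this.file = file; this.x = x; this.y = y; this.width = width; this.height = height;
  }

  public long getLength() { return (long) width * height; }       // rough size hint only
  public String[] getLocations() throws IOException { return new String[0]; } // no locality hints

  public void write(DataOutput out) throws IOException {          // shipped to the task
    Text.writeString(out, file);
    out.writeInt(x); out.writeInt(y); out.writeInt(width); out.writeInt(height);
  }
  public void readFields(DataInput in) throws IOException {       // rebuilt on the task side
    file = Text.readString(in);
    x = in.readInt(); y = in.readInt(); width = in.readInt(); height = in.readInt();
  }
}

// In its own file:
/** One split per quadrant; the HDFS block layout is not consulted at all here. */
public abstract class ImageInputFormat implements InputFormat<Text, BytesWritable> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Image path and pixel dimensions are assumed to be passed in the job
    // configuration under made-up property names.
    String image = job.get("image.input.path");
    int w = job.getInt("image.width", 0);
    int h = job.getInt("image.height", 0);
    int hw = w / 2, hh = h / 2;
    return new InputSplit[] {
        new ImageSplit(image, 0,  0,  hw, hh),   // 1: top-left
        new ImageSplit(image, hw, 0,  hw, hh),   // 2: top-right
        new ImageSplit(image, 0,  hh, hw, hh),   // 3: bottom-left
        new ImageSplit(image, hw, hh, hw, hh)    // 4: bottom-right
    };
  }

  // Decoding a quadrant out of the .tif is entirely format-specific, so the
  // reader is left abstract in this sketch.
  public abstract RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException;
}

The point to notice, regarding the HDFS question: an InputSplit is just a serializable description of the work for one map task. It does not have to line up with HDFS's 64 MB blocks, because the record reader can open the file and seek to whatever bytes its quadrant needs, at the cost of losing some data locality for blocks stored on other nodes.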
--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com - a community for Hadoop Professionals
