Re: File Split

Cao Kang Mon, 21 Dec 2009 07:13:23 -0800

Hi Amareshwari,
Thanks for your reply.
But I have another question: where and how should I define the split boundaries?
Should I define them in the FileSplit constructor?
Furthermore, all the examples I have seen use LongWritable to
represent the offset of the split in the input file. What if the split is
not sequential? For example, in the image split, the byte arrays of the
sub-images are not contiguous in the input image. The splits look like this:


|---------------|---------------|
|               |               |
|       1       |       2       |
|               |               |
|---------------|---------------|
|               |               |
|       3       |       4       |
|               |               |
|---------------|---------------|

Each sub-image split will consist of a byte array. Where and how should this
be defined in the InputFormat? Many thanks.
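
To make the layout concrete, here is a minimal sketch, assuming the pixel
data is stored uncompressed in row-major order (a real .tif may be striped,
tiled, or compressed, so this is only an illustration). QuadrantLayout and
its methods are hypothetical names, not Hadoop APIs. It lists the byte
ranges that make up one of the four sub-images numbered in the diagram above.

```java
/**
 * Hypothetical helper (not part of Hadoop). Assumes a raw, uncompressed,
 * row-major raster with even width and height; offsets are relative to
 * the first pixel byte, so any file header is ignored here.
 */
final class QuadrantLayout {

    /** Length in bytes of each row segment belonging to a quadrant. */
    static long segmentLength(int width, int bytesPerPixel) {
        return (long) (width / 2) * bytesPerPixel;
    }

    /**
     * Start offsets of the row segments of one quadrant.
     * Quadrants are numbered 1..4 as in the diagram:
     * 1 = top-left, 2 = top-right, 3 = bottom-left, 4 = bottom-right.
     */
    static long[] segmentOffsets(int width, int height, int bytesPerPixel, int quadrant) {
        if (quadrant < 1 || quadrant > 4) {
            throw new IllegalArgumentException("quadrant must be 1..4");
        }
        if (width % 2 != 0 || height % 2 != 0) {
            throw new IllegalArgumentException("width and height must be even");
        }
        int halfW = width / 2;
        int halfH = height / 2;
        int firstRow = (quadrant <= 2) ? 0 : halfH;      // top or bottom half
        int firstCol = (quadrant % 2 == 1) ? 0 : halfW;  // left or right half
        long rowBytes = (long) width * bytesPerPixel;

        long[] offsets = new long[halfH];
        for (int r = 0; r < halfH; r++) {
            // One segment per image row: skip whole rows, then skip columns.
            offsets[r] = (firstRow + r) * rowBytes + (long) firstCol * bytesPerPixel;
        }
        return offsets;
    }
}
```

Under these assumptions each sub-image is a list of short, separated byte
ranges (one per row), so a single start offset plus length does not
describe it. A custom split would presumably have to carry something like
the quadrant number and the image dimensions instead.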


Cao

On Mon, Dec 21, 2009 at 6:37 AM, Amareshwari Sri Ramadasu <
[email protected]> wrote:

> You should implement your split to represent the split information. Then
> you should implement getSplits in InputFormat to get the splits from your
> input, which divides the whole input into chunks. Here, each split will be
> given to a map task.
> You should also define a RecordReader, which reads records from the split.
> The map task processes one record at a time.
>
> See
> http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Job+Input
>
> Thanks
> Amareshwari
>
> On 12/21/09 2:22 AM, "Cao Kang" <[email protected]> wrote:
>
> Hi,
> I have spent several days on a custom file input format in Hadoop.
> Basically, we need to split one giant square image (.tif) into four
> smaller square images. Where does the split really happen? Should I
> implement it in the "getSplits" function or in the "next" function? It's
> quite confusing.
> Does anyone know, or can anyone provide some examples? Thanks.
>
>
