If your records span a variable number of lines, then the newline character logically cannot serve as your "record delimiter". Use the character or byte sequence that actually marks the boundary between records, and read based on that.
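As a minimal sketch of that idea (plain Python, not Hadoop code): a record reader over an arbitrary delimiter works exactly like a line reader, just with a different byte sequence. The `\x1e` (ASCII record separator) delimiter below is an arbitrary example choice, not something your data is assumed to use.

```python
import io

def read_records(stream, delim=b"\x1e", bufsize=4096):
    """Yield records separated by the byte sequence `delim`, the same way a
    line reader yields newline-delimited records. Buffering across chunk
    boundaries means a delimiter split between two reads is still found."""
    buf = b""
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        buf += chunk
        while True:
            i = buf.find(delim)
            if i == -1:
                break  # no complete record buffered yet; read more
            yield buf[:i]
            buf = buf[i + len(delim):]
    if buf:
        yield buf  # trailing record with no final delimiter

# A multi-line record is no problem, because newlines are just ordinary bytes:
# list(read_records(io.BytesIO(b"a\nb\x1ec\x1e"), b"\x1e")) -> [b"a\nb", b"c"]
```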
The same logic as the one described at http://wiki.apache.org/hadoop/HadoopMapReduce for newline-delimited records applies to your files as well.

On Tue, May 21, 2013 at 11:37 AM, Darpan R <darpa...@gmail.com> wrote:
> Hi folks,
> I have a huge text file (TBs in size) containing multi-line records, and we
> are not told how many lines each record takes. One record may span 5 lines,
> another 6, another 4; the line count varies from record to record.
> Since we cannot use the default TextInputFormat, we have written our own
> InputFormat and a custom RecordReader, but the confusion is:
>
> "When splits happen, there is no guarantee that each split will contain a
> full record. Part of a record can go into split 1 and the rest into split 2."
> But this is not what we want.
>
> So, can anyone suggest how to handle this scenario so that we can guarantee
> that one full record goes into a single InputSplit?
> Any workaround or hint will be really useful.
>
> Thanks in advance.
> DR

--
Harsh J
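The split-boundary logic the wiki describes for newline-delimited records can be sketched like this (plain Python simulating the convention, not Hadoop code; offsets and the `\x1e` delimiter are made-up example values): a split that does not start at offset 0 skips the partial record at its head, because the previous split's reader finishes it, and every split reads past its own end to complete the last record it started. Applied over byte-offset splits, this yields each record exactly once without requiring records to fit inside one split.

```python
def records_for_split(data: bytes, start: int, end: int, delim: bytes = b"\n"):
    """Yield the records "owned" by the byte-range split [start, end).
    Convention: a split skips the (possibly partial) record at its start
    and reads past `end` to finish the record it began."""
    pos = start
    if start != 0:
        # Back off by len(delim) so a record beginning exactly at `start`
        # is kept: the delimiter ending at `start` is the one we skip past.
        i = data.find(delim, max(0, start - len(delim)))
        if i == -1:
            return  # no record boundary at or after this split's start
        pos = i + len(delim)
    while pos < end and pos < len(data):
        i = data.find(delim, pos)
        if i == -1:
            yield data[pos:]  # last record of the file, no trailing delimiter
            return
        yield data[pos:i]
        pos = i + len(delim)

# Four records split at arbitrary offsets; each record comes out exactly once,
# even though the byte ranges cut records in the middle:
data = b"aa\x1ebbbb\x1ecc\x1eddddd\x1e"
out = []
for s, e in [(0, 6), (6, 12), (12, 17)]:
    out += list(records_for_split(data, s, e, b"\x1e"))
# out == [b"aa", b"bbbb", b"cc", b"ddddd"]
```

This is why, in practice, you do not need to force a full record into a single InputSplit: the splits stay byte-aligned and the RecordReaders cooperate at the boundaries instead.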