If your records span a variable number of lines, then the newline character logically cannot serve as your "record delimiter". Use the character or byte sequence that actually marks the boundary between records, and read based on that.
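As a minimal sketch of that idea (plain Python, not Hadoop code): a record reader over an arbitrary delimiter works exactly like a line reader, just with a different byte sequence. The `\x1e` (ASCII record separator) delimiter below is an arbitrary example choice, not something your data is assumed to use.

```python
import io

def read_records(stream, delim=b"\x1e", bufsize=4096):
    """Yield records separated by the byte sequence `delim`, the same way a
    line reader yields newline-delimited records. Buffering across chunk
    boundaries means a delimiter split between two reads is still found."""
    buf = b""
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        buf += chunk
        while True:
            i = buf.find(delim)
            if i == -1:
                break  # no complete record buffered yet; read more
            yield buf[:i]
            buf = buf[i + len(delim):]
    if buf:
        yield buf  # trailing record with no final delimiter

# A multi-line record is no problem, because newlines are just ordinary bytes:
# list(read_records(io.BytesIO(b"a\nb\x1ec\x1e"), b"\x1e")) -> [b"a\nb", b"c"]
```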
The same logic as the one described at http://wiki.apache.org/hadoop/HadoopMapReduce for newline-delimited records applies to your files as well.

On Tue, May 21, 2013 at 11:37 AM, Darpan R <darpa...@gmail.com> wrote:
> Hi folks,
> I have a huge text file (TBs in size) containing multi-line records, and we
> are not told how many lines each record takes. One record may span 5 lines,
> another 6, another 4; the line count varies from record to record.
> Since we cannot use the default TextInputFormat, we have written our own
> InputFormat and a custom RecordReader, but the confusion is:
>
> "When splits happen, there is no guarantee that each split will contain a
> full record. Part of a record can go into split 1 and the rest into split 2."
> But this is not what we want.
>
> So, can anyone suggest how to handle this scenario so that we can guarantee
> that one full record goes into a single InputSplit?
> Any workaround or hint will be really useful.
>
> Thanks in advance.
> DR

--
Harsh J
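The split-boundary logic the wiki describes for newline-delimited records can be sketched like this (plain Python simulating the convention, not Hadoop code; offsets and the `\x1e` delimiter are made-up example values): a split that does not start at offset 0 skips the partial record at its head, because the previous split's reader finishes it, and every split reads past its own end to complete the last record it started. Applied over byte-offset splits, this yields each record exactly once without requiring records to fit inside one split.

```python
def records_for_split(data: bytes, start: int, end: int, delim: bytes = b"\n"):
    """Yield the records "owned" by the byte-range split [start, end).
    Convention: a split skips the (possibly partial) record at its start
    and reads past `end` to finish the record it began."""
    pos = start
    if start != 0:
        # Back off by len(delim) so a record beginning exactly at `start`
        # is kept: the delimiter ending at `start` is the one we skip past.
        i = data.find(delim, max(0, start - len(delim)))
        if i == -1:
            return  # no record boundary at or after this split's start
        pos = i + len(delim)
    while pos < end and pos < len(data):
        i = data.find(delim, pos)
        if i == -1:
            yield data[pos:]  # last record of the file, no trailing delimiter
            return
        yield data[pos:i]
        pos = i + len(delim)

# Four records split at arbitrary offsets; each record comes out exactly once,
# even though the byte ranges cut records in the middle:
data = b"aa\x1ebbbb\x1ecc\x1eddddd\x1e"
out = []
for s, e in [(0, 6), (6, 12), (12, 17)]:
    out += list(records_for_split(data, s, e, b"\x1e"))
# out == [b"aa", b"bbbb", b"cc", b"ddddd"]
```

This is why, in practice, you do not need to force a full record into a single InputSplit: the splits stay byte-aligned and the RecordReaders cooperate at the boundaries instead.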