Re: My input is "plain text" but each record is split on multiple lines - I need help defining the InputFormat

Chuck Lam Thu, 30 Jul 2009 15:52:14 -0700

What you need is to subclass TextInputFormat and override
getRecordReader() to return your own RecordReader. Your RecordReader
will be similar to LineRecordReader, so you can look at that class's
source code for inspiration. The main difference is that you're
looking for record boundaries at consecutive line breaks (i.e. an
empty line) rather than at single line breaks.
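
To make the boundary handling concrete, here is a rough sketch of the
logic such a RecordReader would need, written in Python only for
brevity (the real class would be Java). It assumes records are
separated by empty lines; read_split and its arguments are made-up
names, not Hadoop API. A reader that starts mid-file skips ahead to
the first separator it finds, and every reader finishes the record it
is in even if that means reading past the end of its split.

```python
import io


def read_split(data: bytes, start: int, end: int):
    """Return the blank-line-separated records owned by split [start, end).

    Hypothetical helper: `data` stands in for the whole file, and
    `start`/`end` for the byte range a split would cover.
    """
    stream = io.BytesIO(data)
    stream.seek(start)
    # A split that begins mid-file cannot know whether it landed inside a
    # record, so it ignores everything up to the first blank line it sees.
    synced = (start == 0)
    if start != 0:
        stream.readline()  # drop the (possibly partial) line under `start`

    records, current = [], []
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:  # end of file
            break
        if line.strip() == b"":  # blank line: record separator
            if current:
                records.append(b"".join(current))
                current = []
            if pos > end:
                # Records after this separator belong to the next split.
                break
            synced = True
        elif synced:
            current.append(line)  # may run past `end` to finish a record
    if current:
        records.append(b"".join(current))
    return records


if __name__ == "__main__":
    sample = b"a1\na2\n\nb1\nb2\nb3\n\nc1\n"
    # Cut the sample in the middle of record "b" and read both halves.
    print(read_split(sample, 0, 11))
    print(read_split(sample, 11, len(sample)))
```

With this rule each record is read by exactly one split, wherever the
cut falls, because a record goes to the split that contains the blank
line just before it.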

However, I think you should also consider Hadoop Streaming as an
alternative. You can write your mapper in Python. Under Streaming,
your mapper already reads its input line by line, so you can just
buffer the lines seen since the last empty line and send them to
your exe program whenever you hit an empty line again.
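
A minimal sketch of such a mapper, assuming records are separated by
empty lines and that the external program reads one record on stdin
and writes its result to stdout. The function names are made up for
illustration, and the external command is taken from the mapper's own
command-line arguments.

```python
#!/usr/bin/env python3
import subprocess
import sys


def records(lines):
    """Group lines into records, treating empty lines as separators."""
    buffered = []
    for line in lines:
        if line.strip() == "":
            if buffered:
                yield "".join(buffered)
                buffered = []
        else:
            buffered.append(line)
    if buffered:  # last record may not be followed by an empty line
        yield "".join(buffered)


def run_external(record, command):
    """Feed one record to the external program and return its stdout."""
    result = subprocess.run(
        command, input=record, capture_output=True, text=True, check=True
    )
    return result.stdout


def main(command):
    # Streaming hands the mapper its input on stdin, one line at a time.
    for record in records(sys.stdin):
        sys.stdout.write(run_external(record, command))


# The external command (hypothetical) is passed as arguments to the mapper.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

For a map-only job you would also set the number of reduce tasks to
zero when submitting the Streaming job.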

The main drawback of the Streaming approach is that you'll have
records that cross block boundaries. Your Streaming mapper can't read
past the end of its own block, so you'll have to throw out the
last/first record of each block. Depending on your application,
that's either no big deal or a deal breaker.

Some docs:
http://hadoop.apache.org/common/docs/r0.20.0/streaming.html

Out of curiosity, it's pretty clear that you're processing DNA data.
Mind sharing some background on your application? ;) I've been pretty
curious what people are using Hadoop for in the biology space.

On Tue, Jul 28, 2009 at 2:25 PM, CubicDesign<[email protected]> wrote:
> Hi.
>
> I want to use Hadoop (Map tasks only) to process a large file. The Map
> should break the input file into records and feed each record to an external
> EXE program. In other words I don't want to do processing with Map/Reduce
> (the external EXE will do the processing) but only to use Hadoop to run
> multiple jobs in parallel over the cluster. I want to use Python for this.
>
>
> My file is a simple TXT file but unfortunately one record is split across
> multiple rows. One record looks like this:
>
>> some comment bla-bla
> AAGTCTGATATGCTAA
> GAAGTCTTGATATGACTATA
> GTTACGAAGTCTTGTTAGTTACGAAGTCTTGATA
>
> There are multiple records one after another, separated by nothing other
> than an enter character. Rows have arbitrary lengths and there is an
> arbitrary number of rows in each record.
> How can I define an InputFormat for this? Which is the best solution?
> (If necessary I can write a preprocessor that will merge the non-comment
> rows in a single row.)
>
>
> Any help that will point a beginner in the right direction would be very
> much appreciated.
> Many thanks.
> :)
>
>
>
