What you need is to subclass TextInputFormat and override getRecordReader() to return your own RecordReader. Your RecordReader will be similar to LineRecordReader, so you can look at that class's source code for inspiration. The main difference is that you're looking for record boundaries at consecutive line breaks (i.e. an empty line) rather than at single line breaks.
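
To make the boundary handling concrete, here is a rough sketch of the logic such a RecordReader would need, written in Python purely as an illustration. It is not Hadoop code: the function name records_in_split and the idea of holding the file in memory as bytes are assumptions made for the example, and it assumes records are separated by exactly one empty line. The rule it follows is the one line-oriented readers commonly use: a record belongs to the split in which it starts, so a reader skips ahead to the first record start at or after its own offset and reads past its end offset to finish the last record.

```python
def records_in_split(data, start, end, delim=b"\n\n"):
    """Yield the records that *start* inside the byte range [start, end).

    `data` is the whole file as bytes (an assumption for this sketch; a real
    RecordReader would seek and read from a stream instead).
    """
    if start == 0:
        pos = 0
    else:
        # Back up by the delimiter length so a delimiter that straddles
        # `start` is still found, then begin right after it.
        hit = data.find(delim, max(start - len(delim), 0))
        if hit == -1:
            return  # no record begins at or after `start`
        pos = hit + len(delim)

    while pos < end and pos < len(data):
        hit = data.find(delim, pos)
        stop = len(data) if hit == -1 else hit
        # The last record of a split may run past `end`; that is intended.
        record = data[pos:stop].strip(b"\n")
        if record:
            yield record
        pos = stop + len(delim)
```

Calling it once per split, e.g. records_in_split(data, 0, cut) and records_in_split(data, cut, len(data)), should return every record exactly once no matter where cut falls, which is the property the custom RecordReader has to preserve.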
However, I think you should also consider Hadoop Streaming as an alternative, since it lets you write your mapper in Python. Under Streaming, your mapper already reads its input line by line, so you can just keep a buffer of the lines seen since the last empty line and send it to your exe program whenever you hit an empty line again.

The main drawback is that you'll have records that cross block boundaries. A Streaming mapper only receives the lines of its own block and can't read across the boundary, so you'll have to throw out the last/first record of each block. Depending on your application that's either no big deal or a deal breaker.

Some docs: http://hadoop.apache.org/common/docs/r0.20.0/streaming.html

Out of curiosity: it's pretty clear that you're processing DNA data. Would you mind sharing some background on your application? ;) I've been pretty curious what people are using Hadoop for in the biology space.

On Tue, Jul 28, 2009 at 2:25 PM, CubicDesign<[email protected]> wrote:
> Hi.
>
> I want to use Hadoop (Map tasks only) to process a large file. The Map
> should break the input file into records and feed each record to an external
> EXE program. In other words I don't want to do processing with Map/Reduce
> (the external EXE will do the processing) but only to use Hadoop to run
> multiple jobs in parallel over the cluster. I want to use Python for this.
>
>
> My file is a simple TXT file but unfortunately one record is split on
> multiple rows. One record is looking like this:
>
>> some comment bla-bla
> AAGTCTGATATGCTAA
> GAAGTCTTGATATGACTATA
> GTTACGAAGTCTTGTTAGTTACGAAGTCTTGATA
> There are multiple records one after each other, separated by nothing else
> than an enter character. Rows have arbitrary lengths and there is an
> arbitrary number of rows in each record.
> How can I define an InputFormat for this? Which is the best solution?
> (If necessary I can write a preprocessor that will merge the non-comment
> rows in a single row.)
>
>
> Any help that will point a beginner into the right direction will be very
> appreciated.
> Many thanks.
> :)
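
P.S. For the Streaming route suggested in the reply, a mapper along these lines might be a starting point. Treat it as a minimal sketch under stated assumptions, not a tested Hadoop job: it assumes records are separated by an empty line, that the external program reads one record on stdin and writes its result to stdout, and that the program's command line is passed as arguments to the script. The names iter_records and run_external are made up for the example. It does nothing about records that straddle a block boundary, so the first and last record of each block still need the special handling mentioned above.

```python
import subprocess
import sys


def iter_records(lines):
    """Group an iterable of text lines into records split on empty lines."""
    buf = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            buf.append(line)
        elif buf:
            # An empty line closes the record collected so far.
            yield buf
            buf = []
    if buf:
        # Input ended without a trailing empty line.
        yield buf


def run_external(record, command):
    """Feed one record to the external program and return its stdout.

    `command` is an argument list for the program (hypothetical here).
    """
    proc = subprocess.run(
        command,
        input="\n".join(record) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout


def main(argv):
    command = argv[1:]
    for record in iter_records(sys.stdin):
        sys.stdout.write(run_external(record, command))


# Only run as a mapper when an external command was actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Starting one process per record keeps the sketch simple; if the external program is slow to start, batching several records per invocation may be worth considering.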
