Hi.

I want to use Hadoop (Map tasks only) to process a large file. The Map should break the input file into records and feed each record to an external EXE program. In other words, I don't want to do the processing with Map/Reduce (the external EXE will do the processing); I only want to use Hadoop to run many tasks in parallel over the cluster. I want to use Python for this.
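
To make the intent concrete, here is a rough sketch of the Streaming mapper I have in mind (it assumes each input line has already been turned into one complete record, and ./my_tool is just a placeholder name for the external EXE):

#!/usr/bin/env python
# Rough sketch of a Hadoop Streaming mapper: each input line is assumed to be
# one complete record, and ./my_tool is a placeholder for the external EXE
# that processes a single record read from its stdin.
import subprocess
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    if not record:
        continue
    # Hand the record to the external program and emit whatever it prints.
    proc = subprocess.Popen(["./my_tool"], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, universal_newlines=True)
    out, _ = proc.communicate(record + "\n")
    sys.stdout.write(out)

My understanding is that running this under Hadoop Streaming with zero reduce tasks would give me the map-only behaviour I am after.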


My file is a simple TXT file, but unfortunately one record is split across multiple rows. One record looks like this:

> some comment bla-bla
AAGTCTGATATGCTAA
GAAGTCTTGATATGACTATA
GTTACGAAGTCTTGTTAGTTACGAAGTCTTGATA

There are multiple records, one after another, separated by nothing other than a newline (enter) character. Rows have arbitrary lengths and there is an arbitrary number of rows in each record.
How can I define an InputFormat for this? What would be the best solution?
(If necessary, I can write a preprocessor that merges the non-comment rows of each record into a single row; a rough sketch is below.)
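
Something like this is what I have in mind for the preprocessor (a minimal sketch, assuming each record starts with a '>' comment line as in the example above; it joins the sequence rows and writes one tab-separated line per record):

#!/usr/bin/env python
# Rough preprocessor sketch: read the multi-row records on stdin and write
# one line per record (comment, TAB, concatenated sequence rows) to stdout.
import sys

def flush(comment, rows):
    # Emit the record collected so far, if any.
    if comment is not None:
        sys.stdout.write(comment + "\t" + "".join(rows) + "\n")

comment, rows = None, []
for line in sys.stdin:
    line = line.rstrip("\n")
    if line.startswith(">"):      # a new record begins at a comment line
        flush(comment, rows)
        comment, rows = line, []
    elif line:                    # a sequence row belonging to the current record
        rows.append(line)
flush(comment, rows)              # emit the last record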


Any help pointing a beginner in the right direction would be much appreciated.
Many thanks.
:)

