Hi.
I want to use Hadoop (Map tasks only) to process a large file. The Map
should break the input file into records and feed each record to an
external EXE program. In other words, I don't want to do the processing
in Map/Reduce itself (the external EXE will do the processing); I only
want Hadoop to run many such tasks in parallel across the cluster. I
want to use Python for this.
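To make it concrete, here is roughly what I have in mind for the mapper
(just a sketch; "./my_tool" is a made-up name for the external EXE, and
it assumes each input line is already one complete record, which is
exactly the part I don't know how to achieve):

#!/usr/bin/env python
# mapper.py - rough sketch of a Hadoop Streaming mapper (map-only job).
# Assumes every input line is one complete record and that ./my_tool
# is the external EXE that does the real processing.
import subprocess
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    if not record:
        continue
    # Launch the EXE once per record, feed the record on its stdin,
    # and emit whatever it prints as the map output.
    proc = subprocess.Popen(["./my_tool"],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    out, _ = proc.communicate(record)
    sys.stdout.write(out)

I would then run this as a Streaming job with zero reduce tasks (e.g.
with -numReduceTasks 0), if I understand the Streaming options correctly.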
My file is a simple TXT file, but unfortunately each record is split
across multiple rows. One record looks like this:
> some comment bla-bla
AAGTCTGATATGCTAA
GAAGTCTTGATATGACTATA
GTTACGAAGTCTTGTTAGTTACGAAGTCTTGATA
The records follow one after another, separated by nothing more than a
newline character. Rows have arbitrary lengths, and each record contains
an arbitrary number of rows.
How can I define an InputFormat for this? What would be the best solution?
(If necessary I can write a preprocessor that merges the non-comment
rows of each record into a single row; see the sketch below.)
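Something like this is what I mean by a preprocessor (a rough sketch,
assuming, as in the example above, that every record starts with a ">"
comment line; the tab separator is just something I made up):

#!/usr/bin/env python
# preprocess.py - merge each multi-row record onto a single line.
# A record starts at a ">" comment line; the comment and the concatenated
# sequence rows are joined with a tab (the separator is arbitrary).
import sys

def flush(comment, parts, out):
    # Write the record collected so far, if any, as one line.
    if comment is not None:
        out.write(comment + "\t" + "".join(parts) + "\n")

comment, parts = None, []
for line in sys.stdin:
    line = line.rstrip("\n")
    if line.startswith(">"):
        flush(comment, parts, sys.stdout)   # emit the previous record
        comment, parts = line, []
    elif line:
        parts.append(line)
flush(comment, parts, sys.stdout)           # emit the last record

After that, I suppose the default line-oriented input would hand each
map task one complete record per line.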
Any help that points a beginner in the right direction will be much
appreciated.
Many thanks.
:)