Hi.
I want to use Hadoop (Map tasks only) to process a large file. The Map
should break the input file into records and feed each record to an
external EXE program. In other words, I don't want to do the processing
in Map/Reduce itself (the external EXE will do the processing); I only
want Hadoop to run many such tasks in parallel across the cluster. I
want to use Python for this.
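To make it concrete, here is roughly what I have in mind for the mapper
(just a sketch; "./my_tool" is a made-up name for the external EXE, and
it assumes each input line is already one complete record, which is
exactly the part I don't know how to achieve):

#!/usr/bin/env python
# mapper.py - rough sketch of a Hadoop Streaming mapper (map-only job).
# Assumes every input line is one complete record and that ./my_tool
# is the external EXE that does the real processing.
import subprocess
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    if not record:
        continue
    # Launch the EXE once per record, feed the record on its stdin,
    # and emit whatever it prints as the map output.
    proc = subprocess.Popen(["./my_tool"],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    out, _ = proc.communicate(record)
    sys.stdout.write(out)

I would then run this as a Streaming job with zero reduce tasks (e.g.
with -numReduceTasks 0), if I understand the Streaming options correctly.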
My file is a simple TXT file, but unfortunately each record is split
across multiple rows. One record looks like this:
> some comment bla-bla
AAGTCTGATATGCTAA
GAAGTCTTGATATGACTATA
GTTACGAAGTCTTGTTAGTTACGAAGTCTTGATA
The records follow one after another, separated by nothing more than a
newline character. Rows have arbitrary lengths, and each record contains
an arbitrary number of rows.
How can I define an InputFormat for this? What would be the best solution?
(If necessary I can write a preprocessor that merges the non-comment
rows of each record into a single row; see the sketch below.)
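Something like this is what I mean by a preprocessor (a rough sketch,
assuming, as in the example above, that every record starts with a ">"
comment line; the tab separator is just something I made up):

#!/usr/bin/env python
# preprocess.py - merge each multi-row record onto a single line.
# A record starts at a ">" comment line; the comment and the concatenated
# sequence rows are joined with a tab (the separator is arbitrary).
import sys

def flush(comment, parts, out):
    # Write the record collected so far, if any, as one line.
    if comment is not None:
        out.write(comment + "\t" + "".join(parts) + "\n")

comment, parts = None, []
for line in sys.stdin:
    line = line.rstrip("\n")
    if line.startswith(">"):
        flush(comment, parts, sys.stdout)   # emit the previous record
        comment, parts = line, []
    elif line:
        parts.append(line)
flush(comment, parts, sys.stdout)           # emit the last record

After that, I suppose the default line-oriented input would hand each
map task one complete record per line.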
Any help that points a beginner in the right direction will be much
appreciated.
Many thanks.
:)