Hi all,
I want to use Hadoop Streaming for some text processing on documents
like:
<doc id=... ... ... >
text text
text
...
</doc>
It is just XML-like notation, not real XML files.
I have to work on the text enclosed between <doc> tags, so I implemented
an InputFormat (extending FileInputFormat) with a RecordReader that
returns the file position as the key and the needed text as the value.
Here is the next() method; I'm pretty sure it works as expected:
/** Read a text block. */
public synchronized boolean next(LongWritable key, Text value)
    throws IOException
{
    if (pos >= end)
        return false;
    key.set(pos); // key is the file position
    buffer.reset();
    long bytesRead = readBlock(startTag, endTag); // put the needed text in the buffer
    if (bytesRead == 0)
        return false;
    pos += bytesRead;
    value.set(buffer.getData(), 0, buffer.getLength());
    return true;
}
But when I test it, using "cat" as the mapper and TextOutputFormat as
the OutputFormat, I get one key/value pair per line: for every text
block, the first tuple has the file position as key and the text as
value, and the remaining tuples have text as key and no value, i.e.:
file_pos / first_line
second_line /
third_line /
...
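For reference, this looks consistent with how Streaming frames records: each record reaches the mapper as the key, a tab, the value, and a newline, and each line of mapper output is split back into key/value at the first tab. A stand-alone sketch of that framing (class and helper names are mine, not Hadoop API; "cat" as mapper just echoes its input):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of Streaming's line-oriented record framing.
// Not Hadoop code: StreamingFramingDemo, frame(), and parse() are
// hypothetical names for illustration only.
public class StreamingFramingDemo {

    // Serialize one record the way Streaming feeds it to the mapper:
    // "key <TAB> value <NEWLINE>".
    static String frame(long key, String value) {
        return key + "\t" + value + "\n";
    }

    // Re-parse mapper output line by line; a line with no tab
    // becomes a key with an empty value.
    static List<String[]> parse(String mapperOutput) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : mapperOutput.split("\n")) {
            int tab = line.indexOf('\t');
            if (tab >= 0)
                pairs.add(new String[] { line.substring(0, tab),
                                         line.substring(tab + 1) });
            else
                pairs.add(new String[] { line, "" });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A block value spanning three lines, as read between <doc> tags.
        String block = "first_line\nsecond_line\nthird_line";
        // "cat" as the mapper echoes its input unchanged.
        String echoed = frame(42L, block);
        for (String[] kv : parse(echoed))
            System.out.println(kv[0] + " / " + kv[1]);
        // Prints:
        // 42 / first_line
        // second_line /
        // third_line /
    }
}
```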
Where am I wrong?
Thank you in advance,
Francesco