Hi all,
I want to use Hadoop for some streaming text processing on text documents like:

<doc id=... ... ... >
text text
text
...
</doc>


Just XML-like notation, but not real XML files.

I have to work on the text enclosed between <doc> tags, so I implemented an InputFormat (extending FileInputFormat) with a RecordReader that returns the file position as the Key and the needed text as the Value.
This is the next() method, and I'm pretty sure it works as expected:

/** Read a text block. */
public synchronized boolean next(LongWritable key, Text value) throws IOException {
    if (pos >= end)
        return false;

    key.set(pos); // key is the file position
    buffer.reset();
    long bytesRead = readBlock(startTag, endTag); // copy the needed text into buffer
    if (bytesRead == 0)
        return false;
    pos += bytesRead;
    value.set(buffer.getData(), 0, buffer.getLength());
    return true;
}
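For context, here is a self-contained sketch of roughly what readBlock() does: plain Java with no Hadoop types, an in-memory stream instead of the real split, and a naive byte matcher that is good enough for tags like <doc> and </doc> (the class and method names here are just for the sketch):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BlockScanner {

    /**
     * Read bytes until 'tag' has been fully seen (or EOF). If 'out' is
     * non-null, every byte read (tag included) is copied into it.
     * Returns the number of bytes consumed, or -1 if EOF comes first.
     * (Naive matcher: fine for tags without repeated multi-byte prefixes.)
     */
    static long readUntil(InputStream in, byte[] tag, ByteArrayOutputStream out)
            throws IOException {
        long consumed = 0;
        int matched = 0;
        int b;
        while ((b = in.read()) != -1) {
            consumed++;
            if (out != null)
                out.write(b);
            if (b == tag[matched]) {
                if (++matched == tag.length)
                    return consumed;
            } else {
                matched = (b == tag[0]) ? 1 : 0;
            }
        }
        return -1;
    }

    /** Copy one startTag...endTag block (tags included) into 'buffer'. */
    static long readBlock(InputStream in, byte[] startTag, byte[] endTag,
                          ByteArrayOutputStream buffer) throws IOException {
        long skipped = readUntil(in, startTag, null); // discard up to the start tag
        if (skipped < 0)
            return 0;                                 // no more blocks
        buffer.write(startTag, 0, startTag.length);
        long copied = readUntil(in, endTag, buffer);  // copy body plus end tag
        if (copied < 0)
            return 0;                                 // truncated block
        return skipped + copied;
    }

    public static void main(String[] args) throws IOException {
        String doc = "junk<doc id=1>text text\ntext</doc>junk";
        InputStream in = new ByteArrayInputStream(doc.getBytes("UTF-8"));
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        readBlock(in, "<doc".getBytes("UTF-8"), "</doc>".getBytes("UTF-8"), buf);
        System.out.println(buf.toString("UTF-8"));
        // prints: <doc id=1>text text
        //         text</doc>
    }
}
```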

But when I test it, using "cat" as the mapper and TextOutputFormat as the OutputFormat, I get one key/value pair per line: for every text block, the first tuple has the file position as key and text as value, and the remaining tuples have text as key and no value, i.e.:

file_pos / first_line
second_line /
third_line /
...
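That pattern looks to me as if the multi-line value is being re-split on its embedded newlines somewhere in the streaming layer, with everything before the first tab taken as the key. This is just a guess, but a tiny plain-Java simulation (no Hadoop, hypothetical names) reproduces exactly the output I see:

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingSplitDemo {

    // Simulate a line-oriented pass over "key TAB value": each embedded
    // newline in the value starts a fresh line, and a line with no tab
    // is re-parsed as a key with an empty value.
    static List<String> simulate(long filePos, String value) {
        List<String> records = new ArrayList<>();
        for (String line : (filePos + "\t" + value).split("\n")) {
            int tab = line.indexOf('\t');
            String k = (tab >= 0) ? line.substring(0, tab) : line;
            String v = (tab >= 0) ? line.substring(tab + 1) : "";
            records.add(v.isEmpty() ? k + " /" : k + " / " + v);
        }
        return records;
    }

    public static void main(String[] args) {
        for (String r : simulate(42, "first_line\nsecond_line\nthird_line"))
            System.out.println(r);
        // prints:
        // 42 / first_line
        // second_line /
        // third_line /
    }
}
```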

Where am I wrong?

Thank you in advance,
Francesco
