Hi all,
I want to use Hadoop Streaming for some text processing on documents
like:
<doc id=... ... ... >
text text
text
...
</doc>
It is just XML-like notation, not real XML files.
I have to work on the text enclosed between <doc> tags, so I implemented
an InputFormat (extending FileInputFormat) with a RecordReader that
returns the file position as the key and the needed text as the value.
Here is the next() method; I'm pretty sure it works as expected:
/** Read a text block. */
public synchronized boolean next(LongWritable key, Text value)
    throws IOException
{
    if (pos >= end)
        return false;
    key.set(pos); // key is the file position
    buffer.reset();
    long bytesRead = readBlock(startTag, endTag); // put the needed text in the buffer
    if (bytesRead == 0)
        return false;
    pos += bytesRead;
    value.set(buffer.getData(), 0, buffer.getLength());
    return true;
}
But when I test it, using "cat" as the mapper and TextOutputFormat as
the OutputFormat, I get one key/value pair per line: for every text
block, the first tuple has the file position as key and the text as
value, and the remaining tuples have text as key and no value, i.e.:
file_pos / first_line
second_line /
third_line /
...
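For reference, this looks consistent with how Streaming frames records: each record reaches the mapper as the key, a tab, the value, and a newline, and each line of mapper output is split back into key/value at the first tab. A stand-alone sketch of that framing (class and helper names are mine, not Hadoop API; "cat" as mapper just echoes its input):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of Streaming's line-oriented record framing.
// Not Hadoop code: StreamingFramingDemo, frame(), and parse() are
// hypothetical names for illustration only.
public class StreamingFramingDemo {

    // Serialize one record the way Streaming feeds it to the mapper:
    // "key <TAB> value <NEWLINE>".
    static String frame(long key, String value) {
        return key + "\t" + value + "\n";
    }

    // Re-parse mapper output line by line; a line with no tab
    // becomes a key with an empty value.
    static List<String[]> parse(String mapperOutput) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : mapperOutput.split("\n")) {
            int tab = line.indexOf('\t');
            if (tab >= 0)
                pairs.add(new String[] { line.substring(0, tab),
                                         line.substring(tab + 1) });
            else
                pairs.add(new String[] { line, "" });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A block value spanning three lines, as read between <doc> tags.
        String block = "first_line\nsecond_line\nthird_line";
        // "cat" as the mapper echoes its input unchanged.
        String echoed = frame(42L, block);
        for (String[] kv : parse(echoed))
            System.out.println(kv[0] + " / " + kv[1]);
        // Prints:
        // 42 / first_line
        // second_line /
        // third_line /
    }
}
```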
Where am I wrong?
Thank you in advance,
Francesco