Thanks a lot,
but I still cannot understand why the lines after the first one become
keys. Why does this happen? Shouldn't they still be part of the value?
I implemented a CustomOutputFormat that writes out only the values, and I got:
first_line_in_text_block
EOF
I tried outputting the key only and I got:
second_line_in_text_block
third_line_in_text_block
...
last_line_in_text_block
EOF
So it seems there is no way forward, and that seems impossible to me.
Any hints?
Thank you again,
Francesco
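
A possible explanation, assuming the job runs under Hadoop streaming (which the "cat" mapper and the -jobconf option suggest): streaming reads the mapper's standard output line by line, and treats the text up to the first tab on each line as the key and the rest as the value. A line without a tab becomes a key with no value. If that is what happens here, the text block is already split into one record per line before any OutputFormat sees it, which would match both results above. The sketch below only imitates that line-oriented parsing in plain Java; the class and method names are hypothetical and it does not use any Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java imitation of line-oriented key/value parsing; not Hadoop code. */
class StreamingLineSplitDemo {

    /** Splits one line at the first tab; with no tab the whole line is the key. */
    static String[] splitLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    /** Treats every newline-terminated line of the mapper output as one record. */
    static List<String[]> parse(String mapperOutput) {
        List<String[]> records = new ArrayList<>();
        for (String line : mapperOutput.split("\n")) {
            records.add(splitLine(line));
        }
        return records;
    }

    public static void main(String[] args) {
        // What "cat" would emit for one record whose value spans three lines.
        String mapperOutput = "1024\tfirst_line\nsecond_line\nthird_line\n";
        for (String[] kv : parse(mapperOutput)) {
            System.out.println("key=[" + kv[0] + "] value=[" + kv[1] + "]");
        }
    }
}
```

With this input, only the first record carries a value; every following line ends up as a key on its own.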
Jingkei Ly wrote:
I think I see now. Just to recap... you are right that TextOutputFormat
outputs Key\tValue\n, which in your case gives:
File_position\tText_block\n.
But as your Text_block contains '\n' your output actually comes out as:
Key                        Value
-------------------------  ------------------------
file_position              first_line_in_text_block
second_line_in_text_block  NOVALUE
third_line_in_text_block   NOVALUE
...
As I mentioned in my other reply, I think you need to write your own
OutputFormat to get the output file exactly how you want (perhaps
something like LineRecordWriter which doesn't write the key out and
outputs a separator of your choosing between each record).
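
The idea of a writer that drops the key and ends each record with a chosen separator can be sketched as follows. This is plain Java writing to a java.io.PrintWriter, not an implementation of Hadoop's RecordWriter interface; the class name ValueOnlyWriter and the separator string are assumptions made for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical value-only writer; a real one would implement Hadoop's RecordWriter. */
class ValueOnlyWriter {
    private final PrintWriter out;
    private final String separator;

    ValueOnlyWriter(PrintWriter out, String separator) {
        this.out = out;
        this.separator = separator;
    }

    /** Ignores the key and writes the value followed by the record separator. */
    void write(Object key, String value) {
        if (value == null) {
            return; // nothing to write for this record
        }
        out.print(value);
        out.print(separator);
        out.flush();
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        ValueOnlyWriter writer = new ValueOnlyWriter(new PrintWriter(sink), "\n--\n");
        writer.write(0L, "first_line\nsecond_line");
        writer.write(42L, "another_block");
        System.out.print(sink);
    }
}
```

A writer like this can only preserve a multi-line block if the value still contains the whole block when it reaches the writer.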
-----Original Message-----
From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
Sent: 10 July 2008 17:15
To: [email protected]
Subject: Re: Custom InputFormat/OutputFormat
OK, I don't want to bother you, but I think I'm missing something.
I have to:
- extract relevant text blocks from a really big document (<doc id= .....>
TEXTBLOCK </doc>)
- apply some python/c/c++ functions as mappers to text blocks (called
via shell script)
- output processed text back to text file
In order to do that I:
- wrote a CustomInputFormat that creates [File_position / Text_block]
tuples as key/values and
- invoked hadoop without reduce phase (-jobconf mapred.reduce.tasks=0)
'cause I don't want my output to be sorted/grouped.
As far as I can see, the write method of the LineRecordWriter class in
TextOutputFormat just writes Key\tValue (when they are not null), so I
thought that, using "cat" as the mapper to test the CustomInputFormat, the
result should be:
File_position\tText_block\n
Instead, as you already know, I got a tuple for every line, like this:
file_position / first_line_in_text_block
second_line_in_text_block / NOVALUE
third_line_in_text_block / NOVALUE
...
What am I missing?
Thank you for your patience..
Francesco
Jingkei Ly wrote:
I think I need to understand what you are trying to achieve better, so
apologies if these two options don't answer your question fully!
1) If you want to operate on the text in the reducer, then you won't
need to make any changes, as the data between mapper and reducer is
stored as SequenceFiles and so won't suffer from records being delimited
by newline characters. The reducer will therefore see records in the form:
Key: file_pos
Value: all your text with newlines preserved
2) If, however, you are more interested in outputting human-readable
plain-text files with the specifications you want at the end of your
MapReduce program you will probably need to implement your own
OutputFormat which does not output the key, and does not use newline
characters to separate records. I would suggest looking at
TextOutputFormat to start.
HTH,
Jingkei
-----Original Message-----
From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
Sent: 10 July 2008 14:17
To: [email protected]
Subject: Re: Custom InputFormat/OutputFormat
Thank you so much.
The problem is that I need to operate on the text as is, without
modification, and I don't want the file position to be output.
Is there no way in Hadoop to map and output a block of text
containing newline characters?
Thank you again,
Francesco
Jingkei Ly wrote:
I think you need to strip out the newline characters in the value you
return, as the TextOutputFormat will treat each newline character as
the start of a new record.
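
As a variation on stripping the newlines (not something proposed in this thread), the value could be escaped so that each record fits on one line and can be restored later. The sketch below is plain Java with hypothetical names (NewlineEscaper, escape, unescape); it assumes a later step is able to call unescape on the output.

```java
/** Hypothetical helper: reversible escaping of newlines inside a value. */
class NewlineEscaper {

    /** Replaces each backslash with two backslashes, then each newline with "\n" as two characters. */
    static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\n", "\\n");
    }

    /** Reverses escape(): backslash + 'n' becomes a newline, backslash + backslash a single backslash. */
    static String unescape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < text.length()) {
                char next = text.charAt(++i);
                sb.append(next == 'n' ? '\n' : next);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String block = "first_line\nsecond_line\nthird_line";
        String oneLine = escape(block);
        System.out.println(oneLine);                          // a single line of text
        System.out.println(unescape(oneLine).equals(block));  // round trip check
    }
}
```

Plain stripping is simpler, but it cannot be undone; escaping keeps the original text recoverable at the cost of an extra decoding step.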
-----Original Message-----
From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
Sent: 09 July 2008 11:27
To: [email protected]
Subject: Custom InputFormat/OutputFormat
Hi all,
I want to use hadoop for some streaming text processing on text
documents like:
<doc id=... ... ... >
text text
text
...
</doc>
Just xml-like notation but not real xml files.
I have to work on the text included between <doc> tags, so I implemented
an InputFormat (extending FileInputFormat) with a RecordReader that
returns the file position as Key and the needed text as Value.
This is the next method, and I'm pretty sure that it works as expected:
    /** Read a text block. */
    public synchronized boolean next(LongWritable key, Text value)
            throws IOException
    {
        if (pos >= end)
            return false;
        key.set(pos); // key is position
        buffer.reset();
        // put needed text in buffer
        long bytesRead = readBlock(startTag, endTag);
        if (bytesRead == 0)
            return false;
        pos += bytesRead;
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
    }
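
The readBlock helper is not shown in this message. As a rough illustration of the kind of extraction it performs, the sketch below pulls the text between tags out of an in-memory string. Everything in it (the class DocBlockExtractor, the extract method, matching the opening tag by the prefix "<doc") is an assumption for illustration and not the author's code, which works on a file stream and byte positions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for the tag-based block extraction. */
class DocBlockExtractor {

    /** Returns the text found between each opening tag (up to its '>') and the next end tag. */
    static List<String> extract(String input, String startTagPrefix, String endTag) {
        List<String> blocks = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = input.indexOf(startTagPrefix, pos);
            if (open < 0) break;                       // no more opening tags
            int bodyStart = input.indexOf('>', open);
            if (bodyStart < 0) break;                  // opening tag never closed
            bodyStart++;
            int close = input.indexOf(endTag, bodyStart);
            if (close < 0) break;                      // block without end tag
            blocks.add(input.substring(bodyStart, close));
            pos = close + endTag.length();
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input = "<doc id=1>\ntext text\ntext\n</doc>\n<doc id=2>more</doc>\n";
        for (String block : extract(input, "<doc", "</doc>")) {
            System.out.println("[" + block + "]");
        }
    }
}
```

Each returned block keeps its embedded newlines, which is the property the rest of the thread is concerned with.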
But when I test it, using "cat" as the mapper function and
TextOutputFormat as the OutputFormat, I get one key/value pair per line.
For every text block, the first tuple has the file position as key and
text as value; the remaining ones have text as key and no value, i.e.:
file_pos / first_line
second_line /
third_line /
...
Where am I going wrong?
Thank you in advance,
Francesco