RE: Custom InputFormat/OutputFormat

Jingkei Ly Thu, 10 Jul 2008 07:15:33 -0700

I'm not sure I fully understand what you are trying to achieve, so
apologies if these two options don't answer your question fully!


1) If you want to operate on the text in the reducer, you won't need to
make any changes: the data passed between mapper and reducer is stored
as SequenceFiles, so its records are not delimited by newline
characters. The reducer will see records in the form:
 
Key: file_pos
Value: all your text with newlines preserved

2) If, however, you are more interested in outputting human-readable
plain-text files in the format you want at the end of your MapReduce
program, you will probably need to implement your own OutputFormat, one
which does not output the key and does not use newline characters to
separate records. I would suggest looking at TextOutputFormat as a
starting point.
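
As a rough illustration of option 2, here is a minimal sketch of the
write logic such an OutputFormat would need. The class name
ValueOnlyWriter is hypothetical, and it deliberately uses only java.io
so that it runs without Hadoop. In a real job the same logic would live
in the RecordWriter returned by your OutputFormat's getRecordWriter
method, and the value would be a Text rather than a String.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical value-only writer: drops the key, adds no separator. */
public class ValueOnlyWriter {
    private final DataOutputStream out;

    public ValueOnlyWriter(DataOutputStream out) {
        this.out = out;
    }

    /** Ignores the key and writes the value bytes exactly as given. */
    public synchronized void write(Object key, String value) throws IOException {
        if (value == null) {
            return; // nothing to write for an empty record
        }
        // No key, no tab, no trailing newline: the text block is copied as is.
        out.write(value.getBytes(StandardCharsets.UTF_8));
    }

    public synchronized void close() throws IOException {
        out.close();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ValueOnlyWriter writer = new ValueOnlyWriter(new DataOutputStream(sink));
        // Keys stand in for file positions; they never reach the output.
        writer.write(0L, "<doc id=1>\ntext text\ntext\n</doc>\n");
        writer.write(42L, "<doc id=2>\nmore text\n</doc>\n");
        writer.close();
        System.out.print(sink.toString("UTF-8"));
    }
}
```

Running main prints the two blocks back to back, with their internal
newlines intact and no file positions.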

HTH,
Jingkei

-----Original Message-----
From: Francesco Tamberi [mailto:[EMAIL PROTECTED] 
Sent: 10 July 2008 14:17
To: [email protected]
Subject: Re: Custom InputFormat/OutputFormat

Thank you so much.
The problem is that I need to operate on the text as is, without
modification, and I don't want the file position to be output.
Is there no way in Hadoop to map and output a block of text containing
newline characters?
Thank you again,
Francesco

Jingkei Ly wrote:
> I think you need to strip out the newline characters in the value you 
> return, as the TextOutputFormat will treat each newline character as 
> the start of a new record.
>
> -----Original Message-----
> From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
> Sent: 09 July 2008 11:27
> To: [email protected]
> Subject: Custom InputFormat/OutputFormat
>
> Hi all,
> I want to use hadoop for some streaming text processing on text 
> documents like:
>
> <doc id=... ... ... >
> text text
> text
> ...
> </doc>
>
>
> Just xml-like notation but not real xml files.
>
> I have to work on the text included between <doc> tags, so I
> implemented an InputFormat (extending FileInputFormat) with a
> RecordReader that returns the file position as Key and the needed text
> as Value.
> This is the next() method, and I'm pretty sure that it works as expected:
>
> /** Read a text block. */
>         public synchronized boolean next(LongWritable key, Text value)
>                 throws IOException
>         {
>             if (pos >= end)
>                 return false;
>
>             key.set(pos); // key is position
>             buffer.reset();
>             // put needed text in buffer
>             long bytesRead = readBlock(startTag, endTag);
>             if (bytesRead == 0)
>                 return false;
>            
>             pos += bytesRead;
>             value.set(buffer.getData(), 0, buffer.getLength());
>             return true;
>         }
>
> But when I test it, using "cat" as the mapper function and
> TextOutputFormat as the OutputFormat, I get one key/value pair per line.
> For every text block, the first tuple has the file position as key and
> text as value; the remaining ones have text as key and no value, i.e.:
>
> file_pos / first_line
> second_line /
> third_line /
> ...
>
> Where am I wrong?
>
> Thank you in advance,
> Francesco
