I think I see now. Just to recap... you are right that TextOutputFormat outputs Key\tValue\n, which in your case gives: File_position\tText_block\n.
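To make that concrete, here is a plain-Java mimic (not the actual Hadoop class, just an illustration of its behaviour) of what TextOutputFormat's LineRecordWriter.write effectively does when the value itself contains newlines:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class KeyValueLineDemo {
    // Mimics what LineRecordWriter.write() does for a non-null key/value:
    // key bytes, a tab, value bytes, then a record-terminating newline.
    static void writeRecord(DataOutputStream out, String key, String value)
            throws IOException {
        out.writeBytes(key);
        out.writeBytes("\t");
        out.writeBytes(value);
        out.writeBytes("\n");
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // A value that itself contains newlines, like a multi-line text block.
        writeRecord(out, "1042", "first line\nsecond line\nthird line");

        // The single logical record spans three physical lines in the file,
        // so anything that re-reads the output line-by-line sees three records.
        for (String line : bytes.toString().split("\n")) {
            System.out.println(line);
        }
    }
}
```

Only the first physical line carries the key and tab; the continuation lines are bare text, which matches the NOVALUE rows described below.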
But as your Text_block contains '\n' characters, your output actually comes out as:

Key                          Value
-------                      -------------
file_position                first_line_in_text_block
second_line_in_text_block    NOVALUE
third_line_in_text_block     NOVALUE
...

As I mentioned in my other reply, I think you need to write your own OutputFormat to get the output file exactly how you want it (perhaps something like LineRecordWriter, but one that doesn't write the key out and puts a separator of your choosing between records).

-----Original Message-----
From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
Sent: 10 July 2008 17:15
To: [email protected]
Subject: Re: Custom InputFormat/OutputFormat

Ok, I don't mean to annoy you, but I think I'm missing something. I have to:

- extract relevant text blocks from a really big document (<doc id= .....> TEXTBLOCK </doc>)
- apply some Python/C/C++ functions as mappers to the text blocks (called via shell script)
- output the processed text back to a text file

To do that, I:

- wrote a CustomInputFormat that creates [File_position / Text_block] tuples as key/values, and
- invoked Hadoop without a reduce phase (-jobconf mapred.reduce.tasks=0), because I don't want my output to be sorted/grouped.

As far as I can see, the write method of the LineRecordWriter class in TextOutputFormat just writes (if not null) Key\tValue, so I thought that, using "cat" as a mapper to test the CustomInputFormat, the result should be:

File_position\tText_block\n

Instead, as you already know, I got a tuple for every line, like this:

file_position / first_line_in_text_block
second_line_in_text_block / NOVALUE
third_line_in_text_block / NOVALUE
...

What am I missing? Thank you for your patience..
Francesco

Jingkei Ly wrote:
> I think I need to understand what you are trying to achieve better, so
> apologies if these two options don't answer your question fully!
>
> 1) If you want to operate on the text in the reducer, then you won't
> need to make any changes, as the data between mapper and reducer is
> stored as SequenceFiles, so it won't suffer from records being delimited
> by newline characters. So the reducer will see records in the form:
>
> Key: file_pos
> Value: all your text with newlines preserved
>
> 2) If, however, you are more interested in outputting human-readable
> plain-text files to your specification at the end of your
> MapReduce program, you will probably need to implement your own
> OutputFormat which does not output the key and does not use newline
> characters to separate records. I would suggest looking at
> TextOutputFormat to start.
>
> HTH,
> Jingkei
>
> -----Original Message-----
> From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
> Sent: 10 July 2008 14:17
> To: [email protected]
> Subject: Re: Custom InputFormat/OutputFormat
>
> Thank you so much.
> The problem is that I need to operate on the text as-is, without
> modification, and I don't want the filepos to be output.
> Is there no way in Hadoop to map and output a block of text
> containing newline characters?
> Thank you again,
> Francesco
>
> Jingkei Ly wrote:
>
>> I think you need to strip out the newline characters in the value you
>> return, as TextOutputFormat will treat each newline character as
>> the start of a new record.
>>
>> -----Original Message-----
>> From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
>> Sent: 09 July 2008 11:27
>> To: [email protected]
>> Subject: Custom InputFormat/OutputFormat
>>
>> Hi all,
>> I want to use Hadoop for some streaming text processing on text
>> documents like:
>>
>> <doc id=... ... ... >
>> text text
>> text
>> ...
>> </doc>
>>
>> Just XML-like notation, but not real XML files.
>>
>> I have to work on the text included between the <doc> tags, so I
>> implemented an InputFormat (extending FileInputFormat) with a
>> RecordReader that returns the file position as Key and the needed
>> text as Value.
>> This is the next method, and I'm pretty sure it works as expected:
>>
>> /** Read a text block. */
>> public synchronized boolean next(LongWritable key, Text value)
>>         throws IOException
>> {
>>     if (pos >= end)
>>         return false;
>>
>>     key.set(pos); // key is position
>>     buffer.reset();
>>     long bytesRead = readBlock(startTag, endTag); // put needed text in buffer
>>     if (bytesRead == 0)
>>         return false;
>>
>>     pos += bytesRead;
>>     value.set(buffer.getData(), 0, buffer.getLength());
>>     return true;
>> }
>>
>> But when I test it, using "cat" as the mapper function and
>> TextOutputFormat as the OutputFormat, I get one key/value per line:
>> for every text block, the first tuple has the file position as key
>> and text as value, and the remaining ones have text as key and no
>> value, i.e.:
>>
>> file_pos / first_line
>> second_line /
>> third_line /
>> ...
>>
>> Where am I wrong?
>>
>> Thank you in advance,
>> Francesco
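For what it's worth, the writer Jingkei suggests (drop the key, put a separator of your choosing between records) can be sketched as follows. This is plain Java rather than the real Hadoop API: in an actual job this logic would live in a RecordWriter returned by a custom OutputFormat (e.g. one extending FileOutputFormat), and the class name and separator here are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of a record writer that, unlike LineRecordWriter, ignores the
// key entirely and writes the separator only *between* records, so any
// '\n' embedded in the value survives untouched.
public class BlockRecordWriterSketch {
    private final DataOutputStream out;
    private final String separator;
    private boolean firstRecord = true;

    BlockRecordWriterSketch(DataOutputStream out, String separator) {
        this.out = out;
        this.separator = separator;
    }

    void write(Object key, String value) throws IOException {
        if (!firstRecord) {
            out.writeBytes(separator); // between records only, never after the last
        }
        out.writeBytes(value);         // the key is deliberately dropped
        firstRecord = false;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BlockRecordWriterSketch writer =
                new BlockRecordWriterSketch(new DataOutputStream(bytes), "\n---\n");

        // Two multi-line text blocks; the file positions never reach the output.
        writer.write(1042L, "line one\nline two");
        writer.write(2084L, "line three");

        System.out.println(bytes.toString());
    }
}
```

With this shape the output file contains each text block verbatim, delimited by the chosen separator, and no file positions at all.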
