I think I see now. Just to recap... you are right that TextOutputFormat outputs Key\tValue\n, which in your case gives: File_position\tText_block\n.
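To make that concrete, here is a plain-Java mimic (not the actual Hadoop class, just an illustration of its behaviour) of what TextOutputFormat's LineRecordWriter.write effectively does when the value itself contains newlines:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class KeyValueLineDemo {
    // Mimics what LineRecordWriter.write() does for a non-null key/value:
    // key bytes, a tab, value bytes, then a record-terminating newline.
    static void writeRecord(DataOutputStream out, String key, String value)
            throws IOException {
        out.writeBytes(key);
        out.writeBytes("\t");
        out.writeBytes(value);
        out.writeBytes("\n");
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // A value that itself contains newlines, like a multi-line text block.
        writeRecord(out, "1042", "first line\nsecond line\nthird line");

        // The single logical record spans three physical lines in the file,
        // so anything that re-reads the output line-by-line sees three records.
        for (String line : bytes.toString().split("\n")) {
            System.out.println(line);
        }
    }
}
```

Only the first physical line carries the key and tab; the continuation lines are bare text, which matches the NOVALUE rows described below.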
But as your Text_block contains '\n' characters, your output actually comes out as:

Key                          Value
-------                      -------------
file_position                first_line_in_text_block
second_line_in_text_block    NOVALUE
third_line_in_text_block     NOVALUE
...

As I mentioned in my other reply, I think you need to write your own OutputFormat to get the output file exactly how you want it (perhaps something like LineRecordWriter, but one that doesn't write the key out and puts a separator of your choosing between records).

-----Original Message-----
From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
Sent: 10 July 2008 17:15
To: [email protected]
Subject: Re: Custom InputFormat/OutputFormat

Ok, I don't mean to annoy you, but I think I'm missing something. I have to:

- extract relevant text blocks from a really big document (<doc id= .....> TEXTBLOCK </doc>)
- apply some Python/C/C++ functions as mappers to the text blocks (called via shell script)
- output the processed text back to a text file

To do that, I:

- wrote a CustomInputFormat that creates [File_position / Text_block] tuples as key/values, and
- invoked Hadoop without a reduce phase (-jobconf mapred.reduce.tasks=0), because I don't want my output to be sorted/grouped.

As far as I can see, the write method of the LineRecordWriter class in TextOutputFormat just writes (if not null) Key\tValue, so I thought that, using "cat" as a mapper to test the CustomInputFormat, the result should be:

File_position\tText_block\n

Instead, as you already know, I got a tuple for every line, like this:

file_position / first_line_in_text_block
second_line_in_text_block / NOVALUE
third_line_in_text_block / NOVALUE
...

What am I missing? Thank you for your patience..
Francesco

Jingkei Ly wrote:
> I think I need to understand what you are trying to achieve better, so
> apologies if these two options don't answer your question fully!
>
> 1) If you want to operate on the text in the reducer, then you won't
> need to make any changes, as the data between mapper and reducer is
> stored as SequenceFiles, so it won't suffer from records being delimited
> by newline characters. So the reducer will see records in the form:
>
> Key: file_pos
> Value: all your text with newlines preserved
>
> 2) If, however, you are more interested in outputting human-readable
> plain-text files to your specification at the end of your
> MapReduce program, you will probably need to implement your own
> OutputFormat which does not output the key and does not use newline
> characters to separate records. I would suggest looking at
> TextOutputFormat to start.
>
> HTH,
> Jingkei
>
> -----Original Message-----
> From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
> Sent: 10 July 2008 14:17
> To: [email protected]
> Subject: Re: Custom InputFormat/OutputFormat
>
> Thank you so much.
> The problem is that I need to operate on the text as-is, without
> modification, and I don't want the filepos to be output.
> Is there no way in Hadoop to map and output a block of text
> containing newline characters?
> Thank you again,
> Francesco
>
> Jingkei Ly wrote:
>
>> I think you need to strip out the newline characters in the value you
>> return, as TextOutputFormat will treat each newline character as
>> the start of a new record.
>>
>> -----Original Message-----
>> From: Francesco Tamberi [mailto:[EMAIL PROTECTED]
>> Sent: 09 July 2008 11:27
>> To: [email protected]
>> Subject: Custom InputFormat/OutputFormat
>>
>> Hi all,
>> I want to use Hadoop for some streaming text processing on text
>> documents like:
>>
>> <doc id=... ... ... >
>> text text
>> text
>> ...
>> </doc>
>>
>> Just XML-like notation, but not real XML files.
>>
>> I have to work on the text included between the <doc> tags, so I
>> implemented an InputFormat (extending FileInputFormat) with a
>> RecordReader that returns the file position as Key and the needed
>> text as Value.
>> This is the next method, and I'm pretty sure it works as expected:
>>
>> /** Read a text block. */
>> public synchronized boolean next(LongWritable key, Text value)
>>         throws IOException
>> {
>>     if (pos >= end)
>>         return false;
>>
>>     key.set(pos); // key is position
>>     buffer.reset();
>>     long bytesRead = readBlock(startTag, endTag); // put needed text in buffer
>>     if (bytesRead == 0)
>>         return false;
>>
>>     pos += bytesRead;
>>     value.set(buffer.getData(), 0, buffer.getLength());
>>     return true;
>> }
>>
>> But when I test it, using "cat" as the mapper function and
>> TextOutputFormat as the OutputFormat, I get one key/value per line:
>> for every text block, the first tuple has the file position as key
>> and text as value, and the remaining ones have text as key and no
>> value, i.e.:
>>
>> file_pos / first_line
>> second_line /
>> third_line /
>> ...
>>
>> Where am I wrong?
>>
>> Thank you in advance,
>> Francesco
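For what it's worth, the writer Jingkei suggests (drop the key, put a separator of your choosing between records) can be sketched as follows. This is plain Java rather than the real Hadoop API: in an actual job this logic would live in a RecordWriter returned by a custom OutputFormat (e.g. one extending FileOutputFormat), and the class name and separator here are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of a record writer that, unlike LineRecordWriter, ignores the
// key entirely and writes the separator only *between* records, so any
// '\n' embedded in the value survives untouched.
public class BlockRecordWriterSketch {
    private final DataOutputStream out;
    private final String separator;
    private boolean firstRecord = true;

    BlockRecordWriterSketch(DataOutputStream out, String separator) {
        this.out = out;
        this.separator = separator;
    }

    void write(Object key, String value) throws IOException {
        if (!firstRecord) {
            out.writeBytes(separator); // between records only, never after the last
        }
        out.writeBytes(value);         // the key is deliberately dropped
        firstRecord = false;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BlockRecordWriterSketch writer =
                new BlockRecordWriterSketch(new DataOutputStream(bytes), "\n---\n");

        // Two multi-line text blocks; the file positions never reach the output.
        writer.write(1042L, "line one\nline two");
        writer.write(2084L, "line three");

        System.out.println(bytes.toString());
    }
}
```

With this shape the output file contains each text block verbatim, delimited by the chosen separator, and no file positions at all.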
