(I'm quite new to Hadoop and MapReduce, so some of these questions might not make complete sense.)
I want to perform simple data transforms on large datasets, and Hadoop seems like an appropriate tool. As a simple example, let's say I want to read every line of a text file, uppercase it, and write it out.

First question: is Hadoop an appropriate tool for something like this, and what is the best way to model this type of work in Hadoop? I'm thinking my mappers will accept a Long key representing the byte offset into the input file, and a Text value representing the line itself. I *could* simply uppercase the text lines and write them to an output file directly in the mapper, without using any reducers. So: is it considered bad practice to write output files directly from mappers?

Assuming it's advisable in this example to write a file directly in the mapper: how should the mapper create a unique output partition file name? Is there a way for a mapper to know its index among the total number of mappers?

Assuming it's inadvisable to write a file directly in the mapper: I can emit the records to the reducers using the same key, with the uppercased data as the value. Then, in my reducer, should I write a file myself? Or should I collect() the records in the reducer and let Hadoop write the output? If I let Hadoop write the output, is there a way to prevent it from writing the key to the output file? I may want to perform several transformations on a set of data, one after another, and I don't want a superfluous key prepended to every record on each pass.

I appreciate any feedback anyone has to offer.
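To make the transform concrete, here's roughly what I have in mind, sketched as a Hadoop Streaming-style mapper in Python (the function names are mine, and I'm only assuming Streaming as one possible way to run this; the Java API I'm asking about may look different):

```python
import io

def transform(line):
    """The per-record map step: uppercase one line of text."""
    return line.upper()

def run_mapper(stream):
    # A real Streaming mapper would read records from sys.stdin and
    # print results to sys.stdout; taking any iterable of lines lets
    # this sketch run without a cluster.
    return [transform(line.rstrip("\n")) for line in stream]

# Simulate a small input split in place of stdin:
sample = io.StringIO("hello hadoop\nmap reduce\n")
for out_line in run_mapper(sample):
    print(out_line)  # HELLO HADOOP, then MAP REDUCE
```

My understanding is that a script shaped like this could be wired into a job via the Hadoop Streaming jar, though I haven't verified the exact invocation.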
