On Sat, Dec 13, 2008 at 9:32 PM, Stuart White <[email protected]> wrote:
> (I'm quite new to hadoop and map/reduce, so some of these questions
> might not make complete sense.)
>
> I want to perform simple data transforms on large datasets, and it
> seems Hadoop is an appropriate tool.  As a simple example, let's say I
> want to read every line of a text file, uppercase it, and write it
> out.
>
> First question: would Hadoop be an appropriate tool for something like this?

Yes. Very appropriate.

>
> What is the best way to model this type of work in Hadoop?

Start with Hadoop's WordCount example in the tutorial and modify it to
fit your requirements.
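For the uppercasing example, the mapper body ends up very small. A minimal sketch against the old `org.apache.hadoop.mapred` API (package and class names from the 0.18/0.19 line; adjust to your Hadoop version):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-only transform: read each line, uppercase it, emit only the value.
// The LongWritable key is the byte offset of the line in the input file;
// we don't need it in the output, so we emit NullWritable instead.
public class UppercaseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, Text> {

  private final Text out = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<NullWritable, Text> collector,
                  Reporter reporter) throws IOException {
    out.set(value.toString().toUpperCase());
    collector.collect(NullWritable.get(), out);
  }
}
```

With `conf.setNumReduceTasks(0)` this runs as a map-only job and the mapper output is written straight to the output directory by the configured OutputFormat, so no reducer is needed for a pure per-line transform.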

>
> I'm thinking my mappers will accept a Long key that represents the
> byte offset into the input file, and a Text value that represents the
> line in the file.
>
> I *could* simply uppercase the text lines and write them to an output
> file directly in the mapper (and not use any reducers).  So, there's a
> question: is it considered bad practice to write output files directly
> from mappers?

Technically, you could do this by opening a file writer in
configure(), doing the writes in map(), and closing the writer in
close(). But to me this seems contorted when the Hadoop framework
already offers something straightforward.

>
> Assuming it's advisable in this example to write a file directly in
> the mapper - how should the mapper create a unique output partition
> file name?
> Is there a way for a mapper to know its index in the total
> # of mappers?

Use the mapred.task.id property from the job configuration to create a
unique name per mapper; it holds the task attempt id, which is unique
across all map tasks in the job.
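If you do go this route, the name-building itself is trivial. A sketch (the helper name is mine; in a real mapper you would pass in `conf.get("mapred.task.id")`, which yields an id like `attempt_200812132130_0001_m_000003_0`):

```java
// Build a per-mapper output name from the task attempt id that Hadoop
// exposes as the "mapred.task.id" job property. Each map task attempt
// has a distinct id, so names built this way don't collide.
public class UniqueName {
  public static String forTask(String prefix, String taskAttemptId) {
    return prefix + "-" + taskAttemptId;
  }
}
```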

>
> Assuming it's inadvisable to write a file directly in the mapper - I
> can output the records to the reducers using the same key and using
> the uppercased data as the value.  Then, in my reducer, should I write
> a file?  Or should I collect() the records in the reducers and let
> hadoop write the output?
>
> If I let hadoop write the output, is there a way to prevent hadoop
> from writing the key to the output file?  I may want to perform
> several transformations, one-after-another, on a set of data, and I
> don't want to place a superfluous key at the front of every record for
> each pass of the data.
>

Just use collect() and TextOutputFormat, with one detail to watch:
TextOutputFormat writes each record as the key, a tab separator, and
then the value. To avoid the superfluous key you mention, emit a
NullWritable (or null) key from collect(); TextOutputFormat then
writes only the value, one record per line. The byte-offset key you
receive in the mapper is just input bookkeeping and never needs to
appear in the output.
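Putting the pieces together, a driver might look like the sketch below (old `JobConf` API; `UppercaseMapper` stands for whatever mapper class you write that emits NullWritable keys and Text values):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class UppercaseJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(UppercaseJob.class);
    conf.setJobName("uppercase");

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // A NullWritable key makes TextOutputFormat write only the value,
    // so no key (and no tab) is prefixed to each output record.
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);

    conf.setMapperClass(UppercaseMapper.class);
    conf.setNumReduceTasks(0);  // map-only: mapper output goes straight to HDFS

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
```

Because the output contains only bare lines of text, you can feed one job's output directory straight in as the next job's input for chained transformations.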

> I appreciate any feedback anyone has to offer.
>
