Hi,

We have a job that outputs a set of files, each several hundred MB of
text.
Using comparators and the like, we can produce output files that are each
sorted internally.

What we want is one giant output file (outside of the cluster) that
is sorted.

Now we see the following options:
1) Run the last job with a single reducer. This is not really an option,
because it would funnel a significant part of the processing through one
CPU (this would take too long).
2) Create an additional job that sorts the existing files and has a single
reducer.
3) Download all of the files and run the standard command-line tool "sort
-m".
4) Install HDFS FUSE and run the standard command-line tool "sort -m".
5) Create a Hadoop-specific tool that does "hadoop fs -text" and "sort
-m" in one go (a sketch follows below).

During our discussion we wondered: what is the best way to do this?
What do you recommend?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
