Many times a Hadoop job produces one file per reducer and the job has
many reducers. Or a map-only job produces one output file per input
file and you have many input files. Or you just have many small files
from some external process. Hadoop handles small files sub-optimally.
There are ways to deal with this inside a MapReduce program,
IdentityMapper + IdentityReducer for example, or MultipleOutputs (a
sketch of the identity approach is below). However, we wanted a tool
that could be used by people working with Hive, Pig, or plain
MapReduce. We wanted to allow people to combine a directory with
multiple files, or a hierarchy of directories like the root of a
Hive-partitioned table. We also wanted to be able to combine text or
sequence files.
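
For reference, the IdentityMapper + IdentityReducer workaround looks
roughly like this. It is only a sketch against the old mapred API; it
assumes the inputs are sequence files with Text keys and values, and
the class name and paths are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MergeSmallFiles {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MergeSmallFiles.class);
    conf.setJobName("merge-small-files");

    // Records pass straight through the map and reduce phases unchanged.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // A small number of reducers means a small number of output files;
    // one reducer gives a single merged file.
    conf.setNumReduceTasks(1);

    // Assuming sequence files with Text keys and values; adjust to your data.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Note that records go through the shuffle, so the merged output comes
out grouped by key rather than as a byte-for-byte concatenation of the
inputs.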

What we came up with is the filecrusher.

Usage:
/usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact
/user/edward/backup 50 SEQUENCE
(50 is the number of mappers here)
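
The same argument pattern works against a hierarchy of directories,
for example the root of a Hive-partitioned table (the table path below
is made up):

/usr/bin/hadoop jar filecrush.jar crush.Crush /user/hive/warehouse/my_table
/user/edward/backup 50 SEQUENCE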

The code is licensed under Apache License 2.0 and you can get it here:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

Enjoy,
Edward
