Many times a Hadoop job produces one file per reducer, and the job has many reducers. Or a map-only job produces one output file per input file, and you have many input files. Or you just have many small files from some external process. Hadoop handles small files poorly. There are ways to deal with this inside a MapReduce program, for example IdentityMapper + IdentityReducer, or MultipleOutputs (a sketch of the identity-job approach follows below). However, we wanted a tool that could be used by people working in Hive, Pig, or plain MapReduce. We wanted to let people combine a directory with multiple files, or a hierarchy of directories such as the root of a Hive partitioned table. We also wanted to be able to combine text or sequence files.
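For reference, here is a minimal sketch of the identity-job approach mentioned above, using the classic org.apache.hadoop.mapred API: records pass straight through the mapper and reducer, and the reducer count decides how many output files you end up with. The paths, the reducer count, and the Text key/value types are placeholder assumptions, not part of the filecrusher itself.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityCompact {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityCompact.class);
    conf.setJobName("compact-small-files");

    // Pass every record through unchanged; the number of reduce tasks
    // controls how many output files the job produces.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(5);               // e.g. collapse everything into 5 files

    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);      // assumes the input sequence files hold Text keys
    conf.setOutputValueClass(Text.class);    // and Text values

    FileInputFormat.setInputPaths(conf, new Path("/directory/to/compact"));
    FileOutputFormat.setOutputPath(conf, new Path("/directory/compacted"));

    JobClient.runJob(conf);
  }
}

The obvious drawback is that this is still a full MapReduce job you have to write, build, and run per format, which is exactly why we wanted a general-purpose tool instead.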
What we came up with is the filecrusher.

Usage:

/usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact /user/edward/backup 50 SEQUENCE

(50 is the number of mappers here.)

The code is Apache V2 licensed and you can get it here: http://www.jointhegrid.com/hadoop_filecrush/index.jsp

Enjoy,
Edward
