Edward: Thanks for the tool. I think the last parameter can be omitted if you follow what hadoop fs -text does: it looks at a file's magic number and attempts to *detect* the type of the file.
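
For what it's worth, that kind of detection could be sketched roughly like this. This is an illustrative guess at the idea, not the actual hadoop fs -text source; last I looked it checks at least the SequenceFile and gzip magic numbers:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileTypeSniffer {
  // Peek at the first bytes of a stream and branch on the magic number.
  // Throws EOFException for streams shorter than three bytes.
  public static String detect(InputStream in) throws IOException {
    DataInputStream din = new DataInputStream(in);
    byte[] magic = new byte[3];
    din.readFully(magic);
    // SequenceFiles begin with the bytes 'S' 'E' 'Q'.
    if (magic[0] == 'S' && magic[1] == 'E' && magic[2] == 'Q') {
      return "SEQUENCE";
    }
    // Gzip streams begin with 0x1f 0x8b.
    if ((magic[0] & 0xff) == 0x1f && (magic[1] & 0xff) == 0x8b) {
      return "GZIP";
    }
    // Anything else: fall back to treating it as plain text.
    return "TEXT";
  }
}

With something like that in place the tool could pick TEXT vs SEQUENCE on its own.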
Cheers

On Fri, Sep 24, 2010 at 11:41 PM, Edward Capriolo <[email protected]> wrote:
> Many times a hadoop job produces a file per reducer and the job has
> many reducers. Or a map-only job produces one output file per input
> file and you have many input files. Or you just have many small files
> from some external process. Hadoop has sub-optimal handling of small
> files. There are some ways to handle this inside a map reduce program,
> IdentityMapper + IdentityReducer for example, or multiple outputs.
> However, we wanted a tool that could be used by people using hive, or
> pig, or map reduce. We wanted to allow people to combine a directory
> with multiple files or a hierarchy of directories like the root of a
> hive partitioned table. We also wanted to be able to combine text or
> sequence files.
>
> What we came up with is the filecrusher.
>
> Usage:
> /usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact
> /user/edward/backup 50 SEQUENCE
> (50 is the number of mappers here)
>
> Code is Apache V2 and you can get it here:
> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>
> Enjoy,
> Edward
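
For anyone who wants the do-it-yourself route Edward mentions, the IdentityMapper + IdentityReducer trick looks roughly like the following. This is a minimal sketch against the old mapred API; the paths, reducer count, and Text/Text key/value types are hypothetical examples and would need to match your data:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CombineSmallFiles {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CombineSmallFiles.class);
    conf.setJobName("combine-small-files");

    // Identity map and reduce: records pass through unchanged; the
    // shuffle is only used to concentrate them into fewer files.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Assumes SequenceFile input with Text keys and values.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // One output file per reducer, so 5 reducers => 5 files.
    conf.setNumReduceTasks(5);

    FileInputFormat.setInputPaths(conf, new Path("/directory/to/compact"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/edward/combined"));

    JobClient.runJob(conf);
  }
}

Each reducer writes exactly one output file, so the reducer count directly controls how many files you end up with.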
