Edward: Thanks for the tool. I think the last parameter can be omitted if you follow what hadoop fs -text does: it looks at a file's magic number and attempts to *detect* the type of the file.
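
For what it's worth, that kind of detection could be sketched roughly like this. This is an illustrative guess at the idea, not the actual hadoop fs -text source; last I looked it checks at least the SequenceFile and gzip magic numbers:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileTypeSniffer {
  // Peek at the first bytes of a stream and branch on the magic number.
  // Throws EOFException for streams shorter than three bytes.
  public static String detect(InputStream in) throws IOException {
    DataInputStream din = new DataInputStream(in);
    byte[] magic = new byte[3];
    din.readFully(magic);
    // SequenceFiles begin with the bytes 'S' 'E' 'Q'.
    if (magic[0] == 'S' && magic[1] == 'E' && magic[2] == 'Q') {
      return "SEQUENCE";
    }
    // Gzip streams begin with 0x1f 0x8b.
    if ((magic[0] & 0xff) == 0x1f && (magic[1] & 0xff) == 0x8b) {
      return "GZIP";
    }
    // Anything else: fall back to treating it as plain text.
    return "TEXT";
  }
}

With something like that in place the tool could pick TEXT vs SEQUENCE on its own.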
Cheers

On Fri, Sep 24, 2010 at 11:41 PM, Edward Capriolo <[email protected]> wrote:
> Many times a hadoop job produces a file per reducer and the job has
> many reducers. Or a map-only job produces one output file per input
> file and you have many input files. Or you just have many small files
> from some external process. Hadoop has sub-optimal handling of small
> files. There are some ways to handle this inside a map reduce program,
> IdentityMapper + IdentityReducer for example, or multiple outputs.
> However, we wanted a tool that could be used by people using hive, or
> pig, or map reduce. We wanted to allow people to combine a directory
> with multiple files or a hierarchy of directories like the root of a
> hive partitioned table. We also wanted to be able to combine text or
> sequence files.
>
> What we came up with is the filecrusher.
>
> Usage:
> /usr/bin/hadoop jar filecrush.jar crush.Crush /directory/to/compact
> /user/edward/backup 50 SEQUENCE
> (50 is the number of mappers here)
>
> Code is Apache V2 and you can get it here:
> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>
> Enjoy,
> Edward
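
For anyone who wants the do-it-yourself route Edward mentions, the IdentityMapper + IdentityReducer trick looks roughly like the following. This is a minimal sketch against the old mapred API; the paths, reducer count, and Text/Text key/value types are hypothetical examples and would need to match your data:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CombineSmallFiles {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CombineSmallFiles.class);
    conf.setJobName("combine-small-files");

    // Identity map and reduce: records pass through unchanged; the
    // shuffle is only used to concentrate them into fewer files.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Assumes SequenceFile input with Text keys and values.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // One output file per reducer, so 5 reducers => 5 files.
    conf.setNumReduceTasks(5);

    FileInputFormat.setInputPaths(conf, new Path("/directory/to/compact"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/edward/combined"));

    JobClient.runJob(conf);
  }
}

Each reducer writes exactly one output file, so the reducer count directly controls how many files you end up with.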
