On Wed, Jun 9, 2010 at 9:55 PM, wd <[email protected]> wrote:
> I have lots of small files in hive, the mapred is too slow .... Is there a
> way to improve the speed ?
>
> 2010/6/10 Edward Capriolo <[email protected]>
>
>> On Wed, Jun 9, 2010 at 3:04 AM, wd <[email protected]> wrote:
>>
>>> I've tried hive 0.5, the option not work too.
>>> And find this page[
>>> http://markmail.org/message/k32nrcb2ncsq67ef?q=mapred.map.tasks+#query:mapred.map.tasks%20+page:1+mid:k32nrcb2ncsq67ef+state:results]
>>> via google.
>>>
>>> 2010/6/9 wd <[email protected]>
>>>
>>>> hi,
>>>>
>>>> I'm using hive svn rev946854. And try to set mapred.map.tasks=1 at hive
>>>> cli, but seemes it doesn't work, total map tasks still over 300+.
>>>>
>>>> Is this a svn version problem?
>>>
>>
>> You answered your own question, look in the link:
>>
>> "You cannot force *mapred.map.tasks* but can specify mapred.reduce.tasks."
>>
>> Map tasks is based on the number of input files and folders. Even though
>> hive uses a CombinedInput format you still can get a number of mappers.
>>
>> Edward
>

With Hadoop 0.20 and the CombineFileInputFormat you should get fairly decent
performance even with many small files. My current employer is about to open
source FileCrusher, a stand-alone map/reduce application that merges Text and
Sequence files into one big file. So if you hang tight for a couple of days I
can point you to a utility that might help.
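
In the meantime, here is a rough sketch of settings you could try at the hive
cli. The property names assume Hive 0.5 or later (CombineHiveInputFormat) and
a Hadoop 0.20 cluster, so double check them against your build before relying
on them:

  -- Sketch only: verify these properties exist in your Hive/Hadoop versions.
  -- Let many small files be packed into one split, and thus one map task.
  set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

  -- Upper bound on the bytes packed into a single split (~256 MB here).
  set mapred.max.split.size=268435456;

  -- Also merge the small files this job writes out, so the problem does
  -- not compound on the next query.
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  set hive.merge.size.per.task=268435456;

  -- As noted above, mapred.reduce.tasks can be set; mapred.map.tasks cannot.
  set mapred.reduce.tasks=8;

None of this shrinks the files already sitting in HDFS, which is what
FileCrusher is for, but it should cut the mapper count for queries over them.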
