Thank you Allen. So, is it fair to assume that with a smaller block size (64 MB) my blocks are spread across more datanodes, and because the blocks are on more datanodes my map tasks will also run on more datanodes, and because each map's input is smaller it should execute faster using fewer resources? Should it work this way? Or is there an algorithm for how the blocks are distributed across the datanodes and where the replica copies should go?
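To see the distribution concretely, here is a minimal sketch that asks the namenode where the blocks of a file actually landed, using the standard FileSystem API; the path is a placeholder, and it assumes the cluster config is on the client classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.Arrays;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder path; substitute a real file on the cluster.
        FileStatus status = fs.getFileStatus(new Path("/user/syed/data.txt"));

        // One BlockLocation per block, naming the datanodes holding replicas.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                b.getOffset(), b.getLength(), Arrays.toString(b.getHosts()));
        }
    }
}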
Let's say I have a 640 MB file and a cluster with 5 datanodes, and I configure the block size to be 64 MB. How will this be distributed?

Regards
Syed Wasti

> From: [email protected]
> To: [email protected]
> Subject: Re: Data Block Size ?
> Date: Thu, 15 Jul 2010 18:49:04 +0000
>
> On Jul 15, 2010, at 11:40 AM, Syed Wasti wrote:
>
> > Will it matter what the data block size is ?
>
> Yes.
>
> > It is recommended to have a block size of 64 MB, but if we want to have the
> > data block size to 128 MB, should this affect the performance ?
>
> Yes.
>
> FWIW, we run with 128MB.
>
> > Does the size of the map jobs created on each datanode in any way depend on
> > the block size ?
>
> Yes.
>
> Unless told otherwise, Hadoop will generally use the # of maps == # of
> blocks. So if you have fewer blocks to process, you'll have fewer maps to do
> more work. This is not necessarily a bad thing; it all depends upon your
> workload, size of grid, etc.
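For the 640 MB example, simple arithmetic says 640 MB / 64 MB = 10 blocks; with the default replication factor of 3 that is 30 block replicas for the namenode to spread across the 5 datanodes, roughly 6 per node. A minimal sketch of the related point that block size is a per-file property fixed at write time, via the long-standing FileSystem.create overload; the path and buffer size are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWith64MBBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Block size is fixed per file at create time; 64 MB here,
        // independent of the cluster-wide default.
        long blockSize = 64L * 1024 * 1024;
        FSDataOutputStream out = fs.create(
            new Path("/user/syed/640mb-file"),  // placeholder path
            true,                               // overwrite
            4096,                               // io buffer size
            (short) 3,                          // replication factor
            blockSize);
        // ... writing 640 MB here would produce 10 blocks / 30 replicas ...
        out.close();
    }
}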
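On Allen's "# of maps == # of blocks": that is the FileInputFormat default of one split per block, and "telling it otherwise" means changing the split size. FileInputFormat computes splitSize = max(minSize, min(maxSize, blockSize)), so raising the minimum above the block size packs several blocks into one split. A sketch against the new mapreduce API, assuming Hadoop 2.x for Job.getInstance; the rest of the job wiring is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FewerMapsThanBlocks {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.addInputPath(job, new Path("/user/syed/640mb-file"));

        // splitSize = max(minSize, min(maxSize, blockSize)). With 64 MB
        // blocks, forcing a 128 MB minimum yields 128 MB splits: the
        // 640 MB file runs with ~5 maps instead of 10, at some cost in
        // data locality since each split now spans two blocks.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);

        // ... set mapper/reducer/output and submit as usual ...
    }
}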
