On Mon, May 17, 2010 at 3:03 PM, Jonathan Gray <jg...@facebook.com> wrote:
> I'm not sure I understand why you distinguish small HFiles and a single
> behemoth HFile? Are you trying to understand more about disk space or I/O
> patterns?
>
> It looks like your understanding is correct. At the worst point, a given
> Region will use twice its disk space during a major compaction. Once the
> compaction is complete, the original files are deleted from HDFS. So it is
> not the case that your entire dataset will require double the space for
> compactions, as they are not all run concurrently.
>
> JG
>
> > -----Original Message-----
> > From: Vidhyashankar Venkataraman [mailto:vidhy...@yahoo-inc.com]
> > Sent: Monday, May 17, 2010 11:56 AM
> > To: hbase-user@hadoop.apache.org
> > Cc: Joel Koshy
> > Subject: Additional disk space required for Hbase compactions..
> >
> > Hi guys,
> > I am quite new to HBase. I am trying to figure out the maximum
> > additional disk space required for compactions.
> >
> > If the set of small HFiles amounts to a total size of U before a
> > major compaction happens, the 'behemoth' HFile has size M, the
> > resultant size of the HFile after compaction is U+M (the worst case,
> > with only insertions), and the replication factor is r, then the disk
> > space taken by the HFiles is 2r(U+M). Is this estimate reasonable?
> > (This is also based on my understanding that compactions happen on
> > HDFS and not on the local file system: am I correct?)
> >
> > Thank you
> > Vidhya
> >

I do not have the answer, but let me warn you that major compaction is VERY
VERY I/O intensive. In our case, it took a system that would typically respond
to a get request in 1-2 ms up to 30 ms, or at times 2000 ms. That was not
acceptable for us. I think in most normal cases it is not good to force the
issue with a major compaction.
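
For what it's worth, here is a rough back-of-the-envelope sketch in Python of
the estimate from Vidhya's question. The figures and variable names are purely
illustrative (not from any real cluster); U, M, and r are the symbols used in
the original message.

    # Illustrative worst-case disk usage for one region's major compaction.
    U = 10.0   # total size of the small HFiles in the region, in GB (made up)
    M = 90.0   # size of the existing 'behemoth' HFile, in GB (made up)
    r = 3      # HDFS replication factor

    # Steady state: the region's HFiles occupy r * (U + M) on HDFS.
    steady = r * (U + M)

    # While the major compaction writes the new merged HFile (worst case,
    # insertions only, so the output is also U + M), the old files still
    # exist, so the region briefly needs about twice its usual space.
    peak = 2 * r * (U + M)

    print("steady state: %.0f GB, peak during major compaction: %.0f GB"
          % (steady, peak))

As JG says above, that doubling is per region, not for the whole dataset at
once, since regions are not all major-compacted concurrently.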