I'm not sure I understand why you distinguish between the small HFiles and the single behemoth HFile. Are you trying to understand more about disk space or about I/O patterns?
It looks like your understanding is correct. At the worst point, a given Region will use twice its disk space during a major compaction. Once the compaction is complete, the original files are deleted from HDFS. So it is not the case that your entire dataset will require double the space for compactions, as they are not all run concurrently.

JG

> -----Original Message-----
> From: Vidhyashankar Venkataraman [mailto:vidhy...@yahoo-inc.com]
> Sent: Monday, May 17, 2010 11:56 AM
> To: hbase-user@hadoop.apache.org
> Cc: Joel Koshy
> Subject: Additional disk space required for Hbase compactions..
>
> Hi guys,
>   I am quite new to HBase. I am trying to figure out the maximum
> additional disk space required for compactions.
>
> If the set of small HFiles amounts to a total size of U before a
> major compaction happens, and the 'behemoth' HFile has size M, then
> assuming the resultant HFile after compaction has size U+M (the worst
> case, with only insertions) and a replication factor of r, the disk
> space taken by the HFiles is 2r(U+M). Is this estimate reasonable?
> (This is also based on my understanding that compactions happen on
> HDFS and not on the local file system: am I correct?)
>
> Thank you
> Vidhya
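A minimal sketch of the per-Region estimate discussed above, assuming illustrative values for U, M, and r (none of these numbers come from HBase itself; this is just the arithmetic from the question):

```java
// Rough sketch of the per-Region peak-disk estimate from the thread above.
// U, M, and r below are illustrative example values, not read from HBase.
public class CompactionSpaceEstimate {

    /**
     * Peak HDFS bytes used by one Region while a major compaction runs:
     * the existing HFiles (U + M) plus the new merged HFile (up to U + M
     * in the insert-only worst case), each replicated r times.
     */
    static long peakBytesDuringMajorCompaction(long smallHFilesBytes,
                                               long behemothHFileBytes,
                                               int replicationFactor) {
        long logical = smallHFilesBytes + behemothHFileBytes;   // U + M
        return 2L * replicationFactor * logical;                // 2r(U + M)
    }

    public static void main(String[] args) {
        long u = 2L * 1024 * 1024 * 1024;   // 2 GB of small HFiles (example)
        long m = 10L * 1024 * 1024 * 1024;  // 10 GB behemoth HFile (example)
        int r = 3;                          // HDFS replication factor (example)

        long steady = r * (u + m);          // space once compaction finishes
        long peak = peakBytesDuringMajorCompaction(u, m, r);

        System.out.printf("steady state: %d GB, peak during compaction: %d GB%n",
                steady >> 30, peak >> 30);
    }
}
```

Note that the 2r(U+M) figure applies only to the Region(s) actually compacting at that moment; as JG says, the old files are deleted once the compaction finishes, so the whole dataset never needs double the space at once.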