I'm not sure I understand why you distinguish between the small HFiles and the 
single behemoth HFile. Are you trying to learn more about disk space usage or 
about I/O patterns?

It looks like your understanding is correct.  At the worst point, a given 
Region will use twice its disk space during a major compaction.  Once the 
compaction is complete, the original files are deleted from HDFS.  So it is not 
the case that your entire dataset will require double the space for compactions, 
since they are not all run concurrently.
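
To make the arithmetic concrete, here is a minimal sketch of the per-Region 
estimate.  The 2 GB / 8 GB file sizes and the replication factor of 3 are my 
own illustrative assumptions, not values from this thread; the formula is the 
2r(U+M) from your question below.

    // Worst-case HDFS usage for one Region during a major compaction.
    // Assumes the compacted output is as large as the sum of its inputs
    // (worst case, inserts only, no deletes or overwrites).
    public class CompactionSpaceEstimate {
        public static void main(String[] args) {
            long smallHFilesBytes   = 2L * 1024 * 1024 * 1024; // U: small HFiles, e.g. 2 GB (assumed)
            long behemothHFileBytes = 8L * 1024 * 1024 * 1024; // M: existing large HFile, e.g. 8 GB (assumed)
            int replication = 3;                                // r: HDFS replication factor (assumed)

            // While the compaction runs, both the inputs (U + M) and the new
            // output file (up to U + M) exist on HDFS at the same time,
            // and every block is replicated r times.
            long peak = 2L * replication * (smallHFilesBytes + behemothHFileBytes);

            // Once the compaction finishes, the input files are deleted and
            // usage drops back to r * (U + M).
            long steadyState = (long) replication * (smallHFilesBytes + behemothHFileBytes);

            System.out.printf("Peak during compaction: %d GB%n", peak / (1024 * 1024 * 1024));
            System.out.printf("After compaction:       %d GB%n", steadyState / (1024 * 1024 * 1024));
        }
    }

With these example numbers the Region peaks at 60 GB and settles back to 
30 GB.  Since only some Regions compact at any given time, the cluster-wide 
peak stays well below 2r times the full dataset.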

JG

> -----Original Message-----
> From: Vidhyashankar Venkataraman [mailto:vidhy...@yahoo-inc.com]
> Sent: Monday, May 17, 2010 11:56 AM
> To: hbase-user@hadoop.apache.org
> Cc: Joel Koshy
> Subject: Additional disk space required for Hbase compactions..
> 
> Hi guys,
>   I am quite new to Hbase.. I am trying to figure out the max
> additional disk space required for compactions..
> 
>   If the set of small HFiles amounts to a total size of U before a
> major compaction happens, the 'behemoth' HFile has size M, the
> resultant HFile after compaction has size U+M (worst case, with only
> insertions), and the replication factor is r, then the disk space
> taken by the HFiles is 2r(U+M).. Is this estimate reasonable? (This is
> also based on my understanding that compactions happen on HDFS and not
> on the local file system: am I correct?)...
> 
> Thank you
> Vidhya
> 
