It will only take up ~1KB of local datanode disk space (plus metadata such as a CRC32 checksum for every 512 bytes, and replication at 1KB per extra replica, in this case 2KB more), but the real cost is a block entry in the NameNode: all block metadata at the NameNode lives in memory, which is a much scarcer resource for the cluster, relatively speaking.
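To make that concrete, here is a minimal Java sketch against the HDFS FileSystem API (the path /tmp/tiny-file is made up for illustration, and it assumes a reachable cluster via the default configuration). The reported block size is per-file metadata; the length is the actual bytes stored on disk.

    // Minimal sketch: write a 1KB file and compare its actual length
    // with its configured block size. Assumes a running HDFS cluster.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallFileDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a 1KB file into HDFS (path is hypothetical).
        Path p = new Path("/tmp/tiny-file");
        FSDataOutputStream out = fs.create(p);
        out.write(new byte[1024]);
        out.close();

        FileStatus st = fs.getFileStatus(p);
        // getLen() reports the actual bytes stored (1024), while
        // getBlockSize() reports the configured block size (e.g. 64MB):
        // the block size is per-file metadata, not a unit of disk allocation.
        System.out.println("length     = " + st.getLen());
        System.out.println("block size = " + st.getBlockSize());
        // The file still costs one block entry in the NameNode's memory.
        System.out.println("blocks     = "
            + fs.getFileBlockLocations(st, 0, st.getLen()).length);
      }
    }

Run against a cluster with a 64MB block size, this should print length = 1024 but block size = 67108864 for the same file.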
On Fri, Jun 10, 2011 at 11:47 AM, Pedro Costa <psdc1...@gmail.com> wrote:
> So I'm still not getting how a 1KB file can cost a 64MB block. Can
> anyone explain this to me?
>
> On Fri, Jun 10, 2011 at 5:13 PM, Philip Zeyliger <phi...@cloudera.com> wrote:
>> On Fri, Jun 10, 2011 at 9:08 AM, Pedro Costa <psdc1...@gmail.com> wrote:
>>> Does this mean that, when HDFS reads a 1KB file from disk, it will
>>> put the data into blocks of 64MB?
>>
>> No.
>>
>>> On Fri, Jun 10, 2011 at 5:00 PM, Philip Zeyliger <phi...@cloudera.com> wrote:
>>>> On Fri, Jun 10, 2011 at 8:42 AM, Pedro Costa <psdc1...@gmail.com> wrote:
>>>>> But how can you say that a 1KB file will only use 1KB of disk
>>>>> space, if a block is configured as 64MB? In my view, if a 1KB file
>>>>> uses a 64MB block, the file will occupy 64MB on disk.
>>>>
>>>> A block in HDFS is the unit of distribution and replication, not the
>>>> unit of storage. HDFS uses the underlying file systems for physical
>>>> storage.
>>>>
>>>> -- Philip
>>>>
>>>>> How can you disassociate a 64MB HDFS data block from a disk block?
>>>>>
>>>>> On Fri, Jun 10, 2011 at 5:01 PM, Marcos Ortiz <mlor...@uci.cu> wrote:
>>>>>> On 06/10/2011 10:35 AM, Pedro Costa wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> If I define HDFS to use blocks of 64MB, and I store a 1KB file in
>>>>>> HDFS, will this file occupy 64MB in HDFS?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> HDFS is not very efficient at storing small files, because each
>>>>>> file is stored in a block (of 64MB in your case), and the block
>>>>>> metadata is held in memory by the NN. But you should know that
>>>>>> this 1KB file will only use 1KB of disk space.
>>>>>>
>>>>>> For small files, you can use Hadoop archives (see the sketch below).
>>>>>> Regards
>>>>>>
>>>>>> --
>>>>>> Marcos Luís Ortíz Valmaseda
>>>>>> Software Engineer (UCI)
>>>>>> http://marcosluis2186.posterous.com
>>>>>> http://twitter.com/marcosluis2186

--
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv
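A footnote on Marcos's pointer at Hadoop archives above: the archive tool packs many small files into a few large HDFS blocks, cutting the NameNode's per-block memory cost while keeping the files readable. Below is a hedged sketch of driving it programmatically; the paths under /user/pedro are made up, and it assumes org.apache.hadoop.tools.HadoopArchives is on the classpath (the usual route is the `hadoop archive` shell command).

    // Hedged sketch: create a Hadoop archive (HAR) from Java. The
    // directory names are hypothetical; this launches a MapReduce job,
    // so it needs a configured cluster.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.tools.HadoopArchives;
    import org.apache.hadoop.util.ToolRunner;

    public class ArchiveSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to the shell command:
        //   hadoop archive -archiveName files.har -p /user/pedro small-files /user/pedro
        String[] harArgs = {
            "-archiveName", "files.har",
            "-p", "/user/pedro",   // parent directory of the sources
            "small-files",         // directory of small files to pack
            "/user/pedro"          // where to place files.har
        };
        int rc = ToolRunner.run(conf, new HadoopArchives(conf), harArgs);
        System.exit(rc);
      }
    }

The archived files stay readable through the har:// filesystem (e.g. a Path like har:///user/pedro/files.har/small-files/a.txt), but many small files now share a handful of HDFS blocks, so the NameNode tracks far fewer block entries.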