Praveenesh, Yes, you are absolutely right: you can indeed store a >20 GB file on such a cluster (and have it replicated properly), because HDFS chunks writes into smaller, fixed-size blocks that get spread across the datanodes.
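
If it helps to see that block spread concretely, here is a minimal sketch (not from the original thread) using the standard Hadoop FileSystem API; the class name ShowBlockSpread and the path /data/bigfile are just placeholders for whatever file you have on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSpread {
    public static void main(String[] args) throws Exception {
        // Path of a file already written to HDFS; placeholder for this sketch.
        Path file = new Path(args.length > 0 ? args[0] : "/data/bigfile");

        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block is replicated (dfs.replication, typically 3) and its
        // replicas can sit on different datanodes, which is why one file can
        // exceed the capacity of any single node's local disk.
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d: offset=%d len=%d hosts=%s%n",
                    i, blocks[i].getOffset(), blocks[i].getLength(),
                    java.util.Arrays.toString(blocks[i].getHosts()));
        }
        fs.close();
    }
}

Running it against a large file should show the block locations scattered over all eight nodes.
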
On Thu, Jun 14, 2012 at 7:23 PM, praveenesh kumar <praveen...@gmail.com> wrote:
> @Harsh ---
>
> I was wondering...although it doesn't make much/any sense --- if a person
> wants to store the files only on HDFS (something like a backup) consider
> the above hardware scenario --- no MR processing, In that case, it should
> be possible to have a file with a size more than 20 GB to be stored on
> nodes with each having 20 GB hard disk, as replicas will be evenly
> distributed across the cluster, right ?
>
> Regards,
> Praveenesh
>
> On Thu, Jun 14, 2012 at 7:08 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Ondřej,
>>
>> If by processing you mean trying to write out (map outputs) > 20 GB of
>> data per map task, that may not be possible, as the outputs need to be
>> materialized and the disk space is the constraint there.
>>
>> Or did I not understand you correctly (in thinking you are asking
>> about MapReduce)? Cause you otherwise have ~50 GB space available for
>> HDFS consumption (assuming replication = 3 for proper reliability).
>>
>> On Thu, Jun 14, 2012 at 1:25 PM, Ondřej Klimpera <klimp...@fit.cvut.cz> wrote:
>> > Hello,
>> >
>> > we're testing application on 8 nodes, where each node has 20GB of local
>> > storage available. What we are trying to achieve is to get more than 20GB
>> > to be processed on this cluster.
>> >
>> > Is there a way how to distribute the data on the cluster?
>> >
>> > There is also one shared NFS storage disk with 1TB of available space,
>> > which is now unused.
>> >
>> > Thanks for your reply.
>> >
>> > Ondrej Klimpera
>>
>>
>>
>> --
>> Harsh J

--
Harsh J