On 12/06/12 18:56, Ellis H. Wilson III wrote:
> On 06/08/12 20:06, Bill Broadley wrote:
>> A new user on one of my GigE clusters submits batches of 500 jobs that
>> need to randomly read a 30-60GB dataset. They aren't the only user of
>> said cluster so each job will be waiting in the queue with a mix of others.
>
> With a 160TB cluster and only a 30-60GB dataset, is there any reason why
> the user isn't simply storing their dataset in HDFS? Does the data
> change frequently via a non-MapReduce framework such that it needs to be
> pulled from NFS before every job? If the dataset is in a few dozen
> files and in HDFS in the cluster, there is no reason why MapReduce
> shouldn't spawn it's tasks directly "on" the data, without need (most of
> the time) for moving all of the data to every node as you mention.

From experience this can have varied results and still requires careful management and thought. With HDFS, if the replication factor is 3 (often the default) and the 30 node cluster has 500 jobs, then there are two options. Either an initial step is required to replicate the data to all the other cluster nodes before performing the analysis (which imposes the expected network / disk IO impact and job start-up latency already in place), or you keep the replication at 3 (or some other defined number) and limit the number of jobs to the resources available on the nodes where replicas already exist. The challenge is finding the sweet spot for the work in progress and, as always, nothing is ever free. So HDFS does not remove the replication process, although it helps to hide the processes involved.

The other joy encountered with HDFS is that we found it can be less than stable in a multi-user environment. This has been confirmed by various others, so as always care is required during testing.

There are alternatives to HDFS which can be used in conjunction with Hadoop, but I'm afraid I'm not able to recommend any in particular as it's been a while since I last kicked the tyres. Do others have more recent experience and an alternative they can recommend?

Pete
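
P.S. To put rough numbers on the replication trade-off above, here is a minimal back-of-the-envelope sketch in Python. It is only an illustration under stated assumptions: it treats each replica as a whole copy of the dataset sitting on one node (real HDFS spreads blocks across nodes, so this is a simplification), and `SLOTS_PER_NODE` is a hypothetical figure, not something taken from the cluster in question. The node count, job count and dataset size are the ones quoted in the thread.

```python
import math

def local_capacity(nodes, replication, slots_per_node):
    """Jobs that can run at once on nodes that hold a copy of the data."""
    return min(replication, nodes) * slots_per_node

def waves_needed(jobs, capacity):
    """Number of rounds needed to push all jobs through the given capacity."""
    return math.ceil(jobs / capacity)

def extra_copy_gb(current_replication, target_replication, dataset_gb):
    """Data that must be copied over the network to reach the target replication."""
    return max(target_replication - current_replication, 0) * dataset_gb

NODES = 30           # from the thread
JOBS = 500           # from the thread
DATASET_GB = 60      # upper end of the 30-60GB range in the thread
SLOTS_PER_NODE = 8   # hypothetical: concurrent jobs one node can host

if __name__ == "__main__":
    # Compare keeping replication at 3 with replicating to every node.
    for replication in (3, NODES):
        capacity = local_capacity(NODES, replication, SLOTS_PER_NODE)
        print(
            f"replication={replication}: "
            f"{capacity} data-local jobs at once, "
            f"{waves_needed(JOBS, capacity)} waves, "
            f"{extra_copy_gb(3, replication, DATASET_GB)} GB extra copied"
        )
```

The point is only the shape of the trade-off: more replicas mean fewer waves of jobs but more data moved up front, and the sweet spot depends on figures such as slots per node that have to be measured on the real cluster.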