Hi Will-

On Tuesday 11 October 2011, Will Maier wrote:
> As I described earlier, our ~1PB cluster is very similar: some large
> servers, some small ones. We store data recorded at and simulated for the
> CMS detector at the LHC, as well as the products of our users' physics
> analysis. Like you, much of our data can be redownloaded to our Tier2 from
> the nearest Tier1 (Fermilab, USA).
Indeed, it's a similar setup, at a smaller scale: a Tier2 for CMS at UCLouvain, Belgium, hosting data and simulation from the whole collaboration, plus stageout files from the Grid for our local users ...

> And still, we've gone with a fairly standard configuration, with 2x
> replication across the board. I strongly suggest trying to design your
> cluster to exploit the good parts of HDFS instead of making it look like
> whatever your previous system was (ours was dCache, a distributed disk
> cache developed in and for the high energy physics community). HDFS doesn't
> map replicas to servers and works best with a replication factor >=2. These
> were new constraints for us when we migrated our storage cluster, but the
> benefits of storing data on commodity servers (our worker nodes) with
> inbuilt replication/resiliency and great throughput more than compensated.

We were working with a kind of homemade FUSE solution built on top of NFS, with a metadata server hosting symbolic links to the different mass storage servers. The data was spread over the mass storage servers at the file level, not the block level. It achieved great performance and worked very well as long as everything was ... working well. For example, when an NFS server fails, the clients behave strangely (the whole NFS mount hangs), and there are no clean admin tools.

For us, it is difficult to set 2x replication while maintaining the current amount of data we serve to the Collaboration. An alternative would be to switch the mass storage servers, which currently each expose one large ZFS partition, to servers exposing as many independent partitions as the number of drives they host (typically 70 drives per server), and to set 2x replication at the Hadoop level. The capacity freed by dropping RAID is not enough to compensate for the replication overhead, but it would be a first step.

Vincent
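P.S. In case it helps to make the per-drive idea concrete, here is a minimal sketch of the hdfs-site.xml side of it; the /data/diskNN mount points are only placeholder names, one local directory per physical drive, and dfs.replication sets the cluster-wide default to 2:

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- one local directory per physical drive (JBOD), comma-separated -->
    <value>/data/disk01/hdfs,/data/disk02/hdfs,/data/disk03/hdfs</value>
  </property>

The replication factor can also be changed per path for data already in HDFS, e.g. "hadoop fs -setrep -R 2 /store/data" (the /store/data path is just an example), so the most critical datasets could be brought to 2x first while the rest stays at 1x.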