The question is: can you use LVM to do mirroring and striping at the same time, or do you need to use software RAID1 below LVM for that?
With multiple chassis, the risk is that one chassis goes down and takes its part of a striped LVM volume with it, which takes the whole filesystem offline. If the drives become corrupted for some reason, that can corrupt all of your data. If you mirror between two physical volumes in two enclosures and then stripe across those mirrors, you can mitigate that.

Michael

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Joe Landman
Sent: Thursday, May 04, 2006 11:17 AM
To: Dan Stromberg
Cc: [email protected]; Robert Latham
Subject: Re: Large FOSS filesystems,was Re: [Beowulf] 512 nodes Myrinet cluster Challanges

Dan Stromberg wrote:
> On a somewhat related note, are there any FOSS filesystems that can
> surpass 16 terabytes in a single filesystem - reliably?

http://oss.sgi.com/projects/xfs/

Quoting:

"
Maximum File Size

For Linux 2.4, the maximum accessible file offset is 16TB on 4K page size and 64TB on 16K page size. For Linux 2.6, when using 64 bit addressing in the block devices layer (CONFIG_LBD), file size limit increases to 9 million terabytes (or the device limits).

Maximum Filesystem Size

For Linux 2.4, 2 TB. For Linux 2.6 and beyond, when using 64 bit addressing in the block devices layer (CONFIG_LBD) and a 64 bit platform, filesystem size limit increases to 9 million terabytes (or the device limits). For these later kernels on 32 bit platforms, 16TB is the current limit even with 64 bit addressing enabled in the block layer.
"

You can put up to 48 drives in a single 5U chassis, so in theory each chassis could give you 24 TB raw. If you want hundreds of TB to PB and larger in a single file system, you are going to have to go to a cluster FS of some sort.

> Even something like a 64 bit linux system aggregating gnbd exports or
> similar with md or lvm and xfs or reiserfs (or whatever filesystem)
> would count, if it works reliably.

Testing the reliability of something like this would be hard.
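As a rough aid for the numbers above, here is a minimal Python sketch of the quoted XFS filesystem limits and the per-chassis arithmetic. The function and constant names are made up for illustration, and the values are only those stated in the quote and in the 48-drive / 24 TB figure; cases the quote does not cover return None rather than a guess.

```python
import math

# Figures taken from the text: 48 drives per 5U chassis, 24 TB raw per chassis.
DRIVES_PER_CHASSIS = 48
RAW_TB_PER_CHASSIS = 24


def xfs_max_filesystem_tb(kernel, lbd, bits):
    """Quoted XFS filesystem size limit in TB (hypothetical helper).

    kernel: "2.4" or "2.6"; lbd: CONFIG_LBD enabled; bits: 32 or 64.
    Returns None where the quoted text does not state a limit.
    """
    if kernel == "2.4":
        return 2
    if kernel == "2.6" and lbd:
        if bits == 64:
            return 9_000_000  # "9 million terabytes (or the device limits)"
        if bits == 32:
            return 16
    return None


def implied_drive_tb():
    """Drive size implied by 24 TB raw across 48 drives."""
    return RAW_TB_PER_CHASSIS / DRIVES_PER_CHASSIS


def chassis_needed(target_raw_tb):
    """Number of full chassis needed to reach a raw capacity target."""
    return math.ceil(target_raw_tb / RAW_TB_PER_CHASSIS)


if __name__ == "__main__":
    print(xfs_max_filesystem_tb("2.6", True, 32))  # limit on a 32 bit platform
    print(implied_drive_tb())                      # TB per drive
    print(chassis_needed(100))                     # chassis for 100 TB raw
```

This only counts raw capacity; any mirroring or RAID overhead would reduce the usable figure.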
I would strongly suggest that you aim for reasonable failure modes (as compared to spectacular ones) and graceful reduction in service (rather than abrupt). This means that you would likely want to do lots of mirroring rather than RAID5/6. You would also want private and redundant storage networks behind your gnbd. Not necessarily SAN level, though you could do that.

lvm/md requires a local block device (last I checked), so you would need a gnbd below it if you wanted to cluster things. In this case, I might suggest building large RAID10s (pairs of mirrors), and having each unit do as much of the IO work as possible on a dedicated, high-quality card. Each RAID10 subunit would have about 4 "devices" attached as a stripe across mirrors. Without expensive hardware, the stripe/mirror would need to be done in software (lvm level). This may have serious issues unless you can make your servers redundant as well.

If you use iSCSI or similar bits, or even AoE, you can solve the block problem. I can have each tray of Coraid disks (for example; this could be done with iSCSI as well) appear as a single block device. I can then run lvm and build a huge file system. With a little extra work, we can build a second path to another tray of disks and set up an lvm mirror (RAID1). That's 7.5TB mirrored. Now add in additional trays in mirrored pairs, and use LVM to stripe or concatenate across them.

In the iSCSI case and in the AoE case, the issue will be having sufficient bandwidth to the trays for the large file system. You will want as many high speed connections as possible to avoid oversubscribing any one of them. With IB interconnects (or 10GbE) it shouldn't be too hard to have multiple trays per network connection (disks will be slower than the net). With AoE, you will need a multiport GbE card or two (disks close to same speed as net).

We have built large xfs and ext3 file systems on such units. I wouldn't recommend the latter (or reiserfs) for this.
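To make the failure-mode argument concrete, here is a small Python sketch comparing a plain stripe across trays with a stripe across mirrored pairs. It is only a toy model under stated assumptions: the tray names and the two-enclosure layout are hypothetical, each tray is taken to be 7.5 TB (from the mirrored figure above), and a volume counts as online only if every stripe leg still has at least one surviving copy.

```python
TRAY_TB = 7.5  # assumed size of one tray, from "7.5TB mirrored"


def volume_online(legs, failed_trays):
    """True if every stripe leg still has at least one working tray.

    legs: list of tuples; each tuple holds the trays that mirror each other
    (a 1-tuple is an unmirrored leg). failed_trays: iterable of tray names.
    """
    failed = set(failed_trays)
    return all(any(tray not in failed for tray in leg) for leg in legs)


def usable_tb(legs, tray_tb=TRAY_TB):
    """Usable capacity: one tray's worth per leg, however many mirrors it has."""
    return len(legs) * tray_tb


# Hypothetical layout: trays A1, A2 sit in enclosure A; B1, B2 in enclosure B.
plain_stripe = [("A1",), ("A2",), ("B1",), ("B2",)]
stripe_of_mirrors = [("A1", "B1"), ("A2", "B2")]  # each mirror spans enclosures

enclosure_a_down = {"A1", "A2"}

if __name__ == "__main__":
    print(volume_online(plain_stripe, enclosure_a_down))       # False
    print(volume_online(stripe_of_mirrors, enclosure_a_down))  # True
    print(usable_tb(plain_stripe), usable_tb(stripe_of_mirrors))
```

The mirrored layout gives up half the raw capacity, and it still goes offline if both trays of the same mirror fail, which is why redundant paths and servers matter as well.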
JFS is reported to be quite good for larger file systems as well.

Basically, with the right FS and the right setup, it is doable, though management will be a challenge. Lustre may or may not help on this. Some vendors are pushing it hard. Some are pushing GPFS hard. YMMV.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
