Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPS, but you can
close the gap somewhat with tuning: ZFS ashift/recordsize and special
allocation class vdevs. While IOPS performance favors
nvme/mdraid/ldiskfs, there are tradeoffs. The snapshot/backup abilities
of ZFS, and the security it provides to the most critical function in a
Lustre file system, shouldn't be undervalued. From personal experience,
I'd much rather deal with ZFS in the event of a seriously jackknifed
MDT than with mdraid/ldiskfs, and both are preferable to trying to
unscramble a vendor's black-box hwraid volume. ;-)
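For concreteness, that tuning might look roughly like this on a
hypothetical six-drive MDS (device names, pool name, fsname, and the
mgsnode below are all made up; ashift=12 assumes 4K-sector NVMe):

    # Pool of mirrored NVMe pairs plus a mirrored special
    # allocation class vdev for metadata/small-block I/O
    zpool create -o ashift=12 mdt0pool \
        mirror /dev/nvme0n1 /dev/nvme1n1 \
        mirror /dev/nvme2n1 /dev/nvme3n1 \
        special mirror /dev/nvme4n1 /dev/nvme5n1

    # Format the MDT on the pool; mkfs.lustre creates the dataset,
    # and ZFS properties can still be tuned on it afterwards
    mkfs.lustre --mdt --backfstype=zfs --fsname=testfs --index=0 \
        --mgsnode=mgs@o2ib mdt0pool/mdt0
    zfs set recordsize=128K dnodesize=auto mdt0pool/mdt0

How much a special vdev buys you on an all-NVMe pool is
workload-dependent; it pays off most when slower data vdevs are in the
mix.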
When ZFS directio lands and is fully integrated into Lustre, the
performance differences *should* be negligible.

Just my $.02 worth

On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:
>
> Hi Cameron,
>
> did you run a performance comparison between ZFS and mdadm-raid on
> the MDTs? I'm currently doing some tests, and the results favor
> software raid, in particular when it comes to IOPS.
>
> Regards
> Thomas
>
> On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:
> > This doesn't answer your question about ldiskfs on zvols, but we've
> > been running MDTs on ZFS on NVMe in production for a couple of
> > years (and on SAS SSDs for many years prior). Our current
> > production MDTs using NVMe consist of one zpool per node made up of
> > 3x 2-drive mirrors, but we've lately been experimenting with
> > raidz3, and possibly even raidz2, for MDTs, since SSDs have been
> > pretty reliable for us.
> >
> > Cameron
> >
> > On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology,
> > Inc.] via lustre-discuss wrote:
> >> We are in the process of retiring two long-standing LFSs (about 8
> >> years old) that we built and managed ourselves. Both use ZFS and
> >> have the MDTs on SSDs in a JBOD, requiring the kind of
> >> software-based management you describe - in our case, ZFS pools
> >> built on multipath devices. The MDT in one is ZFS; the MDT in the
> >> other is ldiskfs built on top of a ZFS zvol, as you describe.
> >> Generally this has worked well for us, with one big caveat. If you
> >> look for my posts to this list and the ZFS list you'll find more
> >> details. The short version is that we use ZFS snapshots and clones
> >> to back up the metadata. We've run into situations where the
> >> backup process stalls, leaving a clone hanging around. A couple of
> >> times the clone and the primary zvol got swapped, effectively
> >> rolling our metadata back to the point when the clone was created.
> >> I have tried, unsuccessfully, to recreate that in a test
> >> environment. So if you do that kind of setup, make sure you have
> >> good monitoring in place to detect stalled backups/clones. We've
> >> kept up with Lustre and ZFS updates over the years and are
> >> currently on Lustre 2.14 and ZFS 2.1. We've seen the gap between
> >> our ZFS MDT and ldiskfs performance shrink to the point where the
> >> two are pretty much on par with each other now. I think our ZFS
> >> MDT performance could be better with more hardware and software
> >> tuning, but our small team hasn't had the bandwidth to tackle
> >> that.
> >>
> >> Our newest LFS is vendor-provided and uses NVMe MDTs. I'm not at
> >> liberty to talk about the proprietary way those devices are
> >> managed, but the metadata performance is SO much better than on
> >> our older LFSs, for a lot of reasons, and I'd highly recommend
> >> NVMe for your MDTs.
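For reference, the zvol-backed ldiskfs MDT and snapshot/clone backup
cycle described above might look roughly like this (pool, volume,
sizes, fsname, and mgsnode are made-up placeholders, not Darby's actual
setup):

    # ldiskfs MDT built on top of a ZFS zvol
    zfs create -V 4T -o volblocksize=4K mdtpool/mdtvol
    mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
        --mgsnode=mgs@o2ib /dev/zvol/mdtpool/mdtvol

    # Backup via snapshot + clone; the clone must be destroyed
    # before its origin snapshot can be
    zfs snapshot mdtpool/mdtvol@backup-20240105
    zfs clone mdtpool/mdtvol@backup-20240105 mdtpool/mdtvol-backup
    # ... mount the clone, copy the metadata off, then clean up:
    zfs destroy mdtpool/mdtvol-backup
    zfs destroy mdtpool/mdtvol@backup-20240105

Given the stalled-backup failure mode described above, a watchdog that
alerts on any leftover clone (any dataset with a non-empty origin)
would be one simple safeguard:

    zfs list -H -t filesystem,volume -o name,origin | awk '$2 != "-"'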
> >>
> >> -----Original Message-----
> >> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on
> >> behalf of Thomas Roth via lustre-discuss
> >> <lustre-discuss@lists.lustre.org>
> >> Reply-To: Thomas Roth <t.r...@gsi.de>
> >> Date: Friday, January 5, 2024 at 9:03 AM
> >> To: Lustre Diskussionsliste <lustre-discuss@lists.lustre.org>
> >> Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?
> >>
> >> Dear all,
> >>
> >> we are considering NVMe storage for the next MDS.
> >>
> >> As I understand it, NVMe disks are bundled in software, not by a
> >> hardware RAID controller. This would be done using Linux software
> >> raid, mdadm, correct?
> >>
> >> We have some experience with ZFS, which we use on our OSTs. But I
> >> would like to stick with ldiskfs for the MDTs - a zpool with a
> >> zvol on top, which is then formatted with ldiskfs, is too much
> >> voodoo...
> >>
> >> How is this handled elsewhere? Any experiences?
> >>
> >> The available devices are quite large. If I create a RAID-10 out
> >> of 4 disks of, e.g., 7 TB each, my MDT will be 14 TB - already
> >> close to the 16 TB limit. So there is no need for a box with lots
> >> of U.3 slots.
> >>
> >> But for MDS operations we will still need a powerful dual-CPU
> >> system with lots of RAM. Should the NVMe devices then be
> >> distributed between the CPUs? Is there a way to specify this in a
> >> call for tender?
> >>
> >> Best regards,
> >> Thomas
> >>
> >> --------------------------------------------------------------------
> >> Thomas Roth
> >>
> >> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> >> Planckstraße 1, 64291 Darmstadt, Germany
> >> http://www.gsi.de/
> >>
> >> Commercial Register / Handelsregister: Amtsgericht Darmstadt,
> >> HRB 1528
> >> Managing Directors / Geschäftsführung: Professor Dr. Paolo
> >> Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> >> Chairman of the Supervisory Board / Vorsitzender des
> >> GSI-Aufsichtsrats: State Secretary / Staatssekretär Dr. Volkmar
> >> Dietz
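To make Thomas's mdadm question concrete, the setup he describes might
be sketched like this (device names, fsname, and mgsnode are
placeholders):

    # RAID-10 across 4 NVMe drives, then ldiskfs MDT on top
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
        --mgsnode=mgs@o2ib /dev/md0

    # Check which NUMA node each NVMe controller hangs off of,
    # e.g. to verify a vendor balanced the drives across sockets
    for d in /sys/class/nvme/nvme[0-9]*; do
        echo "$d -> NUMA node $(cat $d/device/numa_node)"
    done

For a call for tender, one option is to require that the NVMe devices
be split evenly across the two CPU sockets and verify it with a readout
like the loop above.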
--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org