Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPs but you can
close the gap somewhat with tuning, zfs ashift/recordsize and special
allocation class vdevs. While the IOPs performance favors
nvme/mdraid/ldiskfs there are tradeoffs. The snapshot/backup abilities
of ZFS and the security it provides to the most critical function in a
Lustre file system shouldn't be undervalued. From personal experience,
I'd much rather deal with zfs in the event of a seriously jackknifed
MDT than mdraid/ldiskfs and both zfs and mdraid/ldiskfs are preferable
to trying to unscramble a vendor blackbox hwraid volume. ;-)

When zfs directio lands and is fully integrated into Lustre the
performance differences *should* be negligible.

Just my $.02 worth

On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:
>
> Hi Cameron,
>
> did you run a performance comparison between ZFS and mdadm-raid on the MDTs?
> I'm currently doing some tests, and the results favor software raid, in 
> particular when it comes to IOPS.
>
> Regards
> Thomas
>
> On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:
> > This doesn't answer your question about ldiskfs on zvols, but we've been 
> > running MDTs on ZFS on NVMe in production for a couple years (and on SAS 
> > SSDs for many years prior). Our current production MDTs using NVMe consist 
> > of one zpool/node made up of 3x 2-drive mirrors, but we've been 
> > experimenting lately with using raidz3 and possibly even raidz2 for MDTs 
> > since SSDs have been pretty reliable for us.
> >
> > Cameron
> >
> > On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] 
> > via lustre-discuss wrote:
> >> We are in the process of retiring two long standing LFS's (about 8 years 
> >> old), which we built and managed ourselves.  Both use ZFS and have the 
> >> MDT'S on ssd's in a JBOD that require the kind of software-based 
> >> management you describe, in our case ZFS pools built on multipath devices. 
> >>  The MDT in one is ZFS and the MDT in the other LFS is ldiskfs but uses 
> >> ZFS and a zvol as you describe - we build the ldiskfs MDT on top of the 
> >> zvol.  Generally, this has worked well for us, with one big caveat.  If 
> >> you look for my posts to this list and the ZFS list you'll find more 
> >> details.  The short version is that we utilize ZFS snapshots and clones to 
> >> do backups of the metadata.  We've run into situations where the backup 
> >> process stalls, leaving a clone hanging around.  We've experienced a 
> >> situation a couple of times where the clone and the primary zvol get 
> >> swapped, effectively rolling back our metadata to the point when the clone 
> >> was created.  I have tried, unsuccessfully, to recreate
> >> that in a test environment.  So if you do that kind of setup, make sure 
> >> you have good monitoring in place to detect if your backups/clones stall.  
> >> We've kept up with lustre and ZFS updates over the years and are currently 
> >> on lustre 2.14 and ZFS 2.1.  We've seen the gap between our ZFS MDT and 
> >> ldiskfs performance shrink to the point where they are pretty much on par 
> >> to each now.  I think our ZFS MDT performance could be better with more 
> >> hardware and software tuning but our small team hasn't had the bandwidth 
> >> to tackle that.
> >>
> >> Our newest LFS is vendor provided and uses NVMe MDT's.  I'm not at liberty 
> >> to talk about the proprietary way those devices are managed.  However, the 
> >> metadata performance is SO much better than our older LFS's, for a lot of 
> >> reasons, but I'd highly recommend NVMe's for your MDT's.
> >>
> >> -----Original Message-----
> >> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org 
> >> <mailto:lustre-discuss-boun...@lists.lustre.org>> on behalf of Thomas Roth 
> >> via lustre-discuss <lustre-discuss@lists.lustre.org 
> >> <mailto:lustre-discuss@lists.lustre.org>>
> >> Reply-To: Thomas Roth <t.r...@gsi.de <mailto:t.r...@gsi.de>>
> >> Date: Friday, January 5, 2024 at 9:03 AM
> >> To: Lustre Diskussionsliste <lustre-discuss@lists.lustre.org 
> >> <mailto:lustre-discuss@lists.lustre.org>>
> >> Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?
> >>
> >>
> >> CAUTION: This email originated from outside of NASA. Please take care when 
> >> clicking links or opening attachments. Use the "Report Message" button to 
> >> report suspicious messages to the NASA SOC.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> Dear all,
> >>
> >>
> >> considering NVME storage for the next MDS.
> >>
> >>
> >> As I understand, NVME disks are bundled in software, not by a hardware 
> >> raid controller.
> >> This would be done using Linux software raid, mdadm, correct?
> >>
> >>
> >> We have some experience with ZFS, which we use on our OSTs.
> >> But I would like to stick to ldiskfs for the MDTs, and a zpool with a zvol 
> >> on top which is then formatted with ldiskfs - to much voodoo...
> >>
> >>
> >> How is this handled elsewhere? Any experiences?
> >>
> >>
> >>
> >>
> >> The available devices are quite large. If I create a raid-10 out of 4 
> >> disks, e.g. 7 TB each, my MDT will be 14 TB - already close to the 16 TB 
> >> limit.
> >> So no need for a box with lots of U.3 slots.
> >>
> >>
> >> But for MDS operations, we will still need a powerful dual-CPU system with 
> >> lots of RAM.
> >> Then the NVME devices should be distributed between the CPUs?
> >> Is there a way to pinpoint this in a call for tender?
> >>
> >>
> >>
> >>
> >> Best regards,
> >> Thomas
> >>
> >>
> >> --------------------------------------------------------------------
> >> Thomas Roth
> >>
> >>
> >> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> >> Planckstraße 1, 64291 Darmstadt, Germany, 
> >> https://urldefense.us/v3/__http://www.gsi.de/__;!!G2kpM7uM-TzIFchu!1QmOnUbmSPpZPcc39XFZ3S-Vk4Dmh-Q78Gpm8ylYUf6Zhv_zpb2VXkM4C5Uhh05x01MhjqJTYZ5boqzEhkx6JF_rGY74EQ$
> >>   
> >> <https://urldefense.us/v3/__http://www.gsi.de/__;!!G2kpM7uM-TzIFchu!1QmOnUbmSPpZPcc39XFZ3S-Vk4Dmh-Q78Gpm8ylYUf6Zhv_zpb2VXkM4C5Uhh05x01MhjqJTYZ5boqzEhkx6JF_rGY74EQ$
> >>  >
> >>
> >>
> >> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
> >> Managing Directors / Geschäftsführung:
> >> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> >> Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
> >> State Secretary / Staatssekretär Dr. Volkmar Dietz
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> lustre-discuss mailing list
> >> lustre-discuss@lists.lustre.org <mailto:lustre-discuss@lists.lustre.org>
> >> https://urldefense.us/v3/__http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!G2kpM7uM-TzIFchu!1QmOnUbmSPpZPcc39XFZ3S-Vk4Dmh-Q78Gpm8ylYUf6Zhv_zpb2VXkM4C5Uhh05x01MhjqJTYZ5boqzEhkx6JF9_AFR58A$
> >>   
> >> <https://urldefense.us/v3/__http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!G2kpM7uM-TzIFchu!1QmOnUbmSPpZPcc39XFZ3S-Vk4Dmh-Q78Gpm8ylYUf6Zhv_zpb2VXkM4C5Uhh05x01MhjqJTYZ5boqzEhkx6JF9_AFR58A$
> >>  >
> >>
> >>
> >>
> >> _______________________________________________
> >> lustre-discuss mailing list
> >> lustre-discuss@lists.lustre.org
> >> https://urldefense.us/v3/__http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!G2kpM7uM-TzIFchu!1QmOnUbmSPpZPcc39XFZ3S-Vk4Dmh-Q78Gpm8ylYUf6Zhv_zpb2VXkM4C5Uhh05x01MhjqJTYZ5boqzEhkx6JF9_AFR58A$
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
  • [... Thomas Roth via lustre-discuss
    • ... Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss
      • ... Cameron Harr via lustre-discuss
        • ... Thomas Roth via lustre-discuss
          • ... Jeff Johnson
            • ... Cameron Harr via lustre-discuss
              • ... Thomas Roth via lustre-discuss
                • ... Cameron Harr via lustre-discuss
      • ... Vinícius Ferrão via lustre-discuss
        • ... Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss

Reply via email to