On 14/10/16 14:38, Dilger, Andreas wrote:


John, with newer Lustre clients it is possible for multiple threads to submit non-overlapping writes concurrently (provided they also do not conflict within a single page); see LU-1669 for details.

Even so, O_DIRECT writes need to be synchronous to disk on the OSS, as Patrick reports, because if the OSS fails before the write is on disk there is no cached copy of the data on the client that can be used to resend the RPC.

The problem is that the ZFS OSD has very long transaction commit times for synchronous writes because it does not yet have support for the ZIL. Using buffered writes avoids this; if you really want to use O_DIRECT, then very large O_DIRECT writes (e.g. 40MB or larger) combined with large RPCs (4MB, or up to 16MB in 2.9.0) may be beneficial to amortize the sync overhead.
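
For illustration only (a rough sketch, not a tested recipe; the mount point is a placeholder and the achievable RPC size depends on your Lustre version and server settings), on a client this would look something like:

# check the current maximum RPC size (in 4KB pages; 256 = 1MB)
lctl get_param osc.*.max_pages_per_rpc

# raise it to 4MB RPCs (1024 pages); the servers must also support this size
lctl set_param osc.*.max_pages_per_rpc=1024

# issue very large O_DIRECT writes so each sync covers much more data
dd if=/dev/zero of=/mnt/lustre/testfile bs=40M count=25 oflag=direct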

Riccardo,

The other potential issue is that you have 20 OSTs on a single OSS, which isn't going to have very good performance. Spreading the OSTs across multiple OSS nodes is going to improve your performance significantly when there are multiple clients writing, as there will be N times the OSS network bandwidth, N times the CPU, N times the RAM. It only makes sense to have 20 OSTs/OSS if your workload is only a single client and you want the maximum possible capacity for a given cost.


Hello Andreas,
each OST has a separate VDEV and a separate zpool.
Thank you

Is each OST a separate VDEV and separate zpool, or are they a single zpool? Separate zpools have less overhead for maximum performance, but only one VDEV per zpool means that metadata ditto blocks are written twice per RAID-Z2 VDEV, which isn't very efficient. Having at least 3 VDEVs per zpool is better in this regard.
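
(If unsure, something like the following on the OSS shows the layout; the pool name below is only an example. A single "raidz2" group listed under a pool means that pool has one VDEV.)

zpool list -v
zpool status ost0pool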

Cheers, Andreas

--

Andreas Dilger

Lustre Principal Architect

Intel High Performance Data Division

On 2016/10/14, 15:22, "John Bauer" <bau...@iodoctors.com> wrote:

Patrick

I thought at one time there was an inode lock held for the duration of a direct I/O read or write, so that even if one had multiple application threads writing direct, only one was "in flight" at a time. Has that changed?

John

Sent from my iPhone


On Oct 14, 2016, at 3:16 PM, Patrick Farrell <p...@cray.com> wrote:

    Sorry, I phrased one thing wrong:
    I said "transferring to the network", but I believe the write
    actually blocks until the client has received confirmation that
    the data was received successfully.

    In any case, only one I/O (per thread) can be outstanding at a
    time with direct I/O.

    ------------------------------------------------------------------------

    *From:* lustre-discuss <lustre-discuss-boun...@lists.lustre.org>
    on behalf of Patrick Farrell <p...@cray.com>
    *Sent:* Friday, October 14, 2016 3:12:22 PM
    *To:* Riccardo Veraldi; lustre-discuss@lists.lustre.org
    *Subject:* Re: [lustre-discuss] Lustre on ZFS poor direct I/O
    performance

    Riccardo,

    While the difference is extreme, direct I/O write performance will
    always be poor.  Direct I/O writes cannot be asynchronous, since
    they don't use the page cache.  This means Lustre cannot return
    from one write (and start the next) until it has finished
    transferring the data to the network.

    This means you can only have one I/O in flight at a time. Good
    write performance from Lustre (or any network filesystem) depends
    on keeping a lot of data in flight at once.
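
    As an illustration only (a sketch with placeholder paths and sizes),
    running several direct-I/O writers to separate files in parallel is
    one way to keep more than one I/O in flight overall:

    for i in 1 2 3 4; do
        dd if=/dev/zero of=/mnt/lustre/dio_test.$i bs=1M count=1024 oflag=direct &
    done
    wait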

    What sort of direct write performance were you hoping for? It will
    never match that 800 MB/s from one thread you see with buffered I/O.

    - Patrick

    ------------------------------------------------------------------------

    *From:* lustre-discuss <lustre-discuss-boun...@lists.lustre.org>
    on behalf of Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
    *Sent:* Friday, October 14, 2016 2:22:32 PM
    *To:* lustre-discuss@lists.lustre.org
    *Subject:* [lustre-discuss] Lustre on ZFS poor direct I/O
    performance

    Hello,

    I would like to know how I may improve the situation of my Lustre cluster.

    I have 1 MDS and 1 OSS with 20 OSTs defined.

    Each OST is an 8-disk RAIDZ2.

    Single-process write performance is around 800MB/sec. However, if I
    force direct I/O, for example using oflag=direct in dd with a 1MB
    block size, the write performance drops as low as 8MB/sec, and each
    write has about 120ms of latency.
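
    For reference, the test was essentially of this form (the output
    path and count here are only placeholders):

    dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=1000 oflag=direct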

    I used these ZFS settings:

    options zfs zfs_prefetch_disable=1
    options zfs zfs_txg_history=120
    options zfs metaslab_debug_unload=1
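
    (These go in a modprobe configuration file; assuming ZFS on Linux,
    the values currently in effect can be checked like this:)

    cat /sys/module/zfs/parameters/zfs_prefetch_disable
    cat /sys/module/zfs/parameters/zfs_txg_history
    cat /sys/module/zfs/parameters/metaslab_debug_unload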

    I am quite worried about the low performance.

    Any hints or suggestions that may help me improve the situation?


    thank you


    Rick


_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
