On 22/07/2016 14:10, Nick Fisk wrote:

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 11:19
*To:* n...@fisk.me.uk; 'Jake Young' <jak3...@gmail.com>; 'Jan Schermer' <j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 11:48, Nick Fisk wrote:

    *From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
    *Sent:* 22 July 2016 10:40
    *To:* n...@fisk.me.uk; 'Jake Young' <jak3...@gmail.com>; 'Jan Schermer' <j...@schermer.cz>
    *Cc:* ceph-users@lists.ceph.com
    *Subject:* Re: [ceph-users] ceph + vmware

    On 22/07/2016 10:23, Nick Fisk wrote:

        *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Frédéric Nass
        *Sent:* 22 July 2016 09:10
        *To:* n...@fisk.me.uk; 'Jake Young' <jak3...@gmail.com>; 'Jan Schermer' <j...@schermer.cz>
        *Cc:* ceph-users@lists.ceph.com
        *Subject:* Re: [ceph-users] ceph + vmware

        On 22/07/2016 09:47, Nick Fisk wrote:

            *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Frédéric Nass
            *Sent:* 22 July 2016 08:11
            *To:* Jake Young <jak3...@gmail.com>; Jan Schermer <j...@schermer.cz>
            *Cc:* ceph-users@lists.ceph.com
            *Subject:* Re: [ceph-users] ceph + vmware

            On 20/07/2016 21:20, Jake Young wrote:



                On Wednesday, July 20, 2016, Jan Schermer <j...@schermer.cz> wrote:


                    > On 20 Jul 2016, at 18:38, Mike Christie
                    > <mchri...@redhat.com> wrote:
                    >
                    > On 07/20/2016 03:50 AM, Frédéric Nass wrote:
                    >>
                    >> Hi Mike,
                    >>
                    >> Thanks for the update on the RHCS iSCSI target.
                    >>
                    >> Will the RHCS 2.1 iSCSI target be compliant with the
                    >> VMware ESXi client ? (or is it too early to say / announce).
                    >
                    > No HA support for sure. We are looking into non-HA
                    > support though.
                    >
                    >>
                    >> Knowing that an HA iSCSI target was on the roadmap, we
                    >> chose iSCSI over NFS, so we'll just have to remap RBDs
                    >> to RHCS targets when it's available.
                    >>
                    >> So we're currently running :
                    >>
                    >> - 2 LIO iSCSI targets exporting the same RBD images.
                    >> Each iSCSI target has all VAAI primitives enabled and
                    >> runs the same configuration.
                    >> - RBD images are mapped on each target using the kernel
                    >> client (so no RBD cache).
                    >> - 6 ESXi hosts. Each ESXi can access the same LUNs
                    >> through both targets, but in a failover manner, so that
                    >> each ESXi always accesses the same LUN through one
                    >> target at a time.
                    >> - LUNs are VMFS datastores and VAAI primitives are
                    >> enabled client side (except UNMAP, as per default).
                    >>
                    >> Do you see anything risky regarding this configuration ?
                    >
                    > If you use an application that uses SCSI persistent
                    > reservations then you could run into trouble, because
                    > some apps expect the reservation info to be on the
                    > failover nodes as well as the active ones.
                    >
                    > Depending on how you do failover and the issue that
                    > caused the failover, IO could be stuck on the old active
                    > node and cause data corruption. If the initial active
                    > node loses its network connectivity and you fail over,
                    > you have to make sure that the initial active node is
                    > fenced off and that IO stuck on that node will never be
                    > executed. So do something like add it to the Ceph
                    > monitor blacklist and make sure IO on that node is
                    > flushed and failed before unblacklisting it.
                    >
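
                    (A minimal sketch of that blacklist-based fencing - the
                    gateway address below is a placeholder:)

                        # Fence the old active gateway so stuck IO on it can
                        # never reach the cluster again
                        ceph osd blacklist add 192.0.2.10

                        # List current blacklist entries
                        ceph osd blacklist ls

                        # Only once the gateway's pending IO has been flushed
                        # and failed locally, remove the entry
                        ceph osd blacklist rm 192.0.2.10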

                    With iSCSI you can't really do hot failover unless
                    you only use synchronous IO.

                VMware does indeed only use synchronous IO. Since the
                hypervisor can't tell what type of data the VMs are
                writing, all IO is treated as needing to be synchronous.

                    (With any of the open-source target softwares
                    available.) Flushing the buffers doesn't really help
                    because you don't know which in-flight IO happened
                    before the outage and which didn't. You could end up
                    with only part of the "transaction" written to
                    persistent storage.

                    If you only use synchronous IO all the way from the
                    client to the persistent storage shared between the
                    iSCSI targets, then all should be fine; otherwise
                    YMMV - some people run it like that without realizing
                    the dangers and have never had a problem, so it may
                    be strictly theoretical. It all depends on how often
                    you need to fail over and what data you are storing -
                    corrupting a few images on a gallery site could be
                    fine, but corrupting a large database tablespace is
                    no fun at all.

                No, it's not. VMFS corruption is pretty bad too and
                there is no fsck for VMFS...


                    Some (non-open-source) solutions exist; Solaris
                    supposedly does this in some(?) way - maybe some
                    iSCSI guru can chime in and tell us what magic they
                    do - but I don't think it's possible without client
                    support (you essentially have to do something like
                    transactions and replay the last transaction on
                    failover). Maybe something can be enabled in the
                    protocol to make the iSCSI IO synchronous, or at
                    least make it wait for some sort of ACK from the
                    server (which would require some sort of cache
                    mirroring between the targets), without making it
                    synchronous all the way.

                This is why the SAN vendors wrote their own clients
                and drivers. It is not possible to dynamically make
                all OS's do what your iSCSI target expects.

                Something like VMware does the right thing pretty much
                all the time (there are some iSCSI initiator bugs in
                earlier ESXi 5.x).  If you have control of your ESXi
                hosts then attempting to set up HA iSCSI targets is
                possible.

                If you have a mixed client environment with various
                versions of Windows connecting to the target, you may
                be better off buying some SAN appliances.


                    The one time I had to use it, I resorted to simply
                    mirroring via mdraid on the client side over two
                    targets sharing the same DAS. This worked fine
                    during testing but never went to production in the
                    end.

                    Jan

                    >
                    >>
                    >> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
                    >> clients ?
                    >
                    > I can't say, because I have not used stgt with rbd bs-type support enough.

                For starters, STGT doesn't implement VAAI properly and
                you will need to disable VAAI in ESXi.
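
                (For reference, VAAI is disabled per ESXi host through the
                advanced settings, roughly like this - a sketch, double-check
                the option names on your ESXi version:)

                    # 0 = off, 1 = on
                    esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
                    esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
                    esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking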

                LIO does seem to implement VAAI properly, but
                performance is not nearly as good as STGT even with
                VAAI's benefits. The assumed cause is that LIO currently
                uses kernel rbd mapping, and kernel rbd performance is
                not as good as librbd.

                I recently did a simple test of creating an 80GB eager
                zeroed disk with STGT (VAAI disabled, no rbd client
                cache) and LIO (VAAI enabled) and found that STGT was
                actually slightly faster.

                I think we're all holding our breath waiting for LIO
                librbd support via TCMU, which seems to be right
                around the corner. That solution will combine the
                performance benefits of librbd with the more
                feature-full LIO iSCSI interface. The lrbd
                configuration tool for LIO from SUSE is pretty cool
                and it makes configuring LIO easier than STGT.
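
                (To make the "kernel rbd mapping" path concrete, this is
                roughly how a kernel-mapped RBD ends up behind LIO today -
                pool, image and IQN names are made-up examples:)

                    # Map the image with the kernel client (librbd and its
                    # cache are not involved on this path)
                    rbd map rbd/vmware-lun0        # creates e.g. /dev/rbd0

                    # Export the mapped device through LIO
                    targetcli /backstores/block create name=vmware-lun0 dev=/dev/rbd0
                    targetcli /iscsi create iqn.2016-07.com.example:vmware-target
                    targetcli /iscsi/iqn.2016-07.com.example:vmware-target/tpg1/luns create /backstores/block/vmware-lun0
                    targetcli saveconfig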


            Hi Jake,

            The problem we're facing with LIO is that it has ESXi hosts
            disconnecting from vCenter regularly. This is a result of
            the iSCSI datastore becoming unreachable.
            It happens randomly - last time with almost no VM activity
            at all (only 6 VMs in the lab), right when ESXi requested a
            write to the '.iormstats.sf' file, which I suppose is
            related to Storage I/O Control, but I'm not sure of that.

            Setting VMFS3.UseATSForHBOnVMFS5 to 0 didn't help.
            Restarting the LIO target almost instantly solves it.
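
            (For the record, that setting is toggled per ESXi host like
            this - a sketch, verify the option path on your build:)

                # Fall back from ATS to plain SCSI reads/writes for the VMFS heartbeat
                esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
                # Check the current value
                esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5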

            Has any one of you ever encountered this issue with the LIO target ?

            Yes, this is a currently known problem that will hopefully
            be resolved soon. When there is a delay servicing IO, ESXi
            asks the target to cancel (abort) the IO and LIO tries to do
            this, but from what I understand, RBD doesn't have the API
            to allow LIO to reach into the Ceph cluster and cancel the
            in-flight IO. LIO responds back saying it can't do this, and
            then ESXi asks again. And so LIO and ESXi enter a loop
            forever.
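
            (The tell-tale sign of that loop is the repeating failed
            TaskMgmt abort in the ESXi log, like the line quoted further
            down in this thread; a quick way to check on a host:)

                grep "TaskMgmt abort" /var/log/vmkernel.log | tail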

        Hi Nick,

        Thanks for this explanation.

        Are you aware of any workaround or ESXi initiator option to
        tweak (like an I/O timeout value) to avoid that ?

        Or does this make the LIO target unusable with ESXi as of now ?

        Is STGT also affected or does it respond better with the rbd
        (librbd) backstore ?
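
        (For context, the rbd backstore mentioned here is tgt's
        librbd-based bs_rbd; a target is declared roughly like this -
        pool/image and IQN are made-up, and it assumes tgt was built
        with rbd support:)

            # /etc/tgt/targets.conf -- LUN served through librbd, no kernel mapping
            <target iqn.2016-07.com.example:stgt-rbd>
                driver iscsi
                bs-type rbd
                backing-store rbd/vmware-lun0
            </target>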

        Check out my response in this thread:

        http://ceph-users.ceph.narkive.com/JFwme605/suse-enterprise-storage3-rbd-lio-vmware-performance-bad


    Nick,

    What a great post (#5) ! :-)

    It clearly states what I'm hitting with LIO (vmkernel.log) :
    2016-07-21T07:33:38.544Z cpu26:386324)WARNING: ScsiPath: 7154: Set
    retry timeout for failed TaskMgmt abort for CmdSN  0x0, status
    Failure, path vmhba40:C2:T1:L0

    Have you tried STGT (with the rbd backstore) ? I'll give SCST a try...

    Yep, but see my point about being unable to stop the target when
    there is ongoing IO. This makes clustering hard, as you have to
    start adding resource agents that block/manipulate TCP packets to
    drain iSCSI connections. I gave up trying to get it to work 100%
    reliably.



    When you say 'NFS is very easy to configure for HA', how so ?
    I thought it was something hard to achieve, involving clustering
    software such as Corosync, Pacemaker, DRBD or GFS2. Am I missing
    something ? (NFS-Ganesha ?)

    Easy compared to iSCSI. Yes, you have to use pacemaker/corosync,
    but that’s the easy part of the whole process.
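
    (A rough sketch of such an active/passive NFS gateway with pcs -
    every name, IP and path is a made-up example, and it assumes the
    rbd resource agent shipped with Ceph is available as ocf:ceph:rbd:)

        # Map the RBD, mount it, export it and float a VIP over it,
        # all in one ordered/colocated Pacemaker group
        pcs resource create rbd-vmstore ocf:ceph:rbd pool=rbd name=vmstore \
            op monitor interval=10s
        pcs resource create fs-vmstore ocf:heartbeat:Filesystem \
            device=/dev/rbd/rbd/vmstore directory=/export/vmstore fstype=xfs
        pcs resource create nfs-server ocf:heartbeat:nfsserver \
            nfs_shared_infodir=/export/vmstore/nfsinfo
        pcs resource create nfs-export ocf:heartbeat:exportfs \
            directory=/export/vmstore clientspec=192.0.2.0/24 \
            options=rw,no_root_squash fsid=1
        pcs resource create nfs-vip ocf:heartbeat:IPaddr2 ip=192.0.2.100 cidr_netmask=24
        pcs resource group add nfs-gateway rbd-vmstore fs-vmstore nfs-server nfs-export nfs-vip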


Ok. So this would be an active / passive scenario, right ?

The hard part seems to be setting up the right fencing with the right commands on each NFS node. :-/ It's not really clear to me whether an active NFS server under load will agree to shut down gracefully, so that you can unmap the RBD without fear and have it remapped on the other node.

Frederic.

That’s where stonith comes into play. If the resource ever gets into a state where it can’t stop, it will be marked unclean and then stonith will reboot the node to resolve the situation.
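
(A minimal STONITH sketch with IPMI fence devices - node names,
addresses and credentials are placeholders:)

    # One fence device per NFS gateway; a node whose resources cannot stop gets power-cycled
    pcs stonith create fence-nfs1 fence_ipmilan pcmk_host_list=nfs1 \
        ipaddr=192.0.2.201 login=admin passwd=secret lanplus=1
    pcs stonith create fence-nfs2 fence_ipmilan pcmk_host_list=nfs2 \
        ipaddr=192.0.2.202 login=admin passwd=secret lanplus=1
    pcs property set stonith-enabled=true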




And then ESXi would drop connections, follow the VIP moving to the other NFS gateway, and resume its workload without pain ? What about NFS locks ?

Frederic.


    There’s a lot that can go wrong doing clustered iSCSI, whereas I
    have found NFS to be much simpler. ESXi seems to handle NFS failure
    better. With iSCSI, unless you catch it quickly, everything goes
    APD/PDL and you end up with all sorts of problems. NFS seems to be
    able to disappear and then pop back with no drama, from what I have
    seen so far.



    Again, thanks for your help,

    Frederic.





_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
