Hi Blair!

On 9 September 2014 08:47, Blair Bethwaite <[email protected]> wrote:
> Hi Dan,
>
> Thanks for sharing!
>
> On 9 September 2014 20:12, Dan Van Der Ster <[email protected]> wrote:
>> We do this for some small scale NAS use-cases, with ZFS running in a VM with
>> rbd volumes. The performance is not great (especially since we throttle the
>> IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL -
>> the SSD solves any performance problem we ever had with ZFS on RBD.
>
> That's good to hear. My limited experience doing this on a smaller Ceph
> cluster (and without any SSD journals or cache devices for ZFS
> head) points to write latency being an immediate issue, decent PCIe SLC SSD
> devices should pretty much sort that out given the cluster itself has plenty
> of write throughput available. Then there's further MLC devices for L2ARC -
> not sure yet but guessing metadata heavy datasets might require
> primarycache=metadata and rely on L2ARC for data cache. And all this should
> get better in the medium term with performance improvements and RDMA
> capability (we're building this with that option in the hole).
>
I'd love to go back and forth with you privately or on one of the ZFS mailing lists if you want to discuss ZFS tuning in depth, but I just want to mention that setting primarycache=metadata will also cause the L2ARC to store and accelerate ONLY metadata (regardless of what secondarycache is set to). I believe the ZFS developers are looking to improve this eventually, but that's how it currently works: the L2ARC only contains what was pushed out of the main in-memory ARC.

>> I would say though that this setup is rather adventurous. ZoL is not rock
>> solid - we’ve had a few lockups in testing, all of which have been fixed in
>> the latest ZFS code in git (my colleague in CC could elaborate if you’re
>> interested).
>
> Hmm okay, that's not great. The only problem I've experienced thus far is
> when the ZoL repos stopped providing DKMS and borked an upgrade for me until
> I figured out what had happened and cleaned up the old .ko files. So yes,
> interested to hear elaboration on that.
>

You mentioned in one of your other emails that if you deployed this idea of a ZFS NFS server, you'd do it inside a KVM VM and make use of librbd rather than krbd. If you're worried about ZoL stability and feel comfortable going outside Linux, you could always go with a *BSD or Illumos distro, where ZFS support is much more stable. In any case, I haven't had any major show-stopping issues with ZoL myself, and I use it heavily. Still, unless you're really comfortable with ZoL or *BSD/Illumos (as I am), I'd likely recommend looking into other solutions.

>> One thing I’m not comfortable with is the idea of ZFS checking the data in
>> addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but
>> without any redundancy at the ZFS layer there will be no way to correct that
>> error. Of course, the hope is that RADOS will ensure 100% data consistency,
>> but what happens if not?...
>
> The ZFS checksumming would tell us if there has been any corruption, which as
> you've pointed out shouldn't happen anyway on top of Ceph.

Just want to quickly address this (someone correct me if I'm wrong): IIRC, even with a replica count of 3 or more, Ceph does not (currently) apply any intelligence when it detects a corrupted/"incorrect" PG. It always repairs the PG with whatever data is in the primary, meaning that if the primary copy is the one that's corrupted/bit-rotted/"incorrect", it will replace the good replicas with the bad.

> But if we did have some awful disaster scenario where that happened then we'd
> be restoring from tape, and it'd sure be good to know which files actually
> needed restoring. I.e., if we lost a single PG at the Ceph level then we
> don't want to have to blindly restore the whole zpool or dataset.
>
>> Personally, I think you’re very brave to consider running 2PB of ZoL on RBD.
>> If I were you I would seriously evaluate the CephFS option. It used to be on
>> the roadmap for ICE 2.0 coming out this fall, though I noticed it's not there
>> anymore (??!!!).
>
> Yeah, it's very disappointing that this was silently removed. And it's
> particularly concerning that this happened post RedHat acquisition.
> I'm an ICE customer and sure would have liked some input there for exactly
> the reason we're discussing.
>

I'm looking forward to CephFS as well, and I agree, it's somewhat concerning that it happened post RedHat acquisition. I'm hoping RedHat pours more resources into InkTank and Ceph rather than leeching resources away from them.

>> Anyway I would say that ZoL on kRBD is not necessarily a more stable
>> solution than CephFS. Even Gluster striped on top of RBD would probably be
>> more stable than ZoL on RBD.
>
> If we really have to we'll just run Gluster natively instead (or perhaps XFS
> on RBD as the option before that) - the hardware needn't change for that
> except to configure RAIDs rather than JBODs on the servers.

Really, I would look into RBD-backed HA NFS solutions like the one Christian Balzer brought up in one of the previous emails. I'm sure setting up a couple of librbd-backed KVM VMs in an Active/Passive or Active+Passive/Passive+Active NFS configuration wouldn't be too hard, and it would likely be the more stable solution.

>
> --
> Cheers,
> ~Blairo

Cheers

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
