Hi Blair!

On 9 September 2014 08:47, Blair Bethwaite <[email protected]> wrote:
> Hi Dan,
>
> Thanks for sharing!
>
> On 9 September 2014 20:12, Dan Van Der Ster <[email protected]> wrote:
>> We do this for some small scale NAS use-cases, with ZFS running in a VM with 
>> rbd volumes. The performance is not great (especially since we throttle the 
>> IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL — 
>> the SSD solves any performance problem we ever had with ZFS on RBD.
>
> That's good to hear. My limited experience doing this on a smaller Ceph 
> cluster (and without any SSD journals or cache devices for ZFS
> head) points to write latency being an immediate issue; decent PCIe SLC SSD 
> devices should pretty much sort that out given the cluster itself has plenty 
> of write throughput available. Then there's further MLC devices for L2ARC - 
> not sure yet but guessing metadata heavy datasets might require 
> primarycache=metadata and rely on L2ARC for data cache. And all this should 
> get better in the medium term with performance improvements and RDMA 
> capability (we're building this with that option in the hole).
>

I'd love to go back and forth with you privately or on one of the ZFS 
mailing lists if you want to discuss ZFS tuning in depth, but I just want to 
mention that setting primarycache=metadata will also cause the L2ARC to store 
and accelerate ONLY metadata (regardless of what secondarycache is set to). I 
believe this is something the ZFS developers are looking to improve 
eventually, but that's how it currently works: the L2ARC only contains what 
was pushed out of the main in-memory ARC.
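
To make the interaction concrete, here is a toy model (not ZFS code) of the 
behaviour described above. It assumes the L2ARC is fed only from blocks that 
were first held in the ARC, so what it can hold is the overlap of the two 
property settings. The function name l2arc_contents is made up for this 
illustration; the property values all/metadata/none are the real ones.

```python
# Toy model, not ZFS code: assumes the L2ARC is filled only from buffers
# that were already in the in-memory ARC, so it can never hold a class of
# block that primarycache keeps out of the ARC.
CLASSES = {
    "all": {"data", "metadata"},
    "metadata": {"metadata"},
    "none": set(),
}


def l2arc_contents(primarycache, secondarycache):
    """Return the block classes that can end up in the L2ARC (hypothetical helper)."""
    in_arc = CLASSES[primarycache]
    allowed_in_l2arc = CLASSES[secondarycache]
    return in_arc & allowed_in_l2arc


# primarycache=metadata with secondarycache=all still caches metadata only.
print(l2arc_contents("metadata", "all"))  # {'metadata'}
```

So under this model, getting file data into the L2ARC requires primarycache=all.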

>> I would say though that this setup is rather adventurous. ZoL is not rock 
>> solid — we’ve had a few lockups in testing, all of which have been fixed in 
>> the latest ZFS code in git (my colleague in CC could elaborate if you’re 
>> interested).
>
> Hmm okay, that's not great. The only problem I've experienced thus far is 
> when the ZoL repos stopped providing DKMS and borked an upgrade for me until 
> I figured out what had happened and cleaned up the old .ko files. So yes, 
> interested to hear elaboration on that.
>

You mentioned in one of your other emails that if you deployed this idea of a 
ZFS NFS server, you'd do it inside a KVM VM and make use of librbd rather than 
krbd. If you're worried about ZoL stability and feel comfortable going outside 
Linux, you could always go with a *BSD or Illumos distro where ZFS support is 
much more stable/solid. 
In any case I haven't had any major show-stopping issues with ZoL myself, and 
I use it heavily. Still, unless you're really comfortable with ZoL or 
*BSD/Illumos (as I am), I'd likely recommend looking into other solutions.

>> One thing I’m not comfortable with is the idea of ZFS checking the data in 
>> addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but 
>> without any redundancy at the ZFS layer there will be no way to correct that 
>> error. Of course, the hope is that RADOS will ensure 100% data consistency, 
>> but what happens if not?...
> 
> The ZFS checksumming would tell us if there has been any corruption, which as 
> you've pointed out shouldn't happen anyway on top of Ceph.

Just want to quickly address this (someone correct me if I'm wrong): IIRC, 
even with a replica count of 3 or more, Ceph does not (currently) have any 
intelligence when it detects a corrupted/"incorrect" PG. It will always 
replace/repair the PG with whatever data is in the primary, meaning that if 
the primary copy is the one that's corrupted/bit-rotted/"incorrect", it will 
replace the good replicas with the bad.
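
A toy sketch of that failure mode, assuming repair really does copy the 
primary over the replicas as I recall (this is not Ceph code, and 
repair_from_primary is a made-up name). It also shows why a checksum kept by 
a layer above, such as ZFS, would still notice the damage even though it 
could not fix it without its own redundancy.

```python
import zlib


def repair_from_primary(copies):
    """Overwrite every replica with the primary's copy (index 0). Hypothetical helper."""
    return [copies[0]] * len(copies)


good = b"payload"
bad = b"pAyload"  # simulated bit rot: one flipped bit in the second byte

# Stands in for a checksum stored by the layer above (e.g. ZFS).
expected_crc = zlib.crc32(good)

replicas = [bad, good, good]  # the primary holds the damaged copy
repaired = repair_from_primary(replicas)

# Every copy now matches the damaged primary; the good replicas are gone.
print(all(copy == bad for copy in repaired))  # True

# The independent checksum still flags the object as corrupt.
print(zlib.crc32(repaired[0]) == expected_crc)  # False
```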

> But if we did have some awful disaster scenario where that happened then we'd 
> be restoring from tape, and it'd sure be good to know which files actually 
> needed restoring. I.e., if we lost a single PG at the Ceph level then we 
> don't want to have to blindly restore the whole zpool or dataset.
>
>> Personally, I think you’re very brave to consider running 2PB of ZoL on RBD. 
>> If I were you I would seriously evaluate the CephFS option. It used to be on 
>> the roadmap for ICE 2.0 coming out this fall, though I noticed it's not there 
>> anymore (??!!!).
>
> Yeah, it's very disappointing that this was silently removed. And it's 
> particularly concerning that this happened post Red Hat acquisition.
> I'm an ICE customer and sure would have liked some input there for exactly 
> the reason we're discussing.
>

I'm looking forward to CephFS as well, and I agree, it's somewhat concerning 
that it happened post Red Hat acquisition. I'm hoping Red Hat pours more 
resources into Inktank and Ceph, rather than leeching resources away from them.

>> Anyway I would say that ZoL on kRBD is not necessarily a more stable 
>> solution than CephFS. Even Gluster striped on top of RBD would probably be 
>> more stable than ZoL on RBD.
>
> If we really have to we'll just run Gluster natively instead (or perhaps XFS 
> on RBD as the option before that) - the hardware needn't change for that 
> except to configure RAIDs rather than JBODs on the servers.

Really, I would look into RBD-backed HA NFS solutions like the one Christian 
Balzer brought up in one of the previous emails. I'm sure setting up a couple 
of librbd-backed KVM VMs in an Active/Passive or Active+Passive/Passive+Active 
type NFS solution wouldn't be too hard, and it would likely be the more stable 
solution.

>
> --
> Cheers,
> ~Blairo

Cheers
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com