As I parse Youssef’s message, I believe there are some misconceptions.  It 
might help if you could give a bit more info on what your existing ‘cluster’ is 
running.  NFS? CIFS/SMB?  Something else?

1) Ceph regularly runs scrubs to ensure that all copies of data are consistent; 
deep scrubs go further and compare the data itself, not just object metadata. 
The application-level checksumming that you describe would be both infeasible 
at this scale and redundant.
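Concretely, scrubbing can be watched and driven from the CLI. A short sketch 
(these commands assume admin access to a running cluster; the placement-group 
id 1.2f is a placeholder, not anything from your setup):

```shell
# Kick off a deep scrub of one placement group; a deep scrub reads
# every replica of the data and compares them, not just the metadata.
ceph pg deep-scrub 1.2f

# Afterwards, list any objects the scrub flagged as inconsistent.
rados list-inconsistent-obj 1.2f

# Last scrub / deep-scrub timestamps for every PG appear in the pg dump.
ceph pg dump | less
```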

2) It sounds as though your current back end stores user files as-is and is 
either a traditional file-server setup or perhaps a virtual filesystem 
aggregating multiple filesystems.  Ceph is not a file storage solution in this 
sense.  Your message below suggests that you want each user file kept whole on 
a single drive rather than sharded across multiple servers.  That is 
antithetical to how Ceph works and counter to data durability and 
availability, unless there is some replication that you haven’t described.  
See this diagram:

http://docs.ceph.com/docs/master/_images/stack.png

Under the hood, Ceph operates internally on ‘objects’ that are not exposed to 
clients as such.  There are several different client interfaces built on top 
of this object store (RADOS):

- RBD volumes — think in terms of a virtual disk drive attached to a VM
- RGW — like Amazon S3 or Swift
- CephFS — provides a mountable filesystem interface, somewhat like NFS or even 
SMB but with important distinctions in behavior and use-case
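To make those three concrete, here is roughly what each looks like from a 
client.  The pool, bucket, image, and host names are placeholders, and the 
s3cmd invocation assumes the tool has already been configured to point at an 
RGW endpoint:

```shell
# RBD: create a 1 GiB virtual disk image in the 'rbd' pool and map it
# as a local block device (typically then formatted or handed to a VM).
rbd create rbd/vmdisk1 --size 1024
rbd map rbd/vmdisk1

# RGW: S3-style object access through the gateway.
s3cmd put ./example.warc.gz s3://preservation-bucket/example.warc.gz

# CephFS: mount a POSIX-style filesystem backed by the cluster.
ceph-fuse -m mon1.example.com:6789 /mnt/cephfs
```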

I had not heard of iRODS before, but I just looked it up.  It is a very 
different thing from any of the common interfaces to Ceph.

If your users need to mount the storage as a share / volume, in the sense of 
SMB or NFS, then Ceph may not be your best option.  If they can cope with an S3 
/ Swift type REST object interface, a cluster with RGW interfaces might do the 
job, or perhaps standalone Swift or Gluster.  It’s hard to say for sure 
without knowing more about what you actually need.
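On the checksum question in the quoted message below: if you did go the CephFS 
route, the same find/xargs pattern would run unchanged against the mount 
point, though the files would no longer live on each node’s local drives, so 
the CPU locality you get from dsh is lost.  A minimal stand-in using a 
throwaway local directory (the paths and file names are invented for 
illustration):

```shell
set -eu

# Stand-in for a CephFS mount such as /mnt/cephfs: a throwaway
# directory containing a couple of fake .warc.gz files.
root=$(mktemp -d)
mkdir -p "$root/items"
printf 'first'  > "$root/items/a.warc.gz"
printf 'second' > "$root/items/b.warc.gz"

# Same shape as the dsh one-liner, minus the per-node fan-out:
# walk the tree and write one checksum manifest.
find "$root/items" -name '*.warc.gz' | xargs md5sum > "$root/manifest.md5sum"

cat "$root/manifest.md5sum"   # two lines: '<md5>  <path>'
```

With RGW instead, you would iterate over bucket listings rather than walk a 
directory tree.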

— Anthony


> We currently run a commodity cluster that supports a few petabytes of data. 
> Each node in the cluster has 4 drives, currently mounted as /0 through /3. We 
> have been researching alternatives for managing the storage, Ceph being one 
> possibility, iRODS being another. For preservation purposes, we would like 
> each file to exist as one whole piece per drive (as opposed to being striped 
> across multiple drives). It appears this is the default in Ceph.
> 
> Now, it has always been convenient for us to run distributed jobs over SSH 
> to, for instance, compile a list of checksums of all files in the cluster:
> 
> dsh -Mca 'find /{0..3}/items -name \*.warc.gz | xargs md5sum >/tmp/$HOSTNAME.md5sum'
> 
> And that nicely allows each node to process its own files using the local CPU.
> 
> Would this scenario still be possible where Ceph is managing the storage?
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
