Hi Christian,


> > Hi David,
> >
> > The planned usage for this CephFS cluster is scratch space for an image
> > processing cluster with 100+ processing nodes.
>
> Lots of clients. How much data movement would you expect, how many images
> come in per timeframe, let's say an hour?
> Typical size of an image?
>
> Does an image come in and then gets processed by one processing node?
> Unlikely to be touched again, at least in the short term?
> Probably being deleted after being processed?
>

We'd typically get up to 6TB of raw imagery per day at an average image
size of 20MB.  There's a complex multi-stage processing chain: images are
typically read by multiple nodes, with intermediate data generated and
processed again by multiple nodes.  This generates about 30TB of
intermediate data, and the end result is around 9TB of final processed
data.  Once processing is complete and the final data has been copied off
and passed QA, the entire data set is deleted.  Data sets can remain on
the file system for up to 2 weeks before deletion.
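To put rough numbers on what that implies for the cluster (back-of-envelope
only; I'm assuming the 30TB/9TB figures apply to each daily batch and a 3x
replicated data pool, both of which are my assumptions):

```python
# Back-of-envelope sizing for the scratch CephFS described above.
# Assumes the 30 TB intermediate / 9 TB final figures are per daily batch
# and a 3x replicated data pool; both are assumptions, not stated facts.
TB = 10**12
MB = 10**6

raw_per_day = 6 * TB          # raw imagery ingested per day
image_size = 20 * MB          # average image size
intermediate = 30 * TB        # intermediate data per batch
final = 9 * TB                # final processed data per batch

images_per_day = raw_per_day // image_size
ingest_mb_s = raw_per_day / 86400 / MB          # average raw ingest bandwidth
batch_write = raw_per_day + intermediate + final
write_mb_s = batch_write / 86400 / MB           # average total write bandwidth

retention_days = 14
live_capacity_tb = batch_write * retention_days / TB
raw_capacity_tb = live_capacity_tb * 3          # with 3x replication

print(images_per_day)       # 300000 images/day, ~12500/hour
print(round(ingest_mb_s))   # ~69 MB/s average raw ingest
print(round(write_mb_s))    # ~521 MB/s average total writes
print(live_capacity_tb)     # 630 TB usable for 2 weeks' retention
```

These are averages only; the multi-stage processing chain will make the
actual read/write pattern far burstier than this.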



> >  My thinking is we'd be
> > better off with a large number (100+) of storage hosts with 1-2 OSD's
> each,
> > rather than 10 or so storage nodes with 10+ OSD's to get better
> parallelism
> > but I don't have any practical experience with CephFS to really judge.
> CephFS is one thing (of which I have very limited experience), but at this
> point you're talking about parallelism in Ceph (RBD).
> And that happens much more on an OSD than host level.
>
> Which you _can_ achieve with larger nodes, if they're well designed.
> Meaning CPU/RAM/internal storage bandwidth/network bandwidth being in
> "harmony".
>

I'm not sure what you mean about the RBD reference.  Does CephFS use RBD
internally?


>
> Also you keep talking about really huge HDDs, you could do worse than
> halving their size and doubling their numbers to achieve much more
> bandwidth and the ever crucial IOPS (even in your use case).
>
> So something like 20x 12-HDD servers, with SSDs/NVMes for journal/BlueStore
> WAL/DB if you can afford or actually need it.
>
> CephFS metadata on an SSD pool isn't the most dramatic improvement one can
> make (or so people tell me), but given your budget it may be worthwhile.
>
>
Yes, I totally get the benefits of using greater numbers of smaller HDDs.
One of the requirements is to keep $/TB low, and large-capacity drives help
with that.  I guess we need to look at the trade-off of $/TB vs. number of
spindles for performance.
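As a trivial sketch of that trade-off (all prices and per-drive IOPS below
are made-up placeholders, not quotes):

```python
# Hypothetical trade-off: fewer big drives vs. more small drives for the
# same usable capacity. Prices and IOPS figures are placeholders only.
drives = {
    "10TB": {"price_usd": 320.0, "tb": 10, "iops": 80},  # 7200rpm class
    "5TB":  {"price_usd": 190.0, "tb": 5,  "iops": 80},
}

target_tb = 600  # usable capacity target, before replication

results = {}
for name, d in drives.items():
    count = target_tb // d["tb"]
    results[name] = {
        "count": count,
        "cost_per_tb": count * d["price_usd"] / target_tb,
        "iops": count * d["iops"],
    }
    print(f"{name}: {count} spindles, "
          f"${results[name]['cost_per_tb']:.0f}/TB, "
          f"{results[name]['iops']} aggregate IOPS")
```

Same total capacity, double the aggregate spindle IOPS from the smaller
drives, at a somewhat higher $/TB; where the break-even sits depends
entirely on real pricing.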

If CephFS's parallelism happens more at the OSD level than the host level,
then perhaps 12-disk storage hosts would be fine, as long as
"mon_osd_down_out_subtree_limit = host" is set and there's enough
CPU/RAM/bus and network bandwidth in the host.  I'm doing some cost
comparisons of these "big" servers vs. multiple "small" servers such as the
Supermicro MicroCloud chassis or the Ambedded Mars 200 ARM cluster (which
looks very interesting).  However, cost is not the sole consideration, so
I'm hoping to get an idea of the performance differences between the two
architectures to help with the decision-making process, given the lack of
test equipment available.
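For reference, that option goes in ceph.conf on the monitors; a minimal
fragment (if I read the docs right, the default is "rack", and "host" stops
the cluster from automatically marking out every OSD of a down host at
once):

```ini
# ceph.conf fragment, sketch only
[global]
# Don't automatically mark OSDs "out" when an entire host goes down,
# so a rebooting 12-disk host doesn't trigger a full rebalance.
mon_osd_down_out_subtree_limit = host
```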



>
> > And
> > I don't have enough hardware to setup a test cluster of any significant
> > size to run some actual testing.
> >
> You may want to set up something to get a feeling for CephFS, if it's
> right for you or if something else on top of RBD may be more suitable.
>
>
I've set up a 3-node cluster (2 OSD servers and 1 mon/MDS) to get a feel
for Ceph and CephFS.  It looks pretty straightforward and performs well
enough given the small number of nodes.
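For anyone curious, once the OSDs and MDS are up, the CephFS part of a
minimal setup like that is roughly the following (pool names, PG counts,
and hostnames here are illustrative for a tiny cluster, not what anyone
should copy for production):

```shell
# Create data and metadata pools, then the file system itself.
# PG counts are sized for a toy 2-OSD-host cluster.
ceph osd pool create cephfs_data 64
ceph osd pool create cephfs_metadata 16
ceph fs new testfs cephfs_metadata cephfs_data   # metadata pool comes first

# Mount with the kernel client (admin keyring assumed under /etc/ceph)
mount -t ceph mon1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret
```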


Thanks,
Nick


> Christian
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]           Rakuten Communications
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
