Re: [ceph-users] anyone using CephFS for HPC?
I am currently implementing Ceph into our HPC environment to handle SAS temp workspace. I am starting out with 3 OSD nodes and 1 MON/MDS node, with 16 4TB HDDs and 4 120GB SSDs per OSD node. Each node has a 40Gb Mellanox interconnect to a Mellanox switch; each client node has 10Gb to the switch.

I have not done comparisons to Lustre, but I have done comparisons to PanFS, which we currently use in production. I have found that for most workflows Ceph is comparable to PanFS, if not better; however, PanFS still does better with small IO due to how it caches small files. If you want I can give you some hard numbers.

almightybeeij

On Fri, Jun 12, 2015 at 12:31 AM, Nigel Williams nigel.d.willi...@gmail.com wrote:
> Wondering if anyone has done comparisons between CephFS and other parallel filesystems like Lustre typically used in HPC deployments, either for scratch storage or persistent storage to support HPC workflows?
>
> thanks.
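For what it's worth, here is a minimal sketch of the kind of streaming vs. small-file comparison I'm describing. The mount point, file counts, and sizes below are placeholders, not our actual SAS jobs; tools like fio or rados bench will give more rigorous numbers, this just shows the shape of the two access patterns where PanFS's small-file caching helps.

    #!/usr/bin/env python
    # Rough sketch: one large sequential write vs. many small synced files
    # on a mounted filesystem. MOUNT is a hypothetical test directory.
    import os, time

    MOUNT = "/mnt/cephfs/benchtmp"   # placeholder path, point at your mount

    def stream_write(path, total_mb=1024, block_mb=4):
        """One large file written sequentially in 4 MB blocks."""
        buf = os.urandom(block_mb * 1024 * 1024)
        start = time.time()
        with open(path, "wb") as f:
            for _ in range(total_mb // block_mb):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        return total_mb / (time.time() - start)        # MB/s

    def small_file_write(dirpath, count=4096, size_kb=16):
        """Lots of small files, each fsynced -- the small-IO pattern."""
        buf = os.urandom(size_kb * 1024)
        start = time.time()
        for i in range(count):
            with open(os.path.join(dirpath, "f%06d" % i), "wb") as f:
                f.write(buf)
                os.fsync(f.fileno())
        return count / (time.time() - start)           # files/s

    if __name__ == "__main__":
        os.makedirs(MOUNT, exist_ok=True)
        print("sequential : %.1f MB/s" % stream_write(os.path.join(MOUNT, "big.dat")))
        print("small files: %.1f files/s" % small_file_write(MOUNT))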
Re: [ceph-users] anyone using CephFS for HPC?
Thanks for your info. I would like to know how large the I/O you mentioned was, and what kind of app you used to do the benchmarking?

Sincerely,
Kinjo

On Tue, Jun 16, 2015 at 12:04 AM, Barclay Jameson almightybe...@gmail.com wrote:
> I am currently implementing Ceph into our HPC environment to handle SAS temp workspace. I am starting out with 3 OSD nodes and 1 MON/MDS node, with 16 4TB HDDs and 4 120GB SSDs per OSD node. Each node has a 40Gb Mellanox interconnect to a Mellanox switch; each client node has 10Gb to the switch.
>
> I have not done comparisons to Lustre, but I have done comparisons to PanFS, which we currently use in production. I have found that for most workflows Ceph is comparable to PanFS, if not better; however, PanFS still does better with small IO due to how it caches small files. If you want I can give you some hard numbers.
>
> almightybeeij
>
> On Fri, Jun 12, 2015 at 12:31 AM, Nigel Williams nigel.d.willi...@gmail.com wrote:
>> Wondering if anyone has done comparisons between CephFS and other parallel filesystems like Lustre typically used in HPC deployments, either for scratch storage or persistent storage to support HPC workflows?
>>
>> thanks.

--
Life w/ Linux http://i-shinobu.hatenablog.com/
Re: [ceph-users] anyone using CephFS for HPC?
On 12/06/2015 3:41 PM, Gregory Farnum wrote:
> ... and the test evaluation was on repurposed Lustre hardware so it was a bit odd, ...

Agree, it was old (at least by now) DDN kit (SFA10K?) and not ideally suited for Ceph (really high OSD-per-host ratio).

> Sage's thesis or some of the earlier papers will be happy to tell you all the ways in which Ceph > Lustre, of course, since creating a successor is how the project started. ;)
> -Greg

Thanks Greg, yes those original documents have been well-thumbed; but I was hoping someone had done a more recent comparison, given the significant improvements over the last couple of Ceph releases.

My superficial poking about in Lustre doesn't reveal to me anything particularly compelling in the design or typical deployments that would magically yield higher performance than an equally well-tuned Ceph cluster. Blair Bethwaite commented that Lustre client-side write caching might be more effective than CephFS's at the moment.
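A crude, filesystem-agnostic way to eyeball what client-side write caching is worth is to time the same small writes from a client node buffered versus with O_DSYNC (every write forced out). The path and sizes below are placeholders, not a tuned benchmark, and O_DSYNC availability assumes a Linux client:

    #!/usr/bin/env python
    # Same 4 KB writes twice: once buffered (cacheable on the client),
    # once with O_DSYNC so every write must reach the servers. The gap
    # between the two numbers is roughly what write caching buys you.
    import os, time

    PATH = "/mnt/fs/writecache_test.dat"   # placeholder client mount
    BLOCK = b"x" * 4096
    COUNT = 20000

    def run(extra_flags):
        fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags)
        start = time.time()
        for _ in range(COUNT):
            os.write(fd, BLOCK)
        os.fsync(fd)                        # settle up at the end either way
        os.close(fd)
        return (COUNT * len(BLOCK) / 2**20) / (time.time() - start)  # MB/s

    if __name__ == "__main__":
        print("buffered (cacheable): %.1f MB/s" % run(0))
        print("O_DSYNC  (no cache) : %.1f MB/s" % run(os.O_DSYNC))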
Re: [ceph-users] anyone using CephFS for HPC?
On 06/14/2015 06:53 PM, Nigel Williams wrote:
> On 12/06/2015 3:41 PM, Gregory Farnum wrote:
>> ... and the test evaluation was on repurposed Lustre hardware so it was a bit odd, ...
>
> Agree, it was old (at least by now) DDN kit (SFA10K?) and not ideally suited for Ceph (really high OSD-per-host ratio).

FWIW, I did most of the performance work on the Ceph side for that paper. Let me know if you are interested in any of the details. It was definitely not ideal, though in the end we did relatively well, I think. Ultimately the lack of SSD journals hurt us, as we hit the IB limit to the SFA10K long before we hit the disk limits, and we were topping out at about 6-8GB/s for sequential reads when we should have been able to hit 12GB/s. We have seen some cases where filestore doesn't do large reads as quickly as you'd think (newstore seems to do better).

The big things that took a lot of effort to figure out during this testing were:

- General strangeness: cache mirroring on the SFA10K *really* hurting performance with Ceph (not sure why it didn't hurt Lustre as badly).
- Back around kernel 3.6 there were some nasty VM compaction issues that caused major performance problems.
- Somewhat strange mdtest results. Probably just issues in the MDS back then.

>> Sage's thesis or some of the earlier papers will be happy to tell you all the ways in which Ceph > Lustre, of course, since creating a successor is how the project started. ;)
>> -Greg
>
> Thanks Greg, yes those original documents have been well-thumbed; but I was hoping someone had done a more recent comparison, given the significant improvements over the last couple of Ceph releases.
>
> My superficial poking about in Lustre doesn't reveal to me anything particularly compelling in the design or typical deployments that would magically yield higher performance than an equally well-tuned Ceph cluster. Blair Bethwaite commented that Lustre client-side write caching might be more effective than CephFS's at the moment.

I suspect the big things are:

- Lustre doesn't do asynchronous replication (relies on hardware RAID).
- Lustre may have more tuning issues worked out.
- Lustre doesn't (last I checked) do full data journaling.

Frankly, a well-tuned Lustre configuration is going to do pretty well for large sequential IO. That's pretty much its bread and butter. At least historically it has not been great at small random IO, and most Lustre setups use some kind of STONITH setup for node outages, which is obviously not nearly as nice as Ceph is.
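For anyone who hasn't run it: mdtest is essentially a timed create/stat/unlink sweep over a large number of files, fanned out over MPI ranks. A toy single-client version, with a placeholder directory and without the MPI fan-out the real tool uses, looks roughly like this:

    #!/usr/bin/env python
    # Toy, single-process version of what mdtest measures: file create,
    # stat, and unlink rates over many small entries. TESTDIR is a
    # placeholder; the real mdtest runs this in parallel across clients.
    import os, time

    TESTDIR = "/mnt/cephfs/mdtest_toy"    # hypothetical CephFS mount
    NFILES = 10000
    paths = [os.path.join(TESTDIR, "f%06d" % i) for i in range(NFILES)]

    def timed(label, fn):
        start = time.time()
        fn()
        print("%-7s %8.0f ops/s" % (label, NFILES / (time.time() - start)))

    def create():
        for p in paths:
            open(p, "w").close()

    def stat():
        for p in paths:
            os.stat(p)

    def unlink():
        for p in paths:
            os.unlink(p)

    if __name__ == "__main__":
        os.makedirs(TESTDIR, exist_ok=True)
        timed("create", create)
        timed("stat", stat)
        timed("unlink", unlink)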
[ceph-users] anyone using CephFS for HPC?
Wondering if anyone has done comparisons between CephFS and other parallel filesystems like Lustre typically used in HPC deployments, either for scratch storage or persistent storage to support HPC workflows?

thanks.
Re: [ceph-users] anyone using CephFS for HPC?
On Thu, Jun 11, 2015 at 10:31 PM, Nigel Williams nigel.d.willi...@gmail.com wrote:
> Wondering if anyone has done comparisons between CephFS and other parallel filesystems like Lustre typically used in HPC deployments, either for scratch storage or persistent storage to support HPC workflows?

Oak Ridge had a paper at Supercomputing a couple of years ago about this from their perspective. I don't remember how many of its concerns are still up to date, and the test evaluation was on repurposed Lustre hardware so it was a bit odd, but it might give you some stuff to think about.

Sage's thesis or some of the earlier papers will be happy to tell you all the ways in which Ceph > Lustre, of course, since creating a successor is how the project started. ;)
-Greg