Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
On Mon, Mar 19, 2018 at 11:45 PM, Nicolas Huillard wrote:
> On Monday, 19 March 2018 at 15:30 +0300, Sergey Malinin wrote:
>> The default for mds_log_events_per_segment is 1024; in my setup I
>> ended up with 8192.
>> I calculated that value as IOPS / log segments * 5 seconds (AFAIK
>> the MDS performs journal maintenance once every 5 seconds by
>> default).
>
> I tried 4096 (up from the initial 1024), then 8192 at the time of
> your answer, then 16384, without much improvement...
>
> Then I tried reducing the number of MDSs from 4 to 1, which
> definitely works (sorry if my initial mail didn't make it clear that
> I was using many MDSs, even though it mentioned mds.2).
> I now have a low rate of metadata writes (40-50 kB/s), and the
> inter-DC link load reflects the size and direction of the actual
> data.
>
> I'll now try to reduce mds_log_events_per_segment back to its
> original value (1024), because performance is not optimal and
> stutters a bit too much.
>
> Thanks for your advice!

This seems like a load balancer bug. Improving the load balancer is at
the top of our todo list.

Regards
Yan, Zheng

> --
> Nicolas Huillard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
On Monday, 19 March 2018 at 19:15 +0300, Sergey Malinin wrote:
> Forgot to mention that in my setup the issue went away when I
> reverted back to a single MDS and switched dirfrag off.
> On Monday, 19 March 2018 at 18:45, Nicolas Huillard wrote:
>> Then I tried to reduce the number of MDS, from 4 to 1,

So it appears we had the same problem, and applied the same
solution ;-)

I reverted mds_log_events_per_segment back to 1024 without problems.
Bandwidth utilisation is OK, destination (single SATA disk) throughput
depends on file sizes (lots of tiny files = 1 MB/s; big files =
30 MB/s), and running 2 rsyncs in parallel only improves things.

Thanks!

--
Nicolas Huillard
Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
Forgot to mention that in my setup the issue went away when I reverted
back to a single MDS and switched dirfrag off.

On Monday, 19 March 2018 at 18:45, Nicolas Huillard wrote:
> Then I tried to reduce the number of MDS, from 4 to 1,
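A rough sketch of the revert described above, for a Luminous cluster with a filesystem named `cephfs`. The filesystem name, the rank number, and the availability of the deprecated `allow_dirfrags` toggle on Luminous are all assumptions, so check against your own cluster before running anything:

```shell
# Go back to a single active MDS (Luminous-era commands; on Luminous,
# extra active ranks had to be deactivated by hand after lowering
# max_mds):
ceph fs set cephfs max_mds 1
ceph mds deactivate cephfs:1   # repeat for each extra active rank

# Switch directory fragmentation off again (flag assumed to still be
# accepted, though deprecated, on Luminous):
ceph fs set cephfs allow_dirfrags false
```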
Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
On Monday, 19 March 2018 at 15:30 +0300, Sergey Malinin wrote:
> The default for mds_log_events_per_segment is 1024; in my setup I
> ended up with 8192.
> I calculated that value as IOPS / log segments * 5 seconds (AFAIK
> the MDS performs journal maintenance once every 5 seconds by
> default).

I tried 4096 (up from the initial 1024), then 8192 at the time of your
answer, then 16384, without much improvement...

Then I tried reducing the number of MDSs from 4 to 1, which definitely
works (sorry if my initial mail didn't make it clear that I was using
many MDSs, even though it mentioned mds.2).
I now have a low rate of metadata writes (40-50 kB/s), and the inter-DC
link load reflects the size and direction of the actual data.

I'll now try to reduce mds_log_events_per_segment back to its original
value (1024), because performance is not optimal and stutters a bit too
much.

Thanks for your advice!

--
Nicolas Huillard
Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
The default for mds_log_events_per_segment is 1024; in my setup I ended
up with 8192.
I calculated that value as IOPS / log segments * 5 seconds (AFAIK the
MDS performs journal maintenance once every 5 seconds by default).

On Monday, 19 March 2018 at 15:20, Nicolas Huillard wrote:
> I can't find any doc about that mds_log_events_per_segment setting,
> especially on how to choose a good value.
> Can you elaborate on "original value multiplied several times"?
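Written out, the heuristic above looks like this. The numbers are purely illustrative assumptions, not figures from the thread: a hypothetical sustained metadata rate of 48,000 events/s and 30 journal segments (the Luminous-era default for mds_log_max_segments):

```shell
# Rough sizing of mds_log_events_per_segment following the heuristic
# quoted above: events per segment ~= IOPS / segment count * interval,
# where the MDS is said to run journal maintenance about every 5 s.
iops=48000      # hypothetical metadata events/s during the rsync
segments=30     # assumed mds_log_max_segments value
interval=5      # assumed seconds between journal maintenance passes

events_per_segment=$(( iops / segments * interval ))
echo "mds_log_events_per_segment ~= ${events_per_segment}"
```

A result near 8000 would presumably then be rounded up to a power of two such as the 8192 mentioned above.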
Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
On Monday, 19 March 2018 at 10:01, Sergey Malinin wrote:
> I experienced the same issue and was able to reduce metadata writes
> by raising mds_log_events_per_segment to its original value
> multiplied several times.

I changed it from 1024 to 4096:
* rsync status (1 line per file) scrolls much quicker
* OSD writes on the dashboard are much lower than reads now (they were
  much higher before)
* metadata pool write rate is in the 20-800 kB/s range now, while
  metadata reads are in the 20-80 kB/s range
* data pool reads are in the hundreds of kB/s, which still seems very
  low
* destination disk write rate is a bit higher than the data pool read
  rate (expected for btrfs), but still low
* inter-DC network load is now 1-50 Mbps

I'll monitor the Munin graphs in the long run.

I can't find any doc about that mds_log_events_per_segment setting,
especially on how to choose a good value.
Can you elaborate on "original value multiplied several times"?

I'm just seeing more MDS_TRIM warnings now. Maybe restarting the MDSs
just delayed re-emergence of the initial problem.

> From: ceph-users on behalf of Nicolas Huillard
> Sent: Monday, March 19, 2018 12:01:09 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Huge amount of cephfs metadata writes while
> only reading data (rsync from storage, to single disk)

--
Nicolas Huillard
Associé fondateur - Directeur Technique - Dolomède
nhuill...@dolomede.fr
Fixe : +33 9 52 31 06 10
Mobile : +33 6 50 27 69 08
http://www.dolomede.fr/
https://reseauactionclimat.org/planetman/
http://climat-2020.eu/
http://www.350.org/
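The change described above (1024 to 4096) can be applied at runtime. A sketch, assuming the Luminous-era `injectargs` interface; the value is simply the one tried in this thread, not a recommendation:

```shell
# Raise mds_log_events_per_segment on all MDS daemons at runtime
# (takes effect immediately but is not persisted across restarts):
ceph tell mds.* injectargs '--mds_log_events_per_segment 4096'

# To persist the setting, it would also go into ceph.conf on the MDS
# hosts:
#   [mds]
#   mds_log_events_per_segment = 4096
```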
Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
I experienced the same issue and was able to reduce metadata writes by
raising mds_log_events_per_segment to its original value multiplied
several times.

From: ceph-users on behalf of Nicolas Huillard
Sent: Monday, March 19, 2018 12:01:09 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Huge amount of cephfs metadata writes while only
reading data (rsync from storage, to single disk)
Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
The MDS has to write to its local journal when clients open files, in
case of certain kinds of failures. I guess it doesn't distinguish
between read-only file opens (when it could *probably* avoid writing
them down, although it's not as simple a thing as it sounds) and
writeable ones. So every file you're opening requires the MDS to commit
to disk, and it apparently filled up its allowable MDS log size, and
now you're stuck on that inter-DC link.

A temporary workaround might be to just keep turning up the MDS log
sizes, but I'm sort of surprised it was absorbing stuff at a useful
rate before, so I don't know if changing those will help or not.
-Greg

On Mon, Mar 19, 2018 at 5:01 PM, Nicolas Huillard wrote:
> Hi all,
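Greg's suggestion to "keep turning up the mds log sizes" would, in Luminous-era terms, mean raising mds_log_max_segments; the 30 in the "Behind on trimming (64/30)" health message is presumably its default. A sketch with an arbitrary illustrative value, assuming the `injectargs` interface:

```shell
# Allow more journal segments before the MDS is flagged as behind on
# trimming (120 is an arbitrary example value, not a recommendation):
ceph tell mds.* injectargs '--mds_log_max_segments 120'
```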
[ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)
Hi all,

I'm experimenting with a new little storage cluster. I wanted to take
advantage of the week-end to copy all data (1 TB, 10 M objects) from
the cluster to a single SATA disk. I expected to saturate the SATA disk
while writing to it, but the storage cluster actually saturates its
network links, while barely writing to the destination disk (63 GB
written in 20 h, which is less than 1 MB/s).

Setup: 2 datacenters × 3 storage servers × 2 disks/OSDs each, Luminous
12.2.4 on Debian stretch, 1 Gbps shared network, 200 Mbps fibre link
between datacenters (12 ms latency). 4 clients using a single cephfs
storing data + metadata on the same spinning disks with bluestore.

Test: I'm using a single rsync on one of the client servers (the other
3 are just sitting there). rsync is local to the client, copying from
the cephfs mount (kernel client on 4.14 from stretch-backports, just to
use a potentially more recent cephfs client than stock 4.9) to the SATA
disk. The rsync'ed tree consists of lots of tiny files (1-3 kB) in deep
directory branches, along with some large files (10-100 MB) in a few
directories. There is no other activity on the cluster.

Observations: I initially saw write performance on the destination disk
from a few 100 kB/s (during exploration of branches with tiny files) to
a few 10 MB/s (while copying large files), essentially seeing the file
names scrolling at a relatively fixed rate, unrelated to their
individual size.
After 5 hours, the fibre link started to saturate at 200 Mbps, while
destination disk writes are down to a few 10 kB/s.

Using the dashboard, I see lots of metadata writes, at a 30 MB/s rate
on the metadata pool, which correlates with the 200 Mbps link rate.
It also shows regular "Health check failed: 1 MDSs behind on trimming
(MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".

I wonder why cephfs would write anything to the metadata (I'm mounting
on the clients with "noatime") while I'm just reading data from it...
What could I tune to reduce that write-load-while-reading-only?

--
Nicolas Huillard