Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-20 Thread Yan, Zheng
On Mon, Mar 19, 2018 at 11:45 PM, Nicolas Huillard wrote:
> On Monday 19 March 2018 at 15:30 +0300, Sergey Malinin wrote:
>> Default for mds_log_events_per_segment is 1024; in my setup I ended
>> up with 8192.
>> I calculated that value as IOPS / log segments × 5 seconds (AFAIK the
>> MDS performs journal maintenance once every 5 seconds by default).
>
> I tried 4096 from the initial 1024, then 8192 at the time of your
> answer, then 16384, without much improvement...
>
> Then I tried to reduce the number of MDSs, from 4 to 1, which definitely
> works (sorry if my initial mail didn't make it very clear that I was
> using multiple MDSs, even though it mentioned mds.2).
> I now have a low rate of metadata writes (40-50kBps), and the inter-DC
> link load reflects the size and direction of the actual data.
>
> I'll now try to reduce mds_log_events_per_segment back to its original
> value (1024), because performance is not optimal and the transfer
> stutters a bit too much.
>
> Thanks for your advice!
>

This seems like a load balancer bug. Improving the load balancer is at
the top of our todo list.
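
In the meantime, pinning busy subtrees to a single rank with the
ceph.dir.pin vxattr (available since Luminous) can keep the balancer
out of the picture. A rough sketch, assuming the filesystem is mounted
at /mnt/cephfs (the path is hypothetical):

  # pin this subtree (and everything under it) to MDS rank 0
  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/some/busy/tree
  # a value of -1 returns the subtree to normal balancing
  setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/some/busy/tree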

Regards
Yan, Zheng

> --
> Nicolas Huillard


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
> On Monday, March 19, 2018 at 18:45, Nicolas Huillard wrote:
> > Then I tried to reduce the number of MDSs, from 4 to 1,

> On Monday 19 March 2018 at 19:15 +0300, Sergey Malinin wrote:
> Forgot to mention that in my setup the issue was gone once I reverted
> back to a single MDS and switched dirfrag off.

So it appears we had the same problem, and applied the same solution ;-)
I reverted mds_log_events_per_segment back to 1024 without problems.

Bandwidth utilisation is OK, destination (single SATA disk) throughput
depends on file sizes (lots of tiny files = 1MBps; big files = 30MBps),
and running 2 rsyncs in parallel only improves things.
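
For reference, a minimal sketch of the parallel run (mount points and
subtree names are hypothetical; the two subtrees must be disjoint):

  # two rsyncs over disjoint subtrees, from the cephfs mount to the disk
  rsync -a /mnt/cephfs/projects/ /mnt/sata/projects/ &
  rsync -a /mnt/cephfs/homes/    /mnt/sata/homes/ &
  wait  # until both transfers have finished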

Thanks!

-- 
Nicolas Huillard


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
Forgot to mention that in my setup the issue was gone once I reverted
back to a single MDS and switched dirfrag off.
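
If memory serves, fragmentation is controlled by a filesystem flag; a
rough sketch, assuming the filesystem is named "cephfs" (the flag name
is from memory, so worth double-checking against your release):

  # disable directory fragmentation on the filesystem
  ceph fs set cephfs allow_dirfrags false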


On Monday, March 19, 2018 at 18:45, Nicolas Huillard wrote:

> Then I tried to reduce the number of MDSs, from 4 to 1,



Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
On Monday 19 March 2018 at 15:30 +0300, Sergey Malinin wrote:
> Default for mds_log_events_per_segment is 1024; in my setup I ended
> up with 8192.
> I calculated that value as IOPS / log segments × 5 seconds (AFAIK the
> MDS performs journal maintenance once every 5 seconds by default).

I tried 4096 from the initial 1024, then 8192 at the time of your
answer, then 16384, without much improvement...

Then I tried to reduce the number of MDSs, from 4 to 1, which definitely
works (sorry if my initial mail didn't make it very clear that I was
using multiple MDSs, even though it mentioned mds.2).
I now have a low rate of metadata writes (40-50kBps), and the inter-DC
link load reflects the size and direction of the actual data.
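
For the record, a rough sketch of the Luminous-era procedure (the
filesystem name "cephfs" is an assumption; extra ranks are deactivated
one at a time, highest first):

  ceph fs set cephfs max_mds 1
  ceph mds deactivate cephfs:3
  ceph mds deactivate cephfs:2
  ceph mds deactivate cephfs:1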

I'll now try to reduce mds_log_events_per_segment back to its original
value (1024), because performance is not optimal and the transfer
stutters a bit too much.

Thanks for your advice!

-- 
Nicolas Huillard


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
Default for mds_log_events_per_segment is 1024; in my setup I ended up
with 8192.
I calculated that value as IOPS / log segments × 5 seconds (AFAIK the
MDS performs journal maintenance once every 5 seconds by default).
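
A worked example with purely hypothetical numbers, using the default
mds_log_max_segments of 30 and the ~5-second flush interval:

  # assume ~50,000 metadata journal events/s at peak (made-up figure)
  # 50000 / 30 segments * 5 s ≈ 8333 -> rounded down to 8192
  ceph tell mds.* injectargs '--mds_log_events_per_segment 8192'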


On Monday, March 19, 2018 at 15:20, Nicolas Huillard wrote:

> I can't find any doc about that mds_log_events_per_segment setting,
> especially on how to choose a good value.
> Can you elaborate on "original value multiplied several times" ?




Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
On Monday 19 March 2018 at 10:01, Sergey Malinin wrote:
> I experienced the same issue and was able to reduce metadata writes
> by raising mds_log_events_per_segment to
> its original value multiplied several times.

I changed it from 1024 to 4096:
* rsync status (1 line per file) scrolls much more quickly
* OSD writes on the dashboard are much lower than reads now (they were
much higher before)
* metadata pool write rate is in the 20-800kBps range now, while
metadata reads are in the 20-80kBps range
* data pool reads are in the hundreds of kBps, which still seems very
low
* destination disk write rate is a bit higher than the data pool read
rate (expected for btrfs), but still low
* inter-DC network load is now 1-50Mbps

I'll monitor the Munin graphs in the long run.

I can't find any doc about that mds_log_events_per_segment setting,
especially on how to choose a good value.
Can you elaborate on "original value multiplied several times"?
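
In the meantime, the live value can at least be inspected on a running
MDS (the daemon name "mds.a" below is a placeholder for one of yours):

  ceph daemon mds.a config get mds_log_events_per_segment
  ceph daemon mds.a config show | grep mds_log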

I'm just seeing more MDS_TRIM warnings now. Maybe restarting the MDSs
just delayed the re-emergence of the initial problem.

> 
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of
> Nicolas Huillard <nhuill...@dolomede.fr>
> Sent: Monday, March 19, 2018 12:01:09 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Huge amount of cephfs metadata writes while
> only reading data (rsync from storage, to single disk)
> 
> Hi all,
> 
> I'm experimenting with a new little storage cluster. I wanted to take
> advantage of the weekend to copy all data (1TB, 10M objects) from the
> cluster to a single SATA disk. I expected to saturate the SATA disk
> while writing to it, but the storage cluster actually saturates its
> network links, while barely writing to the destination disk (63GB
> written in 20h, that's less than 1MBps).
> 
> Setup: 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
> 12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
> between datacenters (12ms latency). 4 clients using a single cephfs
> storing data + metadata on the same spinning disks with bluestore.
> 
> Test: I'm using a single rsync on one of the client servers (the other
> 3 are just sitting there). rsync is local to the client, copying from
> the cephfs mount (kernel client on 4.14 from stretch-backports, just to
> use a potentially more recent cephfs client than on stock 4.9), to the
> SATA disk. The rsync'ed tree consists of lots of tiny files (1-3kB) in
> deep directory branches, along with some large files (10-100MB) in a
> few directories. There is no other activity on the cluster.
> 
> Observations: I initially saw write performance on the destination
> disk from a few 100kBps (during exploration of branches with tiny
> files)
> to a few 10MBps (while copying large files), essentially seeing the
> file names scrolling at a relatively fixed rate, unrelated to their
> individual size.
> After 5 hours, the fibre link started to saturate at 200Mbps, while
> destination disk writes are down to a few 10kBps.
> 
> Using the dashboard, I see lots of metadata writes, at a 30MBps rate on
> the metadata pool, which correlates to the 200Mbps link rate.
> It also shows regular "Health check failed: 1 MDSs behind on trimming
> (MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming
> (64/30)".
> 
> I wonder why cephfs would write anything to the metadata pool (I'm
> mounting on the clients with "noatime"), while I'm just reading data
> from it...
> What could I tune to reduce that write-load-while-reading-only?
> 
> --
> Nicolas Huillard
-- 
Nicolas Huillard
Associé fondateur - Directeur Technique - Dolomède

nhuill...@dolomede.fr
Fixe : +33 9 52 31 06 10
Mobile : +33 6 50 27 69 08
http://www.dolomede.fr/

https://reseauactionclimat.org/planetman/
http://climat-2020.eu/
http://www.350.org/


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
I experienced the same issue and was able to reduce metadata writes by
raising mds_log_events_per_segment to its original value multiplied
several times.

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Nicolas 
Huillard <nhuill...@dolomede.fr>
Sent: Monday, March 19, 2018 12:01:09 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Huge amount of cephfs metadata writes while only reading 
data (rsync from storage, to single disk)

Hi all,

I'm experimenting with a new little storage cluster. I wanted to take
advantage of the weekend to copy all data (1TB, 10M objects) from the
cluster to a single SATA disk. I expected to saturate the SATA disk
while writing to it, but the storage cluster actually saturates its
network links, while barely writing to the destination disk (63GB
written in 20h, that's less than 1MBps).

Setup: 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
between datacenters (12ms latency). 4 clients using a single cephfs
storing data + metadata on the same spinning disks with bluestore.

Test: I'm using a single rsync on one of the client servers (the other
3 are just sitting there). rsync is local to the client, copying from
the cephfs mount (kernel client on 4.14 from stretch-backports, just to
use a potentially more recent cephfs client than on stock 4.9), to the
SATA disk. The rsync'ed tree consists of lots of tiny files (1-3kB) in
deep directory branches, along with some large files (10-100MB) in a
few directories. There is no other activity on the cluster.

Observations: I initially saw write performance on the destination
disk from a few 100kBps (during exploration of branches with tiny files)
to a few 10MBps (while copying large files), essentially seeing the
file names scrolling at a relatively fixed rate, unrelated to their
individual size.
After 5 hours, the fibre link started to saturate at 200Mbps, while
destination disk writes are down to a few 10kBps.

Using the dashboard, I see lots of metadata writes, at a 30MBps rate on
the metadata pool, which correlates to the 200Mbps link rate.
It also shows regular "Health check failed: 1 MDSs behind on trimming
(MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".

I wonder why cephfs would write anything to the metadata pool (I'm
mounting on the clients with "noatime"), while I'm just reading data
from it...
What could I tune to reduce that write-load-while-reading-only?

--
Nicolas Huillard


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Gregory Farnum
The MDS has to write to its local journal when clients open files, in
case of certain kinds of failures.

I guess it doesn't distinguish between read-only file opens (when it
could *probably* avoid writing them down, although it's not as simple a
thing as it sounds) and writeable ones. So every file you're
opening requires the MDS to commit to disk, and it apparently filled
up its allowable mds log size and now you're stuck on that inter-DC
link. A temporary workaround might be to just keep turning up the mds
log sizes, but I'm sort of surprised it was absorbing stuff at a
useful rate before, so I don't know if changing those will help or
not.
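
If you try that, the knob and a way to watch journal pressure look
roughly like this (the daemon name "mds.a" is a placeholder, and this
is a sketch rather than a recommendation):

  # events ("ev") and segments ("seg") currently held in the journal
  ceph daemon mds.a perf dump mds_log
  # raise the trim ceiling from its default of 30 segments
  ceph tell mds.* injectargs '--mds_log_max_segments 60'
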
-Greg

On Mon, Mar 19, 2018 at 5:01 PM, Nicolas Huillard wrote:
> Hi all,
>
> I'm experimenting with a new little storage cluster. I wanted to take
> advantage of the weekend to copy all data (1TB, 10M objects) from the
> cluster to a single SATA disk. I expected to saturate the SATA disk
> while writing to it, but the storage cluster actually saturates its
> network links, while barely writing to the destination disk (63GB
> written in 20h, that's less than 1MBps).
>
> Setup: 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
> 12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
> between datacenters (12ms latency). 4 clients using a single cephfs
> storing data + metadata on the same spinning disks with bluestore.
>
> Test: I'm using a single rsync on one of the client servers (the other
> 3 are just sitting there). rsync is local to the client, copying from
> the cephfs mount (kernel client on 4.14 from stretch-backports, just to
> use a potentially more recent cephfs client than on stock 4.9), to the
> SATA disk. The rsync'ed tree consists of lots of tiny files (1-3kB) in
> deep directory branches, along with some large files (10-100MB) in a
> few directories. There is no other activity on the cluster.
>
> Observations: I initially saw write performance on the destination
> disk from a few 100kBps (during exploration of branches with tiny files)
> to a few 10MBps (while copying large files), essentially seeing the
> file names scrolling at a relatively fixed rate, unrelated to their
> individual size.
> After 5 hours, the fibre link started to saturate at 200Mbps, while
> destination disk writes are down to a few 10kBps.
>
> Using the dashboard, I see lots of metadata writes, at a 30MBps rate on
> the metadata pool, which correlates to the 200Mbps link rate.
> It also shows regular "Health check failed: 1 MDSs behind on trimming
> (MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".
>
> I wonder why cephfs would write anything to the metadata pool (I'm
> mounting on the clients with "noatime"), while I'm just reading data
> from it...
> What could I tune to reduce that write-load-while-reading-only?
>
> --
> Nicolas Huillard


[ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
Hi all,

I'm experimenting with a new little storage cluster. I wanted to take
advantage of the weekend to copy all data (1TB, 10M objects) from the
cluster to a single SATA disk. I expected to saturate the SATA disk
while writing to it, but the storage cluster actually saturates its
network links, while barely writing to the destination disk (63GB
written in 20h, that's less than 1MBps).

Setup: 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
between datacenters (12ms latency). 4 clients using a single cephfs
storing data + metadata on the same spinning disks with bluestore.

Test: I'm using a single rsync on one of the client servers (the other
3 are just sitting there). rsync is local to the client, copying from
the cephfs mount (kernel client on 4.14 from stretch-backports, just to
use a potentially more recent cephfs client than on stock 4.9), to the
SATA disk. The rsync'ed tree consists of lots of tiny files (1-3kB) in
deep directory branches, along with some large files (10-100MB) in a
few directories. There is no other activity on the cluster.

Observations: I initially saw write performance on the destination
disk from a few 100kBps (during exploration of branches with tiny files)
to a few 10MBps (while copying large files), essentially seeing the
file names scrolling at a relatively fixed rate, unrelated to their
individual size.
After 5 hours, the fibre link started to saturate at 200Mbps, while
destination disk writes are down to a few 10kBps.

Using the dashboard, I see lots of metadata writes, at a 30MBps rate on
the metadata pool, which correlates to the 200Mbps link rate.
It also shows regular "Health check failed: 1 MDSs behind on trimming
(MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".
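
The same rates are visible from the CLI (the pool names below are the
usual defaults, which is an assumption on my part):

  ceph osd pool stats cephfs_metadata
  ceph osd pool stats cephfs_data
  ceph health detail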

I wonder why cephfs would write anything to the metadata pool (I'm
mounting on the clients with "noatime"), while I'm just reading data
from it...
What could I tune to reduce that write-load-while-reading-only?

-- 
Nicolas Huillard