Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread mykola.dvornik
Not sure if it is related, but I see the same issue on very different 
hardware/configuration. In particular, on large data transfers the OSDs become 
slow and start blocking requests. The iostat await on the spinners can go up to 
6(!) s (the journal is on the SSD). Looking closer at those spinners with 
blktrace suggests that the I/O requests spend most of those 6 s in the queue 
before being dispatched to the driver and eventually written to disk. I tried 
different I/O schedulers and played with their parameters, but nothing helps. 
Unfortunately, blktrace is rather fragile here: at some point it fails to start 
and only works again after the machine is rebooted. So I am still waiting for a 
suitable time slot to reboot the OSD nodes and capture I/O with blktrace again. 
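
For reference, this is roughly how I capture and summarize a trace when 
blktrace cooperates (the device name and output paths are just examples from 
my setup):

# Trace the spinner behind the slow OSD for 60 s
blktrace -d /dev/sdb -o osd-spinner -w 60

# Merge the per-CPU trace files into one binary stream for btt
blkparse -i osd-spinner -d osd-spinner.bin > /dev/null

# btt reports, among other things, D2C (time in the driver/device) and Q2C
# (total time from queueing to completion); when Q2C is much larger than D2C
# the requests are sitting in the queue, as described above.
btt -i osd-spinner.bin

# Switching the elevator on the fly (one of the things I tried)
echo deadline > /sys/block/sdb/queue/scheduler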

-Mykola

From: John Spray
Sent: Wednesday, 19 October 2016 19:17
To: Jim Kilborn
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
cache pressure, capability release, poor iostat await avg queue size

On Wed, Oct 19, 2016 at 5:17 PM, Jim Kilborn  wrote:
> John,
>
>
>
> Thanks for the tips….
>
> Unfortunately, I was looking at this page 
> http://docs.ceph.com/docs/jewel/start/os-recommendations/

OK, thanks - I've pushed an update to clarify that
(https://github.com/ceph/ceph/pull/11564).

> I’ll consider either upgrading the kernels or using the fuse client, but will 
> likely go the kernel 4.4 route
>
>
>
> As for moving to just a replicated pool, I take it that a replication size
> of 3 is the minimum recommended.
>
> If I move away from EC, I will have 9 x 4TB spinners on each of the 4
> servers. Can I put the 9 journals on the one 128GB SSD with 10GB per
> journal, or is that too many OSDs per journal SSD, creating a hot spot for
> writes?

That sounds like a lot of journals on one SSD, but people other than
me have more empirical experience in hardware selection.
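
If you do go that route, a rough back-of-the-envelope check (the numbers below
are assumptions, not measurements) is the SSD's sustained sequential write rate
divided by the number of journals sharing it, since every write to those OSDs
goes through the journal first:

# Assume ~450 MB/s sustained sequential writes for a SATA SSD like the SM863;
# split across 9 journals that leaves roughly 50 MB/s of journal bandwidth per
# OSD, and the one SSD becomes the write hot spot for the whole host.
echo $((450 / 9))    # -> 50 (MB/s per OSD journal, very roughly)

# If you try it anyway, ceph-disk will carve one journal partition per OSD on
# the shared SSD; the partition size comes from "osd journal size" (in MB) in
# ceph.conf, e.g. 10240 for 10 GB. The device names below are placeholders.
ceph-disk prepare /dev/sdX /dev/sdY    # data disk (spinner), then journal SSD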

John

>
>
>
> Thanks!!
>
>
>
>
>
>
>
>
>
>
> From: John Spray
> Sent: Wednesday, October 19, 2016 9:10 AM
> To: Jim Kilborn
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
>
>
>
> On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn  wrote:
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
>> I am basically running stock ceph settings, apart from turning the write 
>> cache off via hdparm on the drives and temporarily turning off scrubbing.
>>
>> The 4 ceph servers are all Dell R730xd with 128GB memory and dual Xeons, so 
>> server performance should be good.  Since I am running cephfs, I have 
>> tiering set up.
>> Each server has 4 x 4TB drives for the erasure coded pool, with K=3 and M=1, 
>> so the idea is to tolerate a single host failure.
>> Each server also has a 1TB Samsung 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2.
>> The cache tier also has a 128GB SM863 SSD that is being used as the journal 
>> for the cache SSD. It has power loss protection.
>> My crush map is set up to ensure the cache pool uses only the four 850 Pros 
>> and the erasure coded pool uses only the 16 spinning 4TB drives.
>>
>> The problem I am seeing is that I start copying data from our old SAN to 
>> the ceph volume, and once the cache tier gets to my target_max_bytes of 
>> 1.4 TB, I start seeing:
>>
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>>
>> osd.0 is the cache ssd
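
A quick way to check how full and how dirty the cache pool is when this starts
would be something like the following (the pool name is a placeholder, not
taken from the setup above):

# Per-pool usage, including how close the cache pool is to target_max_bytes
ceph df detail

# Current flush/evict thresholds on the cache pool
ceph osd pool get cephfs-cache target_max_bytes
ceph osd pool get cephfs-cache cache_target_dirty_ratio
ceph osd pool get cephfs-cache cache_target_full_ratio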
>>
>> If I watch iostat on the cache ssd, I see that the queue lengths are high 
>> and the await times are high.
>> Below is the iostat for the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 30 and 117 and the await is between 88ms and 1193ms.
>>
>> Device:   rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sdb          0.00     0.33    9.00   84.33     0.96    20.11   462.40    75.92  397.56  125.67  426.58  10.70  99.90
>> sdb          0.00     0.67   30.00   87.33     5.96    21.03   471.20    67.86  910.95   87.00 1193.99   8.27  97.07
>> sdb          0.00    16.67   33.00  289.33     4.21    18.80   146.20    29.83   88.99   93.91   88.43   3.10  99.83
>> sdb          0.00     7.33    7.67  261.67     1.92    19.63   163.81   117.42  331.97  182.04  336.36   3.71 100.00
>>
>>
>> If I look at the iostat for all the drives, only the cache ssd drive is 
>> backed up
>>
>> Device:   rrqm/s   wrqm/s r/s  

Re: [ceph-users] cephfs slow delete

2016-10-14 Thread mykola.dvornik
I was doing parallel deletes until the point where there were >1M objects in 
the stray directories. Then deletes fail with a ‘no space left’ error. If one 
deep-scrubs the PGs containing the corresponding metadata, they turn out to be 
inconsistent. In the worst case one ends up with virtually empty folders that 
report a size of 16EB. Those are impossible to delete as they are ‘not empty’. 
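
For the record, this is roughly how I watch the stray backlog and inspect the 
affected PGs (the MDS daemon name and the PG id are placeholders):

# Stray/purge counters straight from the MDS admin socket (run on the MDS host)
ceph daemon mds.mds01 perf dump | grep -i stray

# Live view of the stry/purg columns
ceph daemonperf mds.mds01

# PGs flagged inconsistent after the deep-scrub, and what exactly differs
ceph health detail | grep inconsistent
ceph pg deep-scrub 1.2f
rados list-inconsistent-obj 1.2f --format=json-pretty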

-Mykola

From: Gregory Farnum
Sent: Saturday, 15 October 2016 05:02
To: Mykola Dvornik
Cc: Heller, Chris; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs slow delete

On Fri, Oct 14, 2016 at 6:26 PM,   wrote:
> If you are running 10.2.3 on your cluster, then I would strongly recommend
> to NOT delete files in parallel as you might hit
> http://tracker.ceph.com/issues/17177

I don't think these have anything to do with each other. What gave you
the idea simultaneous deletes could invoke that issue?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs slow delete

2016-10-14 Thread mykola.dvornik
If you are running 10.2.3 on your cluster, then I would strongly recommend to 
NOT delete files in parallel as you might hit 
http://tracker.ceph.com/issues/17177

-Mykola

From: Heller, Chris
Sent: Saturday, 15 October 2016 03:36
To: Gregory Farnum
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs slow delete

Just a thought, but since a directory tree is a first class item in cephfs, 
could the wire protocol be extended with a “recursive delete” operation, 
specifically for cases like this?

On 10/14/16, 4:16 PM, "Gregory Farnum"  wrote:

On Fri, Oct 14, 2016 at 1:11 PM, Heller, Chris  wrote:
> Ok. Since I’m running through the Hadoop/ceph api, there is no syscall 
> boundary so there is a simple place to improve the throughput here. Good to 
> know, I’ll work on a patch…

Ah yeah, if you're in whatever they call the recursive tree delete
function you can unroll that loop a whole bunch. I forget where the
boundary is so you may need to go play with the JNI code; not sure.
-Greg


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] CephFS: No space left on device

2016-10-06 Thread mykola.dvornik
Is there any way to repair pgs/cephfs gracefully?

-Mykola

From: Yan, Zheng
Sent: Thursday, 6 October 2016 04:48
To: Mykola Dvornik
Cc: John Spray; ceph-users
Subject: Re: [ceph-users] CephFS: No space left on device

On Wed, Oct 5, 2016 at 2:27 PM, Mykola Dvornik  wrote:
> Hi Zheng,
>
> Many thanks for you reply.
>
> This indicates the MDS metadata is corrupted. Did you do any unusual
> operation on the cephfs? (e.g reset journal, create new fs using
> existing metadata pool)
>
> No, nothing has been explicitly done to the MDS. I had a few inconsistent
> PGs that belonged to the (3 replica) metadata pool. The symptoms were
> similar to http://tracker.ceph.com/issues/17177 . The PGs were eventually
> repaired and no data corruption was expected as explained in the ticket.
>

I'm afraid that issue does cause corruption.

> BTW, when I posted this issue on the ML the baseline number of stray
> objects was around 7.5K. It has now gone up to 23K. No inconsistent PGs or
> any other problems have happened to the cluster in that time.
>
> -Mykola
>
> On 5 October 2016 at 05:49, Yan, Zheng  wrote:
>>
>> On Mon, Oct 3, 2016 at 5:48 AM, Mykola Dvornik 
>> wrote:
>> > Hi Johan,
>> >
>> > Many thanks for your reply. I will try to play with the mds tunables and
>> > report back to you ASAP.
>> >
>> > So far I see that the mds log contains a lot of errors of the following
>> > kind:
>> >
>> > 2016-10-02 11:58:03.002769 7f8372d54700  0 mds.0.cache.dir(100056ddecd)
>> > _fetched  badness: got (but i already had) [inode 10005729a77 [2,head]
>> > ~mds0/stray1/10005729a77 auth v67464942 s=196728 nl=0 n(v0 b196728
>> > 1=1+0)
>> > (iversion lock) 0x7f84acae82a0] mode 33204 mtime 2016-08-07
>> > 23:06:29.776298
>> >
>> > 2016-10-02 11:58:03.002789 7f8372d54700 -1 log_channel(cluster) log
>> > [ERR] :
>> > loaded dup inode 10005729a77 [2,head] v68621 at
>> >
>> > /users/mykola/mms/NCSHNO/final/120nm-uniform-h8200/j002654.out/m_xrange192-320_yrange192-320_016232.dump,
>> > but inode 10005729a77.head v67464942 already exists at
>> > ~mds0/stray1/10005729a77
>>
>> This indicates the MDS metadata is corrupted. Did you do any unusual
>> operation on the cephfs? (e.g reset journal, create new fs using
>> existing metadata pool)
>>
>> >
>> > Those folders within mds.0.cache.dir that got badness report a size of
>> > 16EB on the clients. rm on them fails with 'Directory not empty'.
>> >
>> > As for the "Client failing to respond to cache pressure", I have 2 kernel
>> > clients on 4.4.21, 1 on 4.7.5 and 16 fuse clients always running the most
>> > recent release version of ceph-fuse. The funny thing is that every single
>> > client misbehaves from time to time. I am aware of quite some discussion
>> > about this issue on the ML, but cannot really follow how to debug it.
>> >
>> > Regards,
>> >
>> > -Mykola
>> >
>> > On 2 October 2016 at 22:27, John Spray  wrote:
>> >>
>> >> On Sun, Oct 2, 2016 at 11:09 AM, Mykola Dvornik
>> >>  wrote:
>> >> > After upgrading to 10.2.3 we frequently see messages like
>> >>
>> >> From which version did you upgrade?
>> >>
>> >> > 'rm: cannot remove '...': No space left on device
>> >> >
>> >> > The folders we are trying to delete contain approx. 50K files of 193 KB
>> >> > each.
>> >>
>> >> My guess would be that you are hitting the new
>> >> mds_bal_fragment_size_max check.  This limits the number of entries
>> >> that the MDS will create in a single directory fragment, to avoid
>> >> overwhelming the OSD with oversized objects.  It is 100000 by default.
>> >> This limit also applies to "stray" directories where unlinked files
>> >> are put while they wait to be purged, so you could get into this state
>> >> while doing lots of deletions.  There are ten stray directories that
>> >> get a roughly even share of files, so if you have more than about one
>> >> million files waiting to be purged, you could see this condition.
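
That limit can be checked (and, with care, raised) through the admin socket
while a purge backlog drains; the daemon name below is a placeholder:

# Current value on the running MDS
ceph daemon mds.mds01 config get mds_bal_fragment_size_max

# Raising it temporarily lets a large stray backlog drain, at the cost of
# larger directory objects in the metadata pool (the value is illustrative)
ceph daemon mds.mds01 config set mds_bal_fragment_size_max 200000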
>> >>
>> >> The "Client failing to respond to cache pressure" messages may play a
>> >> part here -- if you have misbehaving clients then they may cause the
>> >> MDS to delay purging stray files, leading to a backlog.  If your
>> >> clients are by any chance older kernel clients, you should upgrade
>> >> them.  You can also unmount/remount them to clear this state, although
>> >> it will reoccur until the clients are updated (or until the bug is
>> >> fixed, if you're running latest clients already).
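
To see which clients are attached and what they are running, the MDS admin
socket lists every session with its client metadata (ceph-fuse reports its
version, and recent kernel clients report their kernel version); the daemon
name is a placeholder:

ceph daemon mds.mds01 session ls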
>> >>
>> >> The high level counters for strays are part of the default output of
>> >> "ceph daemonperf mds." when run on the MDS server (the "stry" and
>> >> "purg" columns).  You can look at these to watch how fast the MDS is
>> >> clearing out strays.  If your backlog is just because it's not doing
>> >> it fast enough, then you can look at tuning mds_max_purge_files and
>> >> mds_max_purge_ops to adjust the throttles on purging.  Those settings
>> >> can
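
In practice that looks something like the following (the daemon name and the
values are placeholders, not recommendations):

# Watch the stry/purg columns live on the MDS host
ceph daemonperf mds.mds01

# Loosen the purge throttles if the backlog only drains slowly
ceph daemon mds.mds01 config set mds_max_purge_files 256
ceph daemon mds.mds01 config set mds_max_purge_ops 16384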