Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Benedikt Fraunhofer
Hi Tom,

> We have been seeing this same behavior on a cluster that has been perfectly
> happy until we upgraded to the ubuntu vivid 3.19 kernel. We are in the

I can't recall when we gave 3.19 a shot, but now that you say it... The cluster was happy for >9 months with 3.16. Did you try 4.2 or do

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Benedikt Fraunhofer
Hi Tom,

2015-12-08 10:34 GMT+01:00 Tom Christensen :
> We didn't go forward to 4.2 as it's a large production cluster, and we just
> needed the problem fixed. We'll probably test out 4.2 in the next couple

Unfortunately we don't have the luxury of a test cluster. And to add to

[ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
Hello Cephers,

Lately, our Ceph cluster has started to show some weird behavior: the OSD boxes show a load of 5000-15000 before the OSDs get marked down. Usually the box is fully usable; even "apt-get dist-upgrade" runs smoothly, and you can read from and write to any disk. The only things you can't do are
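A load that high usually reflects thousands of threads stuck in uninterruptible (D-state) sleep rather than CPU work; a quick illustrative way to count them per command (not from the original mail) is:

    # count D-state threads grouped by command name
    ps -eLo state,comm | awk '$1 == "D"' | sort | uniq -c | sort -rn | head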

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
> kernel.pid_max = 4194304
> (I think it also sets this as well: kernel.threads-max = 4194304)
>
> I think you are running out of process IDs.
>
> Jan
>
>> On 08 Dec 2015, at 08:10, Benedikt Fraunhofer <fraunho...@traced.net> wrote:
>>
>> Hello Cephers,
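For reference, raising these limits at runtime and persisting them looks roughly like this (values as quoted above; the sysctl.d file name is illustrative):

    sysctl -w kernel.pid_max=4194304
    sysctl -w kernel.threads-max=4194304
    # persist across reboots (illustrative file name)
    printf 'kernel.pid_max = 4194304\nkernel.threads-max = 4194304\n' > /etc/sysctl.d/90-pid-max.conf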

Re: [ceph-users] after loss of journal, osd fails to start with failed assert OSDMapRef OSDService::get_map(epoch_t) ret != null

2015-12-07 Thread Benedikt Fraunhofer
Hi Jan,

2015-12-08 8:12 GMT+01:00 Jan Schermer :
> Journal doesn't just "vanish", though, so you should investigate further...

We tried putting journals as files to overcome the changes in ceph-deploy where you can't leave the journals unencrypted and encrypt only the disks themselves.

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
Hi Jan,

we had 65k for pid_max, which made kernel.threads-max = 1030520, or kernel.threads-max = 256832 (looks like it depends on the number of CPUs?). Currently we have:

root@ceph1-store209:~# sysctl -a | grep -e thread -e pid
kernel.cad_pid = 1
kernel.core_uses_pid = 0
kernel.ns_last_pid = 60298

[ceph-users] osd dies on pg repair with FAILED assert(!out->snaps.empty())

2015-12-07 Thread Benedikt Fraunhofer
Hello Cephers!

Trying to repair an inconsistent PG results in the OSD dying with an assertion failure:

0> 2015-12-01 07:22:13.398006 7f76d6594700 -1 osd/SnapMapper.cc: In function 'int SnapMapper::get_snaps(const hobject_t&, SnapMapper::object_snaps*)' thread 7f76d6594700 time 2015-12-01

[ceph-users] after loss of journal, osd fails to start with failed assert OSDMapRef OSDService::get_map(epoch_t) ret != null

2015-12-07 Thread Benedikt Fraunhofer
Hello List,

After a box crashed, the journal vanished. Creating a new one with --mkjournal results in the OSD being unable to start. Does anyone want to dissect this any further, or should I just trash the OSD and recreate it?

Thx in advance
Benedikt

2015-12-01 07:46:31.505255
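For context, recreating an OSD journal is usually done along these lines (a sketch; $ID stands for the affected OSD's id, and the service commands assume Ubuntu's upstart of that era):

    stop ceph-osd id=$ID
    ceph-osd -i $ID --mkjournal
    start ceph-osd id=$ID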

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
get any data to work with.

Thx
Benedikt

2015-12-08 8:44 GMT+01:00 Jan Schermer <j...@schermer.cz>:
>
> Jan
>
>> On 08 Dec 2015, at 08:41, Benedikt Fraunhofer <fraunho...@traced.net> wrote:
>>
>> Hi Jan,
>>
>> we had 65k for pid_max, which

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Benedikt Fraunhofer
Hi Nick,

Did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s. radosgw for cold data is around 90MB/s, which is imho limited by the speed of a single disk. Data already present on the
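For comparison, a simple way to measure single-threaded cold sequential reads on a mapped RBD device (device name assumed for illustration):

    sync; echo 3 > /proc/sys/vm/drop_caches    # drop local caches so the data is cold
    dd if=/dev/rbd0 of=/dev/null bs=4M count=2048    # read 8 GiB sequentially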

Re: [ceph-users] How to calculate file size when mount a block device from rbd image

2014-10-20 Thread Benedikt Fraunhofer
Hi Mika,

2014-10-20 11:16 GMT+02:00 Vickie CH mika.leaf...@gmail.com:
> 2. Use dd command to create a 1.2T file.
> #dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=12288000

I think you're off by one zero: 12288000/1024/1024 ≈ 11, which means you're instructing it to create an 11TB file on a 1.5T
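The arithmetic, spelled out: with bs=1M the count is in MiB, so 12288000 MiB / 1024 / 1024 ≈ 11.7 TiB. For a ~1.2T file, drop one zero:

    # 1228800 MiB / 1024 / 1024 ≈ 1.2 TiB
    dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=1228800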

[ceph-users] 2 pgs stuck in active+clean+inconsistent

2014-06-06 Thread Benedikt Fraunhofer
Hello List,

The other day when I looked at our Ceph cluster it showed:

health HEALTH_ERR 135 pgs inconsistent; 1 pgs recovering; recovery 76/4633296 objects degraded (0.002%); 169 scrub errors; clock skew detected on mon.mon2-nb8

I did a

ceph pg dump | grep -i incons | cut -f 1 | while
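The preview cuts the loop short; the usual shape of such a loop, assuming the intent was to issue a repair for every inconsistent PG, is:

    ceph pg dump | grep -i incons | cut -f 1 | while read pg; do
        ceph pg repair "$pg"
    done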

Re: [ceph-users] 2 pgs stuck in active+clean+inconsistent

2014-06-06 Thread Benedikt Fraunhofer
2014-06-06 9:18 GMT+02:00 Benedikt Fraunhofer given.to.lists.ceph-users.ceph.com.toasta@traced.net:
> Hello List,

and it logs nothing in ceph -w when I issue:

ceph pg repair 2.c1
instructing pg 2.c1 on osd.51 to repair
ceph pg repair 2.68
instructing pg 2.68 on osd.69 to repair

Re: [ceph-users] radosgw multipart-uploaded downloads fail

2014-04-04 Thread Benedikt Fraunhofer
2014-04-04 0:31 GMT+02:00 Yehuda Sadeh yeh...@inktank.com:

Hi Yehuda,

sorry for the delay. We ran into another problem and this took up all the time.

> Are you running the version off the master branch, or did you just cherry-pick
> the patch? I can't seem to reproduce the problem.

I just patched

Re: [ceph-users] radosgw multipart-uploaded downloads fail

2014-04-02 Thread Benedikt Fraunhofer
Hi Yehuda,

I tried your patch and it feels fine, except you might need some special handling for those already-corrupt uploads, as trying to delete them gets radosgw into an endless loop with high CPU usage:

2014-04-02 11:03:15.045627 7fbf157d2700 0 RGWObjManifest::operator++(): result:

[ceph-users] radosgw multipart-uploaded downloads fail

2014-04-01 Thread Benedikt Fraunhofer
Hello everyone,

I can't download anything that's been uploaded as a multipart upload. I'm on 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4) on a non-EC pool. The upload is acknowledged as being OK:

2014-03-31 14:56:56.722727 7f4080ff9700 2 req 8:0.023285:s3:POST

Re: [ceph-users] radosgw multipart-uploaded downloads fail

2014-04-01 Thread Benedikt Fraunhofer
Hi Yehuda,

2014-04-01 15:49 GMT+02:00 Yehuda Sadeh yeh...@inktank.com:
> It could be the gateway's fault, might be related to the new manifest that
> went in before 0.78. I'll need more logs though, can you reproduce with
> 'debug ms = 1', and 'debug rgw = 20', and provide a log for all the
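The debug levels Yehuda asks for would typically go into the gateway's section of ceph.conf, roughly like this (the section name and log path are illustrative):

    [client.radosgw.gateway]
        debug ms = 1
        debug rgw = 20
        log file = /var/log/ceph/radosgw.log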