[ceph-users] 2 pgs stuck in active+clean+inconsistent

2014-06-06 Thread Benedikt Fraunhofer
Hello List, the other day when I looked at our ceph cluster it showed: health HEALTH_ERR 135 pgs inconsistent; 1 pgs recovering; recovery 76/4633296 objects degraded (0.002%); 169 scrub errors; clock skew detected on mon.mon2-nb8 I did a ceph pg dump | grep -i incons | cut -f 1 | while r
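
The preview cuts off the shell loop; a minimal sketch of what such a repair loop typically looks like (an assumption pieced together from the fragment above and the repair commands quoted later in this thread, not the poster's exact command; the "pg" variable name is mine):

  ceph pg dump | grep -i incons | cut -f 1 | \
    while read pg; do
      # ask the primary OSD of each inconsistent PG to repair it
      ceph pg repair "$pg"
    done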

Re: [ceph-users] 2 pgs stuck in active+clean+inconsistent

2014-06-06 Thread Benedikt Fraunhofer
2014-06-06 9:18 GMT+02:00 Benedikt Fraunhofer : Hello List, > and it logs nothing in "ceph -w" when I issue > > ceph pg repair 2.c1 > instructing pg 2.c1 on osd.51 to repair > ceph pg repair 2.68 > instructing pg 2.68 on osd.69 to repair Rebooting the hosts

Re: [ceph-users] How to improve single thread sequential reads?

2015-08-18 Thread Benedikt Fraunhofer
Hi Nick, did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s. radosgw for cold data is around 90MB/s, which is imho limited by the speed of a single disk. Data already present on the

Re: [ceph-users] How to calculate file size when mount a block device from rbd image

2014-10-20 Thread Benedikt Fraunhofer
Hi Mika, 2014-10-20 11:16 GMT+02:00 Vickie CH : > 2.Use dd command to create a 1.2T file. > #dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=12288000 I think you're off by one "zero": 12288000/1024/1024 ≈ 11, which means you're instructing it to create an 11TB file on a 1.5T volume. Cheers
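
For reference, the arithmetic behind the "off by one zero" remark, plus the command presumably intended (the corrected count is my assumption, simply dropping one zero):

  # bs=1M means count is a number of MiB:
  #   12288000 MiB / 1024 / 1024 ≈ 11.7 TiB   (what the posted command creates)
  #    1228800 MiB / 1024 / 1024 ≈  1.2 TiB   (presumably what was intended)
  dd if=/dev/zero of=/mnt/ceph-mount/test12T bs=1M count=1228800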

[ceph-users] radosgw multipart-uploaded downloads fail

2014-04-01 Thread Benedikt Fraunhofer
Hello everyone, I can't download anything that's been uploaded as a multipart-upload. I'm on 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4) on a non-ec-pool. The upload is acknowledged as being OK: 2014-03-31 14:56:56.722727 7f4080ff9700 2 req 8:0.023285:s3:POST /file1:complete_multipart:http

Re: [ceph-users] radosgw multipart-uploaded downloads fail

2014-04-01 Thread Benedikt Fraunhofer
Hi Yehuda, > 2014-04-01 15:49 GMT+02:00 Yehuda Sadeh : > It could be the gateway's fault, might be related to the new manifest > that went in before 0.78. I'll need more logs though, can you > reproduce with 'debug ms = 1', and 'debug rgw = 20', and provide a log > for all the upload and for the
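
For anyone trying to capture the logs Yehuda asked for, a sketch of how those two debug options are typically raised in ceph.conf for the gateway (the section name below is the common convention, not necessarily this poster's setup):

  [client.radosgw.gateway]
      debug rgw = 20
      debug ms = 1

After restarting the radosgw process, the detailed per-request logging ends up in the gateway's log file.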

Re: [ceph-users] radosgw multipart-uploaded downloads fail

2014-04-02 Thread Benedikt Fraunhofer
Hi Yehuda, I tried your patch and it feels fine, except you might need some special handling for those already corrupt uploads, as trying to delete them gets radosgw into an endless loop and high CPU usage: 2014-04-02 11:03:15.045627 7fbf157d2700 0 RGWObjManifest::operator++(): result: ofs=3355443

Re: [ceph-users] radosgw multipart-uploaded downloads fail

2014-04-04 Thread Benedikt Fraunhofer
2014-04-04 0:31 GMT+02:00 Yehuda Sadeh : Hi Yehuda, sorry for the delay. We ran into another problem and this took up all the time. >> Are you running the version off the master branch, or did you just >> cherry-pick the patch? I can't seem to reproduce the problem. I just patched that line in a

Re: [ceph-users] radosgw multipart-uploaded downloads fail

2014-04-08 Thread Benedikt Fraunhofer
Hello Yehuda, 2014-04-04 9:35 GMT+02:00 Benedikt Fraunhofer : > I'm currently trying to find a faster box with enough free space to > reproduce that and capture logs. Here's the complete log with "debug rgw 20" and "debug ms 1" of a failed large-ish multi

[ceph-users] after loss of journal, osd fails to start with failed assert OSDMapRef OSDService::get_map(epoch_t) ret != null

2015-12-07 Thread Benedikt Fraunhofer
Hello List, after a crash of a box, the journal vanished. Creating a new one with --mkjournal results in the osd being unable to start. Does anyone want to dissect this any further, or should I just trash the osd and recreate it? Thx in advance Benedikt 2015-12-01 07:46:31.505255 7fadb7f1e9
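
For context, the journal recreation step referenced above is usually along these lines (a sketch; <osd-id> is the numeric OSD id and the exact service commands depend on the init system):

  # with the OSD process stopped:
  ceph-osd -i <osd-id> --mkjournal
  # then start the OSD again and watch its log for the assert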

[ceph-users] osd dies on pg repair with FAILED assert(!out->snaps.empty())

2015-12-07 Thread Benedikt Fraunhofer
Hello Cephers! Trying to repair an inconsistent PG results in the osd dying with an assertion failure: 0> 2015-12-01 07:22:13.398006 7f76d6594700 -1 osd/SnapMapper.cc: In function 'int SnapMapper::get_snaps(const hobject_t&, SnapMapper::object_snaps*)' thread 7f76d6594700 time 2015-12-01 07

[ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
Hello Cephers, lately our ceph cluster started to show some weird behavior: the osd boxes show a load of 5000-15000 before the osds get marked down. Usually the box is fully usable; even "apt-get dist-upgrade" runs smoothly and you can read and write to any disk. The only things you can't do are strace
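
A quick way to check whether the box is simply running out of process/thread IDs, which is where this thread eventually ends up (a generic diagnostic sketch, not from the original mail):

  # total threads on the box vs. the kernel limits
  ps -eLf | wc -l
  cat /proc/sys/kernel/pid_max /proc/sys/kernel/threads-max
  # threads belonging to ceph-osd alone
  ps -eLf | grep -c '[c]eph-osd'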

Re: [ceph-users] after loss of journal, osd fails to start with failed assert OSDMapRef OSDService::get_map(epoch_t) ret != null

2015-12-07 Thread Benedikt Fraunhofer
Hi Jan, 2015-12-08 8:12 GMT+01:00 Jan Schermer : > Journal doesn't just "vanish", though, so you should investigate further... We tried putting journals as files to work around the changes in ceph-deploy where you can't have the journals unencrypted but only the disks themselves. (and/or you can't have

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
= 4194304 > (I think it also sets this as well: kernel.threads-max = 4194304) > > I think you are running out of process IDs. > > Jan > >> On 08 Dec 2015, at 08:10, Benedikt Fraunhofer wrote: >> >> Hello Cephers, >> >> lately, our ceph-cluster started t

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
Hi Jan, we had 65k for pid_max, which made kernel.threads-max = 1030520, or kernel.threads-max = 256832 (looks like it depends on the number of CPUs?). Currently we have: root@ceph1-store209:~# sysctl -a | grep -e thread -e pid kernel.cad_pid = 1 kernel.core_uses_pid = 0 kernel.ns_last_pid = 60298 k
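
A sketch of the change Jan is suggesting in this thread (the 4194304 values come from his mail; the sysctl.d filename is only an example, and whether threads-max must be raised explicitly depends on the kernel):

  # /etc/sysctl.d/90-ceph-pid-max.conf
  kernel.pid_max = 4194304
  kernel.threads-max = 4194304
  # apply without a reboot:
  sysctl -p /etc/sysctl.d/90-ceph-pid-max.conf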

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
s case - don't get any data to work with. Thx Benedikt 2015-12-08 8:44 GMT+01:00 Jan Schermer : > > Jan > > >> On 08 Dec 2015, at 08:41, Benedikt Fraunhofer wrote: >> >> Hi Jan, >> >> we had 65k for pid_max, which made >> kernel.threads-max

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Benedikt Fraunhofer
Hi Tom, > We have been seeing this same behavior on a cluster that has been perfectly > happy until we upgraded to the ubuntu vivid 3.19 kernel. We are in the I can't recall when we gave 3.19 a shot, but now that you say it... The cluster was happy for >9 months with 3.16. Did you try 4.2 or do y

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Benedikt Fraunhofer
Hi Tom, 2015-12-08 10:34 GMT+01:00 Tom Christensen : > We didn't go forward to 4.2 as its a large production cluster, and we just > needed the problem fixed. We'll probably test out 4.2 in the next couple Unfortunately we don't have the luxury of a test cluster. And to add to that, we couldn't s