Re: [ceph-users] rsync kernel client cephfs mkstemp no space left on device

2016-12-06 Thread Hauke Homburg
Am 10.10.2016 um 10:05 schrieb Hauke Homburg:
> Am 07.10.2016 um 17:37 schrieb Gregory Farnum:
>> On Fri, Oct 7, 2016 at 7:15 AM, Hauke Homburg  
>> wrote:
>>> Hello,
>>>
>>> I have a Ceph cluster with 5 servers and 40 OSDs. Currently this cluster
>>> has 85GB of free space, and the rsync source dir has lots of pictures and
>>> a data volume of 40GB.
>>>
>>> The servers run CentOS 7 with the last stable Ceph. The client is a Debian
>>> 8 with a 4.x kernel, and the cluster is mounted via cephfs.
>>>
>>> When I sync the directory I often see the message "rsync mkstemp no space
>>> left on device (28)". At that point I can still touch a file in another
>>> directory in the cluster. In the directory I have ~ 63 files. Is that too
>>> many files?
>> Yes, in recent releases CephFS limits you to 100k dentries in a single
>> directory fragment. This *includes* the "stray" directories that files
>> get moved into when you unlink them, and is intended to prevent issues
>> with very large folders. It will stop being a problem once we enable
>> automatic fragmenting (soon, hopefully).
>> You can change that by changing the "mds bal fragment size max"
>> config, but you're probably better off by figuring out if you've got
>> an over-large directory or if you're deleting files faster than the
>> cluster can keep up. There was a thread about this very recently and
>> John included some details about tuning if you check the archives. :)
>> -Greg
> Hello,
>
> Thanks for the answer.
> I enabled the "mds bal frag = true" option on the cluster.
>
> Today I read that I have to enable this option on the client, too. With
> a FUSE mount I can do it with the ceph binary, but I use the kernel module.
> How can I do it there?
>
> Regards
>
> Hauke
>

Hello,

After some discussion in our team we have deleted the CephFS and
switched to RBD with ext4 on top.

Our new setup is:

1 Ceph cluster, Jewel 10.2.3
5 servers with Ceph 10.2.3; the client has the rbd tools installed. We pass
all 5 mons of our cluster to every rbd map call, to have failover.
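For illustration, the map call looks roughly like this (user, pool, image and
hostnames are placeholders, not our real names):

# list every monitor so the kernel client can fail over between them
rbd map rbd/backup01 --id admin \
    -m mon1.example.net,mon2.example.net,mon3.example.net,mon4.example.net,mon5.example.net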

Currently we have the problem that we can store data in the cluster
with rsync, but when rsync deletes files, ext4 reports filesystem
errors.

My understanding of Ceph with RBD was that I could use it as a cluster
filesystem like OCFS2, so I don't understand why I get filesystem errors.

I read in some postings here that Ceph needs filesystem locking like DLM. Is
this true, even in the current Jewel version? Doesn't libceph do this locking?

Thanks for Help

Hauke

-- 
www.w3-creative.de

www.westchat.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Interpretation Guidance for Slow Requests

2016-12-06 Thread Christian Balzer

Hello,

On Tue, 6 Dec 2016 20:58:52 +0100 Christian Theune wrote:

> Hi,
> 
> > On 6 Dec 2016, at 04:42, Christian Balzer  wrote:
> > Jewel issues, like the most recent one with scrub sending OSDs to
> > neverland.
> 
> Alright. We’re postponing this for now. Is that actually a more widespread 
> assumption that Jewel has “prime time” issues?
>
You're asking the people here who are running into issues or patching stuff.

The upgrade is far more complex (ownership changes) and has more potential for
things to go wrong than I remember with any other release in the last 3 years.
Stuff like upgrading OSDs before MONs upsets the muscle memory of all
long-time operators.

The new defaults are likely to break things unless you have prepared for them
and/or have all your clients upgraded at the same time (mission impossible
for anybody running long-term VMs).

Loads of things changed, some massively (like cache tiering), with poor or
no documentation (other than source code and obscure changelog entries).
 
I'm pondering letting my one cluster/unit die of natural causes
while still running Hammer, once the HW is depreciated in 3 years.

>  We started adding pure-SSD OSDs in the last days (based on MICRON
> >>> S610DC-3840) and the slow requests we’ve seen in the past have started
> >>> to show a different pattern.
>  
> >>> I looked in the archives and can't find a full description of your
> >>> cluster hardware in any posts from you or the other Christian (hint,
> >>> hint). Slow requests can nearly all the time being traced back to HW
> >>> issues/limitations.
> >> 
> >> We’re currently running on 42% Christians. ;)
> >> 
> >> We currently have hosts of 2 generations in our cluster. They’re both
> >> SuperMicro, sold by Thomas Krenn.
> >> 
> >> Type 1 (4 Hosts)
> >> 
> >>SuperMicro X9DR3-F
> >>64 GiB RAM
> >>2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> > 
> > A bit on the weak side, if you'd be doing lots of small IOPS, as you found
> > out later. Not likely a real problem here, though.
> 
> We do have a good number of cores, so I wonder whether increasing the op 
> thread count would help? However, the Intel pure SSD pool doesn’t show that 
> issue at all.
>
Nope, as I said, CPUs aren't your problem.

> > 
> >>LSI MegaRAID SAS 9271-8i
> > And as you already found out, not a good match for Ceph (unless you actual
> > do RAIDs, like me with Areca)
> > Also LSI tends to have issues unless you're on the latest drivers
> > (firmware and kernel side)
> 
> Yeah. I was leaning towards using the Adaptec HBA, but currently I’m leaning 
> for the “devil we know”. Care to weigh in?
> 
Google. ^o^
http://www.spinics.net/lists/ceph-users/msg24370.html

I've had a comparable (according to specs) Adaptec controller
(HBA) performing abysmally (40% slower) compared to the LSI equivalent.

> Adaptec/LSI are our vendor provided choices. May I ask what your setup is 
> regarding RAID/Areca?
> 
Also discussed here plenty of times.
4GB HW cache, OSDs are 4-disk RAID10s, thus a replica count of 2 (effectively 4).
Special use case, probably not for a general audience.

> >>Dual 10-Gigabit X540-AT2
> >>OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
> >>Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those
> >> should be DCP3700, maybe my inventory is wrong)
> >> 
> > That needs clarification, which ones are they now?
> 
> I noticed we may have a mixed setup. I’ll need to go through this on our 
> machines in detail. I’ll follow up on that.
> 
> >>Pool “rbd.ssd”
> >> 
> >>2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
> >> 
> > You're clearly a much braver man than me or have a very light load (but at
> > least the 3510's have 1 DWPD endurance unlike the 0.3 of 3500s).  
> > Please clarify, journals inline (as you suggest below) or also on the NVME?
> 
> Journals inline.
> 
Good.
And I brain-farted: I was thinking of the 3520s; the 3510s still have 0.3
DWPD, so 0.15 DWPD at best after the journal double-writes.
Danger, Will Robinson!

> >>Pool “rbd.hdd”
> >> 
> >>5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
> >>1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> > Same here, journal inline or also on the NVME?
> 
> Journals inline. I’m considering to run an experiment with moving the journal 
> to NVME, but: I ran iostat on the raw device as well as the mapped LVM 
> volumes for OSD/Journal and did not see high wait times on the journal, but 
> on the OSD. Also, when I tried Sebastian’s fio setup (on the OSDs file 
> system, not on the raw device as that’s in use right now) I got reasonable 
> numbers. However, this might be a stupid test as I read that XFS might be 
> ignoring the DSYNC request from fio.
> 

As I wrote below, I'm not a fan of LVM in these use cases.
How do you know that the high waits on the "OSD" part of the disk aren't
caused by the SSD melting from the "fast" SYNC writes to the journal part?

Definitely move the journal to the NVMe; that will settle this question.
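From memory, the journal move for a filestore OSD goes roughly like this (IDs
and device paths are placeholders; double-check against your init system
before running anything):

ceph osd set noout
systemctl stop ceph-osd@<id>          # or the sysvinit/upstart equivalent on Hammer
ceph-osd -i <id> --flush-journal
ln -sf /dev/nvme0n1p<X> /var/lib/ceph/osd/ceph-<id>/journal   # new NVMe partition
ceph-osd -i <id> --mkjournal
systemctl start ceph-osd@<id>
ceph osd unset noout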

[ceph-users] best radosgw performance ?

2016-12-06 Thread Z Will
Hi:
I want to know the best radosgw performance achievable in practice right now.
What is the best write bandwidth / IOPS? I have 10 concurrent PUTs of files
of different sizes, and for small files (< 100k) I would like the response
time to be at the millisecond level. We have a Ceph cluster with 45 hosts and
540 OSDs. What should I do, and how many radosgw instances should I use?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Interpretation Guidance for Slow Requests

2016-12-06 Thread Christian Balzer

Hello,

On Tue, 6 Dec 2016 11:14:59 -0600 Reed Dier wrote:

> 
> > On Dec 5, 2016, at 9:42 PM, Christian Balzer  wrote:
> > 
> > 
> > Hello,
> > 
> > On Tue, 6 Dec 2016 03:37:32 +0100 Christian Theune wrote:
> > 
> >> Hi Christian (heh),
> >> 
> >> thanks for picking this up. :)
> >> 
> >> This has become a rather long post as I added more details and giving
> >> our history, but if we make progress then maybe this can help others in
> >> the future. I find slow requests extremely hard to debug and as I said:
> >> aside from scratching my own itch, I’d be happy to help future
> >> travellers.
> >> 
> >>> On 6 Dec 2016, at 00:59, Christian Balzer  wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:
> >>> 
>  Hi,
>  
>  we’re currently expanding our cluster to grow the number of IOPS we
>  can provide to clients. We’re still on Hammer but in the process of
>  upgrading to Jewel. 
> >>> 
> >>> You might want to wait until the next Jewel release, given the current
> >>> number of issues.
> >> 
> >> Our issues or Jewel’s? (This is only half a joke, a friend’s Ceph
> >> cluster’s issues on Jewel is making me quite nervous and I’m
> >> re-evaluating postponing already.)
> >> 
> > Jewel issues, like the most recent one with scrub sending OSDs to
> > neverland.
> > 
>  We started adding pure-SSD OSDs in the last days (based on MICRON
> >>> S610DC-3840) and the slow requests we’ve seen in the past have started
> >>> to show a different pattern.
>  
> >>> I looked in the archives and can't find a full description of your
> >>> cluster hardware in any posts from you or the other Christian (hint,
> >>> hint). Slow requests can nearly all the time being traced back to HW
> >>> issues/limitations.
> >> 
> >> We’re currently running on 42% Christians. ;)
> >> 
> >> We currently have hosts of 2 generations in our cluster. They’re both
> >> SuperMicro, sold by Thomas Krenn.
> >> 
> >> Type 1 (4 Hosts)
> >> 
> >>SuperMicro X9DR3-F
> >>64 GiB RAM
> >>2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> > 
> > A bit on the weak side, if you'd be doing lots of small IOPS, as you found
> > out later. Not likely a real problem here, though.
> > 
> >>LSI MegaRAID SAS 9271-8i
> > And as you already found out, not a good match for Ceph (unless you actual
> > do RAIDs, like me with Areca)
> > Also LSI tends to have issues unless you're on the latest drivers
> > (firmware and kernel side)
> 
> Currently we have our OSDs behind an LSI 3108 RoC, in individual RAID 0 virtual 
> disks. Curious to see whether this looks to be an issue in this case, as we also 
> see slow requests, though we journal completely on-disk; we are evaluating 
> NVMe journaling, possibly as soon as today.
>

I doubt it, the ones I'm thinking about should show up noticeably in kernel
logs and dmesg.
The caches on a RAID controller can very much help (like the native Ceph
journal), until the point where they're overwhelmed and things drop to the
native level of performance.
But with on-disk journals you're looking at degradation due to the double
writes anyway, so that would be my suspect for the slow ops.

Again, atop should answer that question pretty conclusively. 


> While not ideal, they are BBU backed, which is the result of being burned by a 
> power outage during the PoC stage, when we used on-device write cache and 
> corrupted the leveldb beyond repair.
> So while an HBA would be ‘ideal’, the BBU-backed RAID has added a level of 
> resiliency that could be helpful to some (verified with further destructive 
> testing with disk cache both enabled and disabled to confirm reproducibility).
> 
Some controllers (supposedly) do HBA (IT) mode while still utilizing the HW
cache RAM; you might want to google around for that.
Areca does, but admittedly I've never run any in that mode.

> > 
> >>Dual 10-Gigabit X540-AT2
> >>OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
> >>Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those
> >> should be DCP3700, maybe my inventory is wrong)
> >> 
> > That needs clarification, which ones are they now?
> > 
> >>Pool “rbd.ssd”
> >> 
> >>2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
> >> 
> > You're clearly a much braver man than me or have a very light load (but at
> > least the 3510's have 1 DWPD endurance unlike the 0.3 of 3500s).  
> > Please clarify, journals inline (as you suggest below) or also on the NVME?
> > 
> >>Pool “rbd.hdd”
> >> 
> >>5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
> >>1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> > Same here, journal inline or also on the NVME?
> > 
> >> 
> >> Type 2 (2 Hosts)
> >> 
> >>SuperMicro X9SRL-F
> >>32 GiB RAM
> >>2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
> >>LSI MegaRAID SAS 9260-4i
> >>Dual 10-Gigabit X540-AT2
> >>OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
> >>Journal 

Re: [ceph-users] Hello Jason, Could you help to have a look at this RBD segmentation fault?

2016-12-06 Thread Jason Dillaman
CCing in ceph-users:

That is a pretty old version of fio and I know a couple rbd-related
bugs / crashes have been fixed since fio 2.2.8. Can you retry using a
more up-to-date version of fio?
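Something along these lines should do it (a rough sketch; it assumes the
librbd development headers are installed so the rbd engine gets built):

git clone https://github.com/axboe/fio.git
cd fio
./configure        # the configure summary should report rbd engine support
make
./fio --version
./fio fio.conf6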

On Tue, Dec 6, 2016 at 2:40 AM,   wrote:
> Hello Jason,
>
> I'm from ZTE corporation, and we are using ceph as the backend storage of
> OpenStack.
>
> Sorry to bother you, but we are stuck on the following crash when I run the
> fio tool to test an rbd image.
>
> Could you help to take a look?  thanks.
>
>
> ---
>
> [Thread debugging using libthread_db enabled]
>
> Using host libthread_db library "/lib64/libthread_db.so.1".
>
> Core was generated by `fio fio.conf6'.
>
> Program terminated with signal 11, Segmentation fault.
>
> #0  0x7ff841d6981e in Mutex::Lock (this=0x7ff818060800,
> no_lockdep=<optimized out>) at common/Mutex.cc:97
>
> 97        if (logger && cct && cct->_conf->mutex_perf_counter) {
>
> Missing separate debuginfos, use: debuginfo-install fio-2.2.8-2.el7.x86_64
>
> (gdb) bt
>
> #0  0x7ff841d6981e in Mutex::Lock (this=0x7ff818060800,
> no_lockdep=<optimized out>) at common/Mutex.cc:97
>
> #1  0x7ff841b981f0 in librbd::AioCompletion::get_return_value
> (this=0x7ff818060800) at librbd/AioCompletion.cc:217
>
> #2  0x00452174 in _fio_rbd_finish_aiocb ()
>
> #3  0x7ff841b97b1e in librbd::AioCompletion::complete
> (this=this@entry=0x7ff818079570) at librbd/AioCompletion.cc:107
>
> #4  0x7ff841b99027 in librbd::AioCompletion::complete_request
> (this=0x7ff818079570, r=<optimized out>)
>
> at librbd/AioCompletion.cc:193
>
> #5  0x7ff841b962c9 in Context::complete (this=0x7ff7f1375130,
> r=<optimized out>) at include/Context.h:64
>
> #6  0x7ff841be23e7 in delete_me (this=0x7ff7f379d870) at
> include/Context.h:271
>
> #7  C_GatherBase<Context, Context>::sub_finish (this=0x7ff7f379d870,
> sub=0x7ff7f01cdb10, r=0) at include/Context.h:266
>
> #8  0x7ff841be2502 in C_GatherBase<Context,
> Context>::C_GatherSub::finish (this=0x7ff7f01cdb10, r=<optimized out>)
>
> at include/Context.h:289
>
> #9  0x7ff841bd29f9 in complete (r=<optimized out>, this=0x7ff7f01cdb10)
> at include/Context.h:64
>
> #10 C_GatherBase<Context, Context>::C_GatherSub::complete
> (this=0x7ff7f01cdb10, r=<optimized out>) at include/Context.h:286
>
> Python Exception <type 'exceptions.IndexError'> list index out of range:
>
> #11 0x7ff8420193cc in finish_contexts<Context> (cct=0x7ff818006d50,
> finished=empty std::list, result=result@entry=0)
>
> at include/Context.h:120
>
> #12 0x7ff84200d6be in ObjectCacher::bh_write_commit
> (this=0x7ff7ec0010d0, poolid=<optimized out>, oid=...,
>
> ranges=std::vector of length 1, capacity 1 = {...}, tid=tid@entry=59669,
> r=r@entry=0) at osdc/ObjectCacher.cc:1117
>
> #13 0x7ff84201ba68 in ObjectCacher::C_WriteCommit::finish
> (this=0x7ff7f13de770, r=0) at osdc/ObjectCacher.h:613
>
> #14 0x7ff841b962c9 in Context::complete (this=0x7ff7f13de770,
> r=<optimized out>) at include/Context.h:64
>
> #15 0x7ff841c2065e in librbd::LibrbdWriteback::complete_writes
> (this=0x7ff7ec0077f0, oid=...) at librbd/LibrbdWriteback.cc:351
>
> #16 0x7ff841c222f4 in librbd::C_OrderedWrite::finish
> (this=0x7ff7f00011f0, r=0) at librbd/LibrbdWriteback.cc:87
>
> #17 0x7ff841b962c9 in Context::complete (this=0x7ff7f00011f0,
> r=<optimized out>) at include/Context.h:64
>
> #18 0x7ff841ba87f7 in
> librbd::AioObjectRequest<librbd::ImageCtx>::complete (this=0x7ff7f3e149e0,
> r=0)
>
> at librbd/AioObjectRequest.cc:97
>
> #19 0x7ff838439d8d in librados::C_AioSafe::finish (this=0x7ff7f4003340,
> r=<optimized out>) at librados/AioCompletionImpl.h:199
>
> #20 0x7ff838421c89 in Context::complete (this=0x7ff7f4003340,
> r=<optimized out>) at include/Context.h:64
>
> #21 0x7ff838513996 in Finisher::finisher_thread_entry
> (this=0x7ff81800e6f0) at common/Finisher.cc:68
>
> #22 0x7ff837650df3 in start_thread () from /lib64/libpthread.so.0
>
> #23 0x7ff83717a54d in clone () from /lib64/libc.so.6
>
> (gdb) p cct
>
> $1 = (CephContext *) 0x1a5869000
>
> (gdb) p cct->_conf
>
> Cannot access memory at address 0x1a5869008
>
> (gdb)
>
> 
>
>
> the ceph version is Jewel 10.2.3.1 and my fio.conf6 is as below:
>
> 
>
> [global]
>
> direct=1
>
> time_based
>
> thread
>
> group_reporting
>
>
> [fiorbd]
>
> rw=randwrite  #read、write、randread、randwrite
>
> bs=4k #4k、512k
>
> iodepth=32
>
> #ramp_time=30
>
> runtime=300
>
> ioengine=rbd
>
> clientname=admin
>
> pool=volumes
>
> rbdname=img6
>
> numjobs=1
>
> size=10240M
>
> -



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Interpretation Guidance for Slow Requests

2016-12-06 Thread Reed Dier

> On Dec 5, 2016, at 9:42 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Tue, 6 Dec 2016 03:37:32 +0100 Christian Theune wrote:
> 
>> Hi Christian (heh),
>> 
>> thanks for picking this up. :)
>> 
>> This has become a rather long post as I added more details and giving
>> our history, but if we make progress then maybe this can help others in
>> the future. I find slow requests extremely hard to debug and as I said:
>> aside from scratching my own itch, I’d be happy to help future
>> travellers.
>> 
>>> On 6 Dec 2016, at 00:59, Christian Balzer  wrote:
>>> 
>>> Hello,
>>> 
>>> On Mon, 5 Dec 2016 15:25:37 +0100 Christian Theune wrote:
>>> 
 Hi,
 
 we’re currently expanding our cluster to grow the number of IOPS we
 can provide to clients. We’re still on Hammer but in the process of
 upgrading to Jewel. 
>>> 
>>> You might want to wait until the next Jewel release, given the current
>>> number of issues.
>> 
>> Our issues or Jewel’s? (This is only half a joke, a friend’s Ceph
>> cluster’s issues on Jewel is making me quite nervous and I’m
>> re-evaluating postponing already.)
>> 
> Jewel issues, like the most recent one with scrub sending OSDs to
> neverland.
> 
 We started adding pure-SSD OSDs in the last days (based on MICRON
>>> S610DC-3840) and the slow requests we’ve seen in the past have started
>>> to show a different pattern.
 
>>> I looked in the archives and can't find a full description of your
>>> cluster hardware in any posts from you or the other Christian (hint,
>>> hint). Slow requests can nearly all the time being traced back to HW
>>> issues/limitations.
>> 
>> We’re currently running on 42% Christians. ;)
>> 
>> We currently have hosts of 2 generations in our cluster. They’re both
>> SuperMicro, sold by Thomas Krenn.
>> 
>> Type 1 (4 Hosts)
>> 
>>SuperMicro X9DR3-F
>>64 GiB RAM
>>2 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
> 
> A bit on the weak side, if you'd be doing lots of small IOPS, as you found
> out later. Not likely a real problem here, though.
> 
>>LSI MegaRAID SAS 9271-8i
> And as you already found out, not a good match for Ceph (unless you actual
> do RAIDs, like me with Areca)
> Also LSI tends to have issues unless you're on the latest drivers
> (firmware and kernel side)

Currently we have our OSDs behind an LSI 3108 RoC, in individual RAID 0 virtual 
disks. Curious to see whether this looks to be an issue in this case, as we also see 
slow requests, though we journal completely on-disk; we are evaluating 
NVMe journaling, possibly as soon as today.

While not ideal, they are BBU backed, which is the result of being burned by a 
power outage during the PoC stage, when we used on-device write cache and 
corrupted the leveldb beyond repair.
So while an HBA would be ‘ideal’, the BBU-backed RAID has added a level of 
resiliency that could be helpful to some (verified with further destructive 
testing with disk cache both enabled and disabled to confirm reproducibility).
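
(For anyone curious, the relevant MegaCli knobs from memory -- treat these as
a sketch and double-check against the docs for your controller/firmware:)

MegaCli64 -LDSetProp -DisDskCache -LAll -aAll   # disable the on-drive write cache
MegaCli64 -LDSetProp WB -LAll -aAll             # use the BBU-backed write-back cache
MegaCli64 -LDGetProp -DskCache -LAll -aAll      # verify disk cache policy
MegaCli64 -LDGetProp -Cache -LAll -aAll         # verify controller cache policy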

> 
>>Dual 10-Gigabit X540-AT2
>>OS on RAID 1 SATA HGST HUS724020ALS640 1.818 TB 7.2k
>>Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those
>> should be DCP3700, maybe my inventory is wrong)
>> 
> That needs clarification, which ones are they now?
> 
>>Pool “rbd.ssd”
>> 
>>2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
>> 
> You're clearly a much braver man than me or have a very light load (but at
> least the 3510's have 1 DWPD endurance unlike the 0.3 of 3500s).  
> Please clarify, journals inline (as you suggest below) or also on the NVME?
> 
>>Pool “rbd.hdd”
>> 
>>5-6 OSDs on SATA HGST HUS724020ALS640 1.818 TB 7.2k (jbod/raid 0)
>>1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
> Same here, journal inline or also on the NVME?
> 
>> 
>> Type 2 (2 Hosts)
>> 
>>SuperMicro X9SRL-F
>>32 GiB RAM
>>2 Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz
>>LSI MegaRAID SAS 9260-4i
>>Dual 10-Gigabit X540-AT2
>>OS on RAID 1 SATA (HGST HUS724020ALS640 1.818 TB)
>>Journal on 400G Intel MLC NVME PCI-E 3.0 (DC P3600) (I thought those
>> should be DCP3700, maybe my inventory is wrong)
>> 
>>Pool “rbd.ssd”
>> 
>>2 OSDs on 800GB Intel DC S3510 SSDSC2BX80 (jbod/raid 0)
>> 
>>Pool “rbd.hdd”
>> 
>>7 OSDs on HITACHI HUS156060VLS600 558.406 GB 15k (jbod/raid 0)
>>1 OSD on SAS MICRON S610DC-3840 3.492 TB (jbod/raid 0)
>> 
>> Network
>> 
>>10 GE Copper on Brocade VDX 6740-T Switches w/ 20 GBit interconnect
>> Storage Access and Cluster network on 2 physically separated ports
>> Some KVM servers are still running 1 GE interfaces for access, some
>> 10 GE
>> 
>> Software
>> 
>>Ceph: 0.94.7
>>Filesystems: OS on ext4, OSDs on xfs
>>Kernel: 4.4.27 (one host of type 2 still waiting for reboot to
>> upgrade from 4.1.16 in the next days) Qemu: 2.7 with librbd
>> 
>> Overall we’ve been more happy with the second type of hosts (the 15k
>> disks obviously don

Re: [ceph-users] Ceph Blog Articles

2016-12-06 Thread Sascha Vogt
Hi Nick,

m( Of course, you're right. Yes, we have rbd_cache enabled for KVM /
QEMU. That probably also explains the large difference between avg and stdev.
Thanks for the pointer.

Unfortunately I have not yet gotten fio to work with the rbd engine.
It always fails with:

> rbd engine: RBD version: 0.1.9
> rbd_open failed.
> fio_rbd_connect failed.

This happens regardless of whether I set the clustername (to either ceph or
the fsid), and whether I specify the clientname as ceph.client.admin,
client.admin or admin. Any pointer as to what I might be missing here?
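
For completeness, this is roughly what I'm testing against (pool and image
name are placeholders; as far as I understand, the rbd engine expects the
image to exist already):

rbd -n client.admin -p rbd ls
rbd -n client.admin create rbd/fio-test --size 10240   # 10 GiB test image
rbd -n client.admin info rbd/fio-test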

Greetings
-Sascha-

Am 06.12.2016 um 15:49 schrieb Nick Fisk:
> Hi Sascha,
> 
> Have you got any write back caching enabled? That time looks very fast, 
> almost too fast to me. It looks like some of the writes
> completed in around 70us which is almost the same as a single hop of 10G 
> networking, where you would have at least 2 hops
> (Client->OSD1->OSD2).
> 
> What are your write cache settings for qemu?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Blog Articles

2016-12-06 Thread Nick Fisk
Hi Sascha,

Have you got any write-back caching enabled? That time looks very fast, almost 
too fast to me. It looks like some of the writes
completed in around 70us, which is almost the same as a single hop of 10G 
networking, where you would have at least 2 hops
(Client->OSD1->OSD2).

What are your write cache settings for qemu?
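
For reference, the two places I'd look (illustrative snippets only, the names
and values are just examples): the cache mode on the rbd disk in the libvirt
domain XML, and the client-side rbd cache options in ceph.conf:

<driver name='qemu' type='raw' cache='writeback'/>

[client]
rbd cache = true
rbd cache writethrough until flush = true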

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Sascha Vogt
> Sent: 06 December 2016 12:14
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph Blog Articles
> 
> Hi Nick,
> 
> thanks for the parameters. As you were kind enough to share them, I thought 
> I'd share my results. I tested within a virtual machine
> with the kvm rbd driver and used the following command line:
> 
> > fio --name=fio-test --randrepeat=0 --invalidate=0 --rw=write --bs=64k
> > --direct=1 --time_based=1 --runtime=360 --iodepth=1 --numjobs=1
> 
> And got the following results
> 
> >   write: io=126641MB, bw=360224KB/s, iops=5628, runt=360001msec
> > clat (usec): min=74, max=227697, avg=172.33, stdev=661.52
> >  lat (usec): min=75, max=227698, avg=174.31, stdev=661.55
> 
> I find it interesting that my stdev is so much higher than my average.
> Maybe it's due to the cluster setup. We have 2x10 GbE shared between 
> OpenStack data, Ceph client and Ceph data (separated via VLANs), all on a 
> single switch (so no additional hops). Also the pool we are effectively 
> writing to (a flash-based cache pool in front of an HDD pool, but big enough 
> to not experience any flushes / evictions during the test) is a 30 OSD / 15 
> NVMe disk, size 2 pool (journal and data are on the same partition on the 
> NVMes, each one has 4 partitions, so no file-based journal but a raw 
> partition)
> 
> Greetings
> -Sascha-
> 
> Am 05.12.2016 um 17:16 schrieb Nick Fisk:
> > Hi Sascha,
> >
> > Here is what I used
> >
> > [global]
> > ioengine=rbd
> > randrepeat=0
> > clientname=admin
> > pool=
> > rbdname=test
> > invalidate=0# mandatory
> > rw=write
> > bs=64k
> > direct=1
> > time_based=1
> > runtime=360
> > numjobs=1
> >
> > [rbd_iodepth1]
> > iodepth=1
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Sascha Vogt
> >> Sent: 05 December 2016 14:08
> >> To: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph Blog Articles
> >>
> >> Hi Nick,
> >>
> >> thanks for sharing your results. Would you be able to share the fio
> >> args you used for benchmarking (especially the ones for the screenshot you 
> >> shared in the write latency post)?
> >>
> >> What I found is that when I do some 4k write benchmarks my lat stdev
> >> is much higher then the average (also wider range for min vs max than). So 
> >> I wondered if it's my parameters or the cluster.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph recovery stuck

2016-12-06 Thread Ben Erridge
We are using Ceph 0.80.9 and we recently recovered from a power outage which
caused some data loss. We had the replica count set to 1. Since then we have
installed another node with the idea that we would change the replica count to 3.

We tried to change one of the pools to replica 3, but recovery always gets stuck.
It's been at the same degraded level for over 12 hours now.


ceph -w
cluster 19bc15fd-c0cc-4f35-acd2-292a86fbcf7d
 health HEALTH_WARN 64 pgs degraded; 64 pgs stuck unclean; recovery
10223/1636773 objects degraded (0.625%)
 monmap e3: 3 mons at {node-3=192.168.0.13:6789/0,
node-4=192.168.0.7:6789/0,node-5=192.168.0.8:6789/0}, election epoch
150454, quorum 0,1,2 node-4,node-5,node-3
 osdmap e914: 7 osds: 7 up, 7 in
  pgmap v3507206: 960 pgs, 15 pools, 9469 GB data, 1578 kobjects
9587 GB used, 57388 GB / 66976 GB avail
10223/1636773 objects degraded (0.625%)
 896 active+clean
  64 active+degraded


Any idea on what's going on or how we can get the process to resume?

-- 
-.
Ben Erridge
Center For Information Management, Inc.
(734) 930-0855
3550 West Liberty Road Ste 1
Ann Arbor, MI 48103
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stalls caused by scrub on jewel

2016-12-06 Thread Sage Weil
On Tue, 6 Dec 2016, Dan van der Ster wrote:
> Hi Sage,
> 
> Could you please clarify: do we need to set nodeep-scrub also, or does
> this somehow only affect the (shallow) scrub?
> 
> (Note that deep scrubs will start when the deep_scrub_interval has
> passed, even with noscrub set).

Hmm, I thought that 'noscrub' would also stop deep scrubs, but I just 
looked at the code and I was wrong.  So you should set nodeep-scrub too!
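
So the full workaround for now is:

 ceph osd set noscrub
 ceph osd set nodeep-scrub

and once the fix is deployed:

 ceph osd unset noscrub
 ceph osd unset nodeep-scrub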

sage


> 
> Cheers, Dan
> 
> 
> On Tue, Nov 15, 2016 at 11:35 PM, Sage Weil  wrote:
> > Hi everyone,
> >
> > There was a regression in jewel that can trigger long OSD stalls during
> > scrub.  How long the stalls are depends on how many objects are in your
> > PGs, how fast your storage device is, and what is cached, but in at least
> > one case they were long enough that the OSD internal heartbeat check
> > failed and it committed suicide (120 seconds).
> >
> > The workaround for now is to simply
> >
> >  ceph osd set noscrub
> >
> > as the bug is only triggered by scrub.  A fix is being tested and will be
> > available shortly.
> >
> > If you've seen any kind of weird latencies or slow requests on jewel, I
> > suggest setting noscrub and seeing if they go away!
> >
> > The tracker bug is
> >
> >  http://tracker.ceph.com/issues/17859
> >
> > Big thanks to Yoann Moulin for helping track this down!
> >
> > sage
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Blog Articles

2016-12-06 Thread Sascha Vogt
Hi Nick,

thanks for the parameters. As you were kind enough to share them, I
thought I'd share my results. I tested within a virtual machine with
the kvm rbd driver and used the following command line:

> fio --name=fio-test --randrepeat=0 --invalidate=0 --rw=write --bs=64k 
> --direct=1 --time_based=1 --runtime=360 --iodepth=1 --numjobs=1

And got the following results

>   write: io=126641MB, bw=360224KB/s, iops=5628, runt=360001msec
> clat (usec): min=74, max=227697, avg=172.33, stdev=661.52
>  lat (usec): min=75, max=227698, avg=174.31, stdev=661.55

I find it interesting that my stdev is so much higher than my average.
Maybe it's due to the cluster setup. We have 2x10 GbE shared between
OpenStack data, Ceph client and Ceph data (separated via VLANs), all on
a single switch (so no additional hops). Also the pool we are
effectively writing to (a flash-based cache pool in front of an HDD pool,
but big enough to not experience any flushes / evictions during the
test) is a 30 OSD / 15 NVMe disk, size 2 pool (journal and data are on
the same partition on the NVMes, each one has 4 partitions, so no
file-based journal but a raw partition).

Greetings
-Sascha-

Am 05.12.2016 um 17:16 schrieb Nick Fisk:
> Hi Sascha,
> 
> Here is what I used
> 
> [global]
> ioengine=rbd
> randrepeat=0
> clientname=admin
> pool=
> rbdname=test
> invalidate=0# mandatory
> rw=write
> bs=64k
> direct=1
> time_based=1
> runtime=360
> numjobs=1
> 
> [rbd_iodepth1]
> iodepth=1
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Sascha Vogt
>> Sent: 05 December 2016 14:08
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph Blog Articles
>>
>> Hi Nick,
>>
>> thanks for sharing your results. Would you be able to share the fio args you 
>> used for benchmarking (especially the ones for the
>> screenshot you shared in the write latency post)?
>>
>> What I found is that when I do some 4k write benchmarks my lat stdev is much 
>> higher then the average (also wider range for min vs
>> max than). So I wondered if it's my parameters or the cluster.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stalls caused by scrub on jewel

2016-12-06 Thread Dan van der Ster
Hi Sage,

Could you please clarify: do we need to set nodeep-scrub also, or does
this somehow only affect the (shallow) scrub?

(Note that deep scrubs will start when the deep_scrub_interval has
passed, even with noscrub set).
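
(If noscrub alone doesn't cover it, one workaround sketch would presumably be
to push the interval out at runtime until the fix lands, e.g.:

 ceph tell osd.* injectargs '--osd_deep_scrub_interval 2419200'   # 4 weeks, in seconds

but I'd like to confirm the intended behaviour first.)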

Cheers, Dan


On Tue, Nov 15, 2016 at 11:35 PM, Sage Weil  wrote:
> Hi everyone,
>
> There was a regression in jewel that can trigger long OSD stalls during
> scrub.  How long the stalls are depends on how many objects are in your
> PGs, how fast your storage device is, and what is cached, but in at least
> one case they were long enough that the OSD internal heartbeat check
> failed and it committed suicide (120 seconds).
>
> The workaround for now is to simply
>
>  ceph osd set noscrub
>
> as the bug is only triggered by scrub.  A fix is being tested and will be
> available shortly.
>
> If you've seen any kind of weird latencies or slow requests on jewel, I
> suggest setting noscrub and seeing if they go away!
>
> The tracker bug is
>
>  http://tracker.ceph.com/issues/17859
>
> Big thanks to Yoann Moulin for helping track this down!
>
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] segfault in ceph-fuse when quota is enabled

2016-12-06 Thread Goncalo Borges
Thanks Dan for your critical eye.
Somehow I did not notice that there was already a tracker for it. 
Cheers
G.

From: Dan van der Ster [d...@vanderster.com]
Sent: 06 December 2016 19:30
To: Goncalo Borges
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] segfault in ceph-fuse when quota is enabled

Hi Goncalo,

That bug is fixed in 10.2.4. See http://tracker.ceph.com/issues/16066

-- Dan



On Tue, Dec 6, 2016 at 5:11 AM, Goncalo Borges
 wrote:
> Hi John, Greg, Zheng
>
> And now a much more relevant problem. Once again, my environment:
>
> - ceph/cephfs in 10.2.2 but patched for
>   o client: add missing client_lock for get_root 
> (https://github.com/ceph/ceph/pull/10027)
>   o Jewel: segfault in ObjectCacher::FlusherThread 
> (http://tracker.ceph.com/issues/16610)
> - All infrastructure is in the same version (rados cluster, mons, mds and 
> cephfs clients).
> - We mount cephfs using ceph-fuse.
>
> Once we enabled quota in the clients (by using --client-quota), and exposed 
> the filesystem to the workload of our users, we got systematic segfaults. We 
> are able to reproduce them every time by asking the user to launch always the 
> same workload . Please note that the segfault systematically happens in the 
> clients where the quota is enabled. We still have a very few where it was not 
> possible to remount cephfs (because they were being heavily used), and in 
> those, no segfaults happen.
>
> Running ceph-fuse in debug mode, and with 'debug client = 20', we got
>
>  0> 2016-12-06 03:45:45.102680 7f40d3fff700 -1 client/Client.cc: In 
> function 'Inode* Client::get_quota_root(Inode*)' thread 7f40d3fff700 time 
> 2016-12-06 03:45:45.101937
> client/Client.cc: 12049: FAILED assert(root_ancestor->qtree == __null)
>
>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x8b) [0x7f410845671b]
>  2: (Client::get_quota_root(Inode*)+0x7c5) [0x7f4108291315]
>  3: (Client::check_quota_condition(Inode*, std::function<bool (Inode const&)>)+0x3d) [0x7f410829193d]
>  4: (Client::is_quota_bytes_exceeded(Inode*, long)+0x6e) [0x7f4108291ade]
>  5: (Client::_write(Fh*, long, unsigned long, char const*, iovec const*, 
> int)+0xce3) [0x7f41082a7323]
>  6: (Client::ll_write(Fh*, long, long, char const*)+0x94) [0x7f41082a87b4]
>  7: (()+0x197b46) [0x7f4108262b46]
>  8: (()+0x15294) [0x7f4107c82294]
>  9: (()+0x15b76) [0x7f4107c82b76]
>  10: (()+0x12aa9) [0x7f4107c7faa9]
>  11: (()+0x3db6a07aa1) [0x7f4106db8aa1]
>  12: (clone()+0x6d) [0x7f4106046aad]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
> Just opened a new tracker: http://tracker.ceph.com/issues/18152
>
> In the meantime, we will remove the --client-quota option.
>
> Cheers
> Goncalo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep-scrub cron job

2016-12-06 Thread Eugen Block

Hi,

I found my problem: it's the cron job. It started the script every
minute of the given hour, not just once as I intended, so I guess
this simply led to conflicts while searching for the oldest PGs or
scrubbing them. I'm not sure yet what the message exactly means, but
I corrected the cron job and there was no such message in my inbox, so
I hope this is resolved.
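
For the archives, the fix was simply pinning the minute field that I had left
as "*":

# before (ran every minute of the hour):
# * 20 * * 1,3,5 root /usr/local/bin/ceph_deep_scrub.sh 4
# after (runs once at 20:00 resp. 15:00):
0 20 * * 1,3,5 root /usr/local/bin/ceph_deep_scrub.sh 4
0 15 * * 0 root /usr/local/bin/ceph_deep_scrub.sh 4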



Zitat von Eugen Block :


Hi list,

I use the script from [1] to control the deep-scrubs myself in a  
cronjob. It seems to work fine, I get the "finished batch" message  
in /var/log/messages, but in every run I get an email from cron  
daemon with at least one line saying:


2016-11-30 21:40:59.271854 7f3d5700  0 monclient:  
handle_get_version_reply version request with handle 1 not found


These messages only differ in timestamp (of course) and the value  
after the timestamp, here are all messages I received so far:


---cut here---
2016-11-25 20:40:37.980014 7f64b149e700  0 monclient:  
handle_get_version_reply version request with handle 1 not found
2016-11-25 21:27:36.807596 7fb052581700  0 monclient:  
handle_get_version_reply version request with handle 1 not found
2016-11-27 16:09:46.216040 7fc965124700  0 monclient:  
handle_get_version_reply version request with handle 1 not found
2016-11-27 18:37:14.550263 7f6b4aabb700  0 monclient:  
handle_get_version_reply version request with handle 1 not found
2016-11-27 18:43:11.745695 7f90d1bd7700  0 monclient:  
handle_get_version_reply version request with handle 1 not found
2016-11-28 21:25:12.888939 7efc777fe700  0 monclient:  
handle_get_version_reply version request with handle 1 not found
2016-11-30 21:40:59.271854 7f3d5700  0 monclient:  
handle_get_version_reply version request with handle 1 not found

---cut here---

This is the cronjob:

* 20 * * 1,3,5 root /usr/local/bin/ceph_deep_scrub.sh 4
* 15 * * 0 root /usr/local/bin/ceph_deep_scrub.sh 4

It's supposed to run 4 times a week, scrubbing 1/4 of all PGs in each run.

The cronjob ran 4 times since last Friday.

I thought maybe this has to do with the "version" value in pg stats,  
so I dumped all PGs and their version value, but this doesn't seem  
to be the right place to look.


Can anybody explain what it's supposed to mean? Is there any action  
required to get rid of that?
I'm not able to reproduce that output, if I run the scipt manually  
there are no such messages.


Any help is appreciated!

Thanks,
Eugen

[1] http://prob6.com/en/ceph-pg-deep-scrub-cron/


--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983




--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] segfault in ceph-fuse when quota is enabled

2016-12-06 Thread Dan van der Ster
Hi Goncalo,

That bug is fixed in 10.2.4. See http://tracker.ceph.com/issues/16066

-- Dan



On Tue, Dec 6, 2016 at 5:11 AM, Goncalo Borges
 wrote:
> Hi John, Greg, Zheng
>
> And now a much more relevant problem. Once again, my environment:
>
> - ceph/cephfs in 10.2.2 but patched for
>   o client: add missing client_lock for get_root 
> (https://github.com/ceph/ceph/pull/10027)
>   o Jewel: segfault in ObjectCacher::FlusherThread 
> (http://tracker.ceph.com/issues/16610)
> - All infrastructure is in the same version (rados cluster, mons, mds and 
> cephfs clients).
> - We mount cephfs using ceph-fuse.
>
> Once we enabled quota in the clients (by using --client-quota), and exposed 
> the filesystem to the workload of our users, we got systematic segfaults. We 
> are able to reproduce them every time by asking the user to launch always the 
> same workload . Please note that the segfault systematically happens in the 
> clients where the quota is enabled. We still have a very few where it was not 
> possible to remount cephfs (because they were being heavily used), and in 
> those, no segfaults happen.
>
> Running ceph-fuse in debug mode, and with 'debug client = 20', we got
>
>  0> 2016-12-06 03:45:45.102680 7f40d3fff700 -1 client/Client.cc: In 
> function 'Inode* Client::get_quota_root(Inode*)' thread 7f40d3fff700 time 
> 2016-12-06 03:45:45.101937
> client/Client.cc: 12049: FAILED assert(root_ancestor->qtree == __null)
>
>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x8b) [0x7f410845671b]
>  2: (Client::get_quota_root(Inode*)+0x7c5) [0x7f4108291315]
>  3: (Client::check_quota_condition(Inode*, std::function<bool (Inode const&)>)+0x3d) [0x7f410829193d]
>  4: (Client::is_quota_bytes_exceeded(Inode*, long)+0x6e) [0x7f4108291ade]
>  5: (Client::_write(Fh*, long, unsigned long, char const*, iovec const*, 
> int)+0xce3) [0x7f41082a7323]
>  6: (Client::ll_write(Fh*, long, long, char const*)+0x94) [0x7f41082a87b4]
>  7: (()+0x197b46) [0x7f4108262b46]
>  8: (()+0x15294) [0x7f4107c82294]
>  9: (()+0x15b76) [0x7f4107c82b76]
>  10: (()+0x12aa9) [0x7f4107c7faa9]
>  11: (()+0x3db6a07aa1) [0x7f4106db8aa1]
>  12: (clone()+0x6d) [0x7f4106046aad]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
> Just opened a new tracker: http://tracker.ceph.com/issues/18152
>
> In the meantime, we will remove the --client-quota option.
>
> Cheers
> Goncalo
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com