Re: [ceph-users] CephFS read IO caching, where is it happening?

2017-02-07 Thread Shinobu Kinjo
On Wed, Feb 8, 2017 at 3:05 PM, Ahmed Khuraidah  wrote:

> Hi Shinobu, I am using SUSE packages from their latest SUSE
> Enterprise Storage 4 and following its documentation (method of deployment:
> ceph-deploy).
> But I was also able to reproduce this issue on Ubuntu 14.04 with the
> upstream Ceph repositories (also latest Jewel and ceph-deploy).
>

The community Ceph packages are running on the Ubuntu box, right?
If so, please run `ceph -v` on the Ubuntu box.

Also, please share the same failure output as you saw it on the SUSE box.


>
> On Wed, Feb 8, 2017 at 3:03 AM, Shinobu Kinjo  wrote:
>
>> Are you using opensource Ceph packages or suse ones?
>>
>> On Sat, Feb 4, 2017 at 3:54 PM, Ahmed Khuraidah 
>> wrote:
>>
>>> I have opened a ticket on http://tracker.ceph.com/
>>>
>>> http://tracker.ceph.com/issues/18816
>>>
>>>
>>> My client and server kernels are the same, here is info:
>>> # lsb_release -a
>>> LSB Version:n/a
>>> Distributor ID: SUSE
>>> Description:SUSE Linux Enterprise Server 12 SP2
>>> Release:12.2
>>> Codename:   n/a
>>> # uname -a
>>> Linux cephnode 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016
>>> (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>> Thanks
>>>
>>> On Fri, Feb 3, 2017 at 1:59 PM, John Spray  wrote:
>>>
 On Fri, Feb 3, 2017 at 8:07 AM, Ahmed Khuraidah 
 wrote:
 > Thank you guys,
 >
 > I tried to add the option "exec_prerun=echo 3 > /proc/sys/vm/drop_caches"
 > as well as "exec_prerun=echo 3 | sudo tee /proc/sys/vm/drop_caches", but
 > although FIO reports that the command was executed, there is no change.
 >
 > But I caught another very strange behavior. If I run my FIO test
 > (speaking about the 3G file case) twice, the first run creates my file and
 > prints a lot of IOPS as already described; but if I drop caches before the
 > second run (as root: echo 3 > /proc/sys/vm/drop_caches) I end up with a
 > broken MDS:
 >
 > --- begin dump of recent events ---
 >  0> 2017-02-03 02:34:41.974639 7f7e8ec5e700 -1 *** Caught signal
 > (Aborted) **
 >  in thread 7f7e8ec5e700 thread_name:ms_dispatch
 >
 >  ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318e
 d0dcedc8734)
 >  1: (()+0x5142a2) [0x557c51e092a2]
 >  2: (()+0x10b00) [0x7f7e95df2b00]
 >  3: (gsignal()+0x37) [0x7f7e93ccb8d7]
 >  4: (abort()+0x13a) [0x7f7e93aa]
 >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 > const*)+0x265) [0x557c51f133d5]
 >  6: (MutationImpl::~MutationImpl()+0x28e) [0x557c51bb9e1e]
 >  7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_relea
 se()+0x39)
 > [0x557c51b2ccf9]
 >  8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned
 long, bool,
 > unsigned long, utime_t)+0x9a7) [0x557c51ca2757]
 >  9: (Locker::remove_client_cap(CInode*, client_t)+0xb1)
 [0x557c51ca38f1]
 >  10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long,
 unsigned
 > int, unsigned int)+0x90d) [0x557c51ca424d]
 >  11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1cc)
 > [0x557c51ca449c]
 >  12: (MDSRank::handle_deferrable_message(Message*)+0xc1c)
 [0x557c51b33d3c]
 >  13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x557c51b3c991]
 >  14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x557c51b3dae5]
 >  15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557c51b25703]
 >  16: (DispatchQueue::entry()+0x78b) [0x557c5200d06b]
 >  17: (DispatchQueue::DispatchThread::entry()+0xd) [0x557c51ee5dcd]
 >  18: (()+0x8734) [0x7f7e95dea734]
 >  19: (clone()+0x6d) [0x7f7e93d80d3d]
 >  NOTE: a copy of the executable, or `objdump -rdS ` is
 needed to
 > interpret this.

 Oops!  Please could you open a ticket on tracker.ceph.com, with this
 backtrace, the client versions, any non-default config settings, and
 the series of operations that led up to it.

 Thanks,
 John

 > "
 >
 > On Thu, Feb 2, 2017 at 9:30 PM, Shinobu Kinjo 
 wrote:
 >>
 >> You may want to add this in your FIO recipe.
 >>
 >>  * exec_prerun=echo 3 > /proc/sys/vm/drop_caches
 >>
 >> Regards,
 >>
 >> On Fri, Feb 3, 2017 at 12:36 AM, Wido den Hollander 
 wrote:
 >> >
 >> >> Op 2 februari 2017 om 15:35 schreef Ahmed Khuraidah
 >> >> :
 >> >>
 >> >>
 >> >> Hi all,
 >> >>
 >> >> I am still confused about my CephFS sandbox.
 >> >>
 >> >> When I am performing simple FIO test into single file with size
 of 3G I
 >> >> have too many IOps:
 >> >>
 >> >> cephnode:~ # fio payloadrandread64k3G
 >> >> test: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K,
 ioengine=libaio,
 >> >> iodepth=2
 >> >> fio-2.13
 >> >> Starting 1 process
 >> >> test: Laying out IO file(s) (1 file(s) / 3072MB)
 >> >> Jobs: 1 (f=1): [r(1

Re: [ceph-users] CephFS read IO caching, where is it happening?

2017-02-07 Thread Ahmed Khuraidah
Hi Shinobu, I am using SUSE packages from their latest SUSE
Enterprise Storage 4 and following its documentation (method of deployment:
ceph-deploy).
But I was also able to reproduce this issue on Ubuntu 14.04 with the upstream
Ceph repositories (also latest Jewel and ceph-deploy).
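For reference, a minimal FIO job file matching the numbers quoted later in
this thread (randread, 64k blocks, 3G file, libaio, iodepth=2, plus the
drop_caches prerun that was suggested) would look roughly like this; the
directory line is only a placeholder for wherever CephFS is mounted:

[test]
rw=randread
bs=64k
size=3G
ioengine=libaio
iodepth=2
# placeholder: point this at the CephFS mount being tested
directory=/mnt/cephfs
exec_prerun=echo 3 > /proc/sys/vm/drop_caches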

On Wed, Feb 8, 2017 at 3:03 AM, Shinobu Kinjo  wrote:

> Are you using opensource Ceph packages or suse ones?
>
> On Sat, Feb 4, 2017 at 3:54 PM, Ahmed Khuraidah 
> wrote:
>
>> I have opened a ticket on http://tracker.ceph.com/
>>
>> http://tracker.ceph.com/issues/18816
>>
>>
>> My client and server kernels are the same, here is info:
>> # lsb_release -a
>> LSB Version:n/a
>> Distributor ID: SUSE
>> Description:SUSE Linux Enterprise Server 12 SP2
>> Release:12.2
>> Codename:   n/a
>> # uname -a
>> Linux cephnode 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016
>> (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>> Thanks
>>
>> On Fri, Feb 3, 2017 at 1:59 PM, John Spray  wrote:
>>
>>> On Fri, Feb 3, 2017 at 8:07 AM, Ahmed Khuraidah 
>>> wrote:
>>> > Thank you guys,
>>> >
>>> > I tried to add the option "exec_prerun=echo 3 > /proc/sys/vm/drop_caches"
>>> > as well as "exec_prerun=echo 3 | sudo tee /proc/sys/vm/drop_caches", but
>>> > although FIO reports that the command was executed, there is no change.
>>> >
>>> > But I caught another very strange behavior. If I run my FIO test
>>> > (speaking about the 3G file case) twice, the first run creates my file and
>>> > prints a lot of IOPS as already described; but if I drop caches before the
>>> > second run (as root: echo 3 > /proc/sys/vm/drop_caches) I end up with a
>>> > broken MDS:
>>> >
>>> > --- begin dump of recent events ---
>>> >  0> 2017-02-03 02:34:41.974639 7f7e8ec5e700 -1 *** Caught signal
>>> > (Aborted) **
>>> >  in thread 7f7e8ec5e700 thread_name:ms_dispatch
>>> >
>>> >  ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318e
>>> d0dcedc8734)
>>> >  1: (()+0x5142a2) [0x557c51e092a2]
>>> >  2: (()+0x10b00) [0x7f7e95df2b00]
>>> >  3: (gsignal()+0x37) [0x7f7e93ccb8d7]
>>> >  4: (abort()+0x13a) [0x7f7e93aa]
>>> >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> > const*)+0x265) [0x557c51f133d5]
>>> >  6: (MutationImpl::~MutationImpl()+0x28e) [0x557c51bb9e1e]
>>> >  7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_relea
>>> se()+0x39)
>>> > [0x557c51b2ccf9]
>>> >  8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long,
>>> bool,
>>> > unsigned long, utime_t)+0x9a7) [0x557c51ca2757]
>>> >  9: (Locker::remove_client_cap(CInode*, client_t)+0xb1)
>>> [0x557c51ca38f1]
>>> >  10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long,
>>> unsigned
>>> > int, unsigned int)+0x90d) [0x557c51ca424d]
>>> >  11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1cc)
>>> > [0x557c51ca449c]
>>> >  12: (MDSRank::handle_deferrable_message(Message*)+0xc1c)
>>> [0x557c51b33d3c]
>>> >  13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x557c51b3c991]
>>> >  14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x557c51b3dae5]
>>> >  15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557c51b25703]
>>> >  16: (DispatchQueue::entry()+0x78b) [0x557c5200d06b]
>>> >  17: (DispatchQueue::DispatchThread::entry()+0xd) [0x557c51ee5dcd]
>>> >  18: (()+0x8734) [0x7f7e95dea734]
>>> >  19: (clone()+0x6d) [0x7f7e93d80d3d]
>>> >  NOTE: a copy of the executable, or `objdump -rdS ` is
>>> needed to
>>> > interpret this.
>>>
>>> Oops!  Please could you open a ticket on tracker.ceph.com, with this
>>> backtrace, the client versions, any non-default config settings, and
>>> the series of operations that led up to it.
>>>
>>> Thanks,
>>> John
>>>
>>> > "
>>> >
>>> > On Thu, Feb 2, 2017 at 9:30 PM, Shinobu Kinjo 
>>> wrote:
>>> >>
>>> >> You may want to add this in your FIO recipe.
>>> >>
>>> >>  * exec_prerun=echo 3 > /proc/sys/vm/drop_caches
>>> >>
>>> >> Regards,
>>> >>
>>> >> On Fri, Feb 3, 2017 at 12:36 AM, Wido den Hollander 
>>> wrote:
>>> >> >
>>> >> >> Op 2 februari 2017 om 15:35 schreef Ahmed Khuraidah
>>> >> >> :
>>> >> >>
>>> >> >>
>>> >> >> Hi all,
>>> >> >>
>>> >> >> I am still confused about my CephFS sandbox.
>>> >> >>
>>> >> >> When I am performing simple FIO test into single file with size of
>>> 3G I
>>> >> >> have too many IOps:
>>> >> >>
>>> >> >> cephnode:~ # fio payloadrandread64k3G
>>> >> >> test: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K,
>>> ioengine=libaio,
>>> >> >> iodepth=2
>>> >> >> fio-2.13
>>> >> >> Starting 1 process
>>> >> >> test: Laying out IO file(s) (1 file(s) / 3072MB)
>>> >> >> Jobs: 1 (f=1): [r(1)] [100.0% done] [277.8MB/0KB/0KB /s] [/0/0
>>> >> >> iops]
>>> >> >> [eta 00m:00s]
>>> >> >> test: (groupid=0, jobs=1): err= 0: pid=3714: Thu Feb  2 07:07:01
>>> 2017
>>> >> >>   read : io=3072.0MB, bw=181101KB/s, iops=2829, runt= 17370msec
>>> >> >> slat (usec): min=4, max=386, avg=12.49, stdev= 6.90
>>> >> >> clat (usec): min=202, max=5673.5K, avg=690.81, 

Re: [ceph-users] New mailing list: opensuse-c...@opensuse.org

2017-02-07 Thread Tim Serong
On 02/08/2017 01:36 PM, Tim Serong wrote:
> Hi All,
> 
> We've just created a new opensuse-c...@opensuse.org mailing list.  The
> purpose of this list is discussion of Ceph specifically on openSUSE.
> For example, topics such as the following would all be welcome:
> 
> * Maintainership of projects on OBS under
>   https://build.opensuse.org/project/show/filesystems:ceph
> 
> * Feedback on our hopefully helpful wiki page at
>   https://en.opensuse.org/openSUSE:Ceph
> 
> * Issues with Ceph packages shipped with openSUSE Leap and Tumbleweed
> 
> * Ceph deployment tools on openSUSE
> 
> If general or non-distro specific Ceph-related topics come up, we'll
> provide direction back to the appropriate upstream list :-)

Subscription instructions would probably be helpful, too.  If you'd like
to subscribe, send an email to opensuse-ceph+subscr...@opensuse.org

Cheers,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New mailing list: opensuse-c...@opensuse.org

2017-02-07 Thread Tim Serong
Hi All,

We've just created a new opensuse-c...@opensuse.org mailing list.  The
purpose of this list is discussion of Ceph specifically on openSUSE.
For example, topics such as the following would all be welcome:

* Maintainership of projects on OBS under
  https://build.opensuse.org/project/show/filesystems:ceph

* Feedback on our hopefully helpful wiki page at
  https://en.opensuse.org/openSUSE:Ceph

* Issues with Ceph packages shipped with openSUSE Leap and Tumbleweed

* Ceph deployment tools on openSUSE

If general or non-distro specific Ceph-related topics come up, we'll
provide direction back to the appropriate upstream list :-)

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS read IO caching, where is it happening?

2017-02-07 Thread Shinobu Kinjo
Are you using opensource Ceph packages or suse ones?

On Sat, Feb 4, 2017 at 3:54 PM, Ahmed Khuraidah  wrote:

> I have opened a ticket on http://tracker.ceph.com/
>
> http://tracker.ceph.com/issues/18816
>
>
> My client and server kernels are the same, here is info:
> # lsb_release -a
> LSB Version:n/a
> Distributor ID: SUSE
> Description:SUSE Linux Enterprise Server 12 SP2
> Release:12.2
> Codename:   n/a
> # uname -a
> Linux cephnode 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016
> (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
>
>
> Thanks
>
> On Fri, Feb 3, 2017 at 1:59 PM, John Spray  wrote:
>
>> On Fri, Feb 3, 2017 at 8:07 AM, Ahmed Khuraidah 
>> wrote:
>> > Thank you guys,
>> >
>> > I tried to add the option "exec_prerun=echo 3 > /proc/sys/vm/drop_caches"
>> > as well as "exec_prerun=echo 3 | sudo tee /proc/sys/vm/drop_caches", but
>> > although FIO reports that the command was executed, there is no change.
>> >
>> > But I caught another very strange behavior. If I run my FIO test
>> > (speaking about the 3G file case) twice, the first run creates my file and
>> > prints a lot of IOPS as already described; but if I drop caches before the
>> > second run (as root: echo 3 > /proc/sys/vm/drop_caches) I end up with a
>> > broken MDS:
>> >
>> > --- begin dump of recent events ---
>> >  0> 2017-02-03 02:34:41.974639 7f7e8ec5e700 -1 *** Caught signal
>> > (Aborted) **
>> >  in thread 7f7e8ec5e700 thread_name:ms_dispatch
>> >
>> >  ceph version 10.2.4-211-g12b091b (12b091b4a40947aa43919e71a318e
>> d0dcedc8734)
>> >  1: (()+0x5142a2) [0x557c51e092a2]
>> >  2: (()+0x10b00) [0x7f7e95df2b00]
>> >  3: (gsignal()+0x37) [0x7f7e93ccb8d7]
>> >  4: (abort()+0x13a) [0x7f7e93aa]
>> >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> > const*)+0x265) [0x557c51f133d5]
>> >  6: (MutationImpl::~MutationImpl()+0x28e) [0x557c51bb9e1e]
>> >  7: (std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_relea
>> se()+0x39)
>> > [0x557c51b2ccf9]
>> >  8: (Locker::check_inode_max_size(CInode*, bool, bool, unsigned long,
>> bool,
>> > unsigned long, utime_t)+0x9a7) [0x557c51ca2757]
>> >  9: (Locker::remove_client_cap(CInode*, client_t)+0xb1)
>> [0x557c51ca38f1]
>> >  10: (Locker::_do_cap_release(client_t, inodeno_t, unsigned long,
>> unsigned
>> > int, unsigned int)+0x90d) [0x557c51ca424d]
>> >  11: (Locker::handle_client_cap_release(MClientCapRelease*)+0x1cc)
>> > [0x557c51ca449c]
>> >  12: (MDSRank::handle_deferrable_message(Message*)+0xc1c)
>> [0x557c51b33d3c]
>> >  13: (MDSRank::_dispatch(Message*, bool)+0x1e1) [0x557c51b3c991]
>> >  14: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x557c51b3dae5]
>> >  15: (MDSDaemon::ms_dispatch(Message*)+0xc3) [0x557c51b25703]
>> >  16: (DispatchQueue::entry()+0x78b) [0x557c5200d06b]
>> >  17: (DispatchQueue::DispatchThread::entry()+0xd) [0x557c51ee5dcd]
>> >  18: (()+0x8734) [0x7f7e95dea734]
>> >  19: (clone()+0x6d) [0x7f7e93d80d3d]
>> >  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to
>> > interpret this.
>>
>> Oops!  Please could you open a ticket on tracker.ceph.com, with this
>> backtrace, the client versions, any non-default config settings, and
>> the series of operations that led up to it.
>>
>> Thanks,
>> John
>>
>> > "
>> >
>> > On Thu, Feb 2, 2017 at 9:30 PM, Shinobu Kinjo 
>> wrote:
>> >>
>> >> You may want to add this in your FIO recipe.
>> >>
>> >>  * exec_prerun=echo 3 > /proc/sys/vm/drop_caches
>> >>
>> >> Regards,
>> >>
>> >> On Fri, Feb 3, 2017 at 12:36 AM, Wido den Hollander 
>> wrote:
>> >> >
>> >> >> Op 2 februari 2017 om 15:35 schreef Ahmed Khuraidah
>> >> >> :
>> >> >>
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I am still confused about my CephFS sandbox.
>> >> >>
>> >> >> When I am performing simple FIO test into single file with size of
>> 3G I
>> >> >> have too many IOps:
>> >> >>
>> >> >> cephnode:~ # fio payloadrandread64k3G
>> >> >> test: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K,
>> ioengine=libaio,
>> >> >> iodepth=2
>> >> >> fio-2.13
>> >> >> Starting 1 process
>> >> >> test: Laying out IO file(s) (1 file(s) / 3072MB)
>> >> >> Jobs: 1 (f=1): [r(1)] [100.0% done] [277.8MB/0KB/0KB /s] [/0/0
>> >> >> iops]
>> >> >> [eta 00m:00s]
>> >> >> test: (groupid=0, jobs=1): err= 0: pid=3714: Thu Feb  2 07:07:01
>> 2017
>> >> >>   read : io=3072.0MB, bw=181101KB/s, iops=2829, runt= 17370msec
>> >> >> slat (usec): min=4, max=386, avg=12.49, stdev= 6.90
>> >> >> clat (usec): min=202, max=5673.5K, avg=690.81, stdev=361
>> >> >>
>> >> >>
>> >> >> But if I will change size to file to 320G, looks like I skip the
>> cache:
>> >> >>
>> >> >> cephnode:~ # fio payloadrandread64k320G
>> >> >> test: (g=0): rw=randread, bs=64K-64K/64K-64K/64K-64K,
>> ioengine=libaio,
>> >> >> iodepth=2
>> >> >> fio-2.13
>> >> >> Starting 1 process
>> >> >> Jobs: 1 (f=1): [r(1)] [100.0% done] [4740KB/0KB/0KB /s] [74/0/0
>> iops]
>> >> >> [eta
>> >> >> 00m:00s]
>> >> >> test: (groupid=0, jobs=1): err= 0

[ceph-users] ceph-monstore-tool rebuild assert error

2017-02-07 Thread Sean Sullivan
I have a Hammer cluster that died a bit ago (Hammer 0.94.9) consisting of 3
monitors and 630 OSDs spread across 21 storage hosts. The cluster's monitors
all died due to leveldb corruption and the cluster was shut down. I was
finally given word that I could try to revive the cluster this week!

https://github.com/ceph/ceph/blob/hammer/doc/rados/troubleshooting/troubleshooting-mon.rst#recovery-using-osds

I see that the latest hammer code in github has the ceph-monstore-tool
rebuild backport and that is what I am running on the cluster now (ceph
version 0.94.9-4530-g83af8cd (83af8cdaaa6d94404e6146b68e532a784e3cc99c). I
was able to scrape all 630 of the osds and am left with a 1.1G store.db
directory. Using python I was successfully able to list all of the keys and
values which was very promising. That said I can not run the final command
in the recovery-using-osds article (ceph-monstore-tool rebuild)
successfully.
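For anyone following along, the recovery procedure from that document boils
down to roughly the following; the paths and keyring location are
placeholders, and the real script loops over every OSD host (see the linked
page for the full version):

ms=/tmp/mon-store
mkdir -p $ms
# gather cluster map info from every OSD on this host into $ms
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --data-path $osd --op update-mon-db --mon-store-path $ms
done
# rebuild the mon store using a keyring holding the mon. and client.admin keys
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring
# then back up the monitor's broken store.db and replace it with $ms/store.db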

Whenever I run the tool (with the newly created admin keyring or with my
existing one) it errors with the following:


    0> 2017-02-17 15:00:47.516901 7f8b4d7408c0 -1 ./mon/MonitorDBStore.h:
   In function 'KeyValueDB::Iterator MonitorDBStore::get_iterator(const
   string&)' thread 7f8b4d7408c0 time 2017-02-07 15:00:47.516319


The complete trace is here
http://pastebin.com/NQE8uYiG

Can anyone lend a hand and tell me what may be wrong? I am able to iterate
over the leveldb database in python so the structure should be somewhat
okay? Am I SOL at this point? The cluster isn't production any longer and
while I don't have months of time I would really like to recover this
cluster just to see if it is at all possible.
-- 
- Sean:  I wrote this. -
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-07 Thread Nick Fisk
Yeah it’s probably just the fact that they have more PG’s so they will hold 
more data and thus serve more IO. As they have a fixed IO limit, they will 
always hit the limit first and become the bottleneck.

 

The main problem with reducing the filestore queue is that I believe you will 
start to lose the benefit of having IO’s queued up on the disk, so that the 
scheduler can re-arrange them to action them in the most efficient manner as the 
disk head moves across the platters. You might possibly see up to a 20% hit on 
performance, in exchange for more consistent client latency. 

 

From: Steve Taylor [mailto:steve.tay...@storagecraft.com] 
Sent: 07 February 2017 20:35
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Thanks, Nick.

 

One other data point that has come up is that nearly all of the blocked 
requests that are waiting on subops are waiting for OSDs with more PGs than the 
others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB OSDs. 
The cluster is well balanced based on OSD capacity, so those 7 OSDs 
individually have 33% more PGs than the others and are causing almost all of 
the blocked requests. It appears that map updates are generally not blocking 
long enough to show up as blocked requests.

 

I set the reweight on those 7 OSDs to 0.75 and things are backfilling now. I’ll 
test some more when the PG counts per OSD are more balanced and see what I get. 
I’ll also play with the filestore queue. I was telling some of my colleagues 
yesterday that this looked likely to be related to buffer bloat somewhere. I 
appreciate the suggestion.

 

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |

If you are not the intended recipient of this message or received it
erroneously, please notify the sender and delete it, together with any
attachments, and be advised that any dissemination or copying of this message
is prohibited.

From: Nick Fisk [mailto:n...@fisk.me.uk] 
Sent: Tuesday, February 7, 2017 10:25 AM
To: Steve Taylor ; ceph-users@lists.ceph.com
Subject: RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Hi Steve,

 

From what I understand, the issue is not with the queueing in Ceph, which is
correctly moving client IO to the front of the queue. The problem lies below
what Ceph controls, ie the scheduler and disk layer in Linux. Once the IO’s
leave Ceph it’s a bit of a free for all and the client IO’s tend to get lost
in large disk queues surrounded by all the snap trim IO’s.

 

The workaround Sam is working on will limit the amount of snap trims that are 
allowed to run, which I believe will have a similar effect to the sleep 
parameters in pre-jewel clusters, but without pausing the whole IO thread.

 

Ultimately the solution requires Ceph to be able to control the queuing of IO’s 
at the lower levels of the kernel. Whether this is via some sort of tagging per 
IO (currently CFQ is only per thread/process) or some other method, I don’t 
know. I was speaking to Sage and he thinks the easiest method might be to 
shrink the filestore queue so that you don’t get buffer bloat at the disk 
level. You should be able to test this out pretty easily now by changing the 
parameter, probably around a queue of 5-10 would be about right for spinning 
disks. It’s a trade off of peak throughput vs queue latency though.
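For reference, assuming the filestore queue parameter meant here is
filestore_queue_max_ops (an assumption, it is not named above), a quick way to
try this without restarting anything would be something like (value
illustrative):

ceph tell osd.* injectargs '--filestore_queue_max_ops 10'

and, if it helps, make it persistent in ceph.conf:

[osd]
filestore queue max ops = 10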

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 07 February 2017 17:01
To: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

As I look at more of these stuck ops, it looks like more of them are actually 
waiting on subops than on osdmap updates, so maybe there is still some headway 
to be made with the weighted priority queue settings. I do see OSDs waiting for 
map updates all the time, but they aren’t blocking things as much as the subops 
are. Thoughts?

 

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |

If you are not the intended recipient of this message or received it
erroneously, please notify the sender and delete it, together with any
attachments, and be advised that any dissemination or copying of this message
is prohibited.

From: Steve T

Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-07 Thread Steve Taylor
Thanks, Nick.

One other data point that has come up is that nearly all of the blocked 
requests that are waiting on subops are waiting for OSDs with more PGs than the 
others. My test cluster has 184 OSDs, 177 of which are 3TB, with 7 4TB OSDs. 
The cluster is well balanced based on OSD capacity, so those 7 OSDs 
individually have 33% more PGs than the others and are causing almost all of 
the blocked requests. It appears that map updates are generally not blocking 
long enough to show up as blocked requests.

I set the reweight on those 7 OSDs to 0.75 and things are backfilling now. I’ll 
test some more when the PG counts per OSD are more balanced and see what I get. 
I’ll also play with the filestore queue. I was telling some of my colleagues 
yesterday that this looked likely to be related to buffer bloat somewhere. I 
appreciate the suggestion.




Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.


From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Tuesday, February 7, 2017 10:25 AM
To: Steve Taylor ; ceph-users@lists.ceph.com
Subject: RE: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Hi Steve,

From what I understand, the issue is not with the queueing in Ceph, which is 
correctly moving client IO to the front of the queue. The problem lies below 
what Ceph controls, ie the scheduler and disk layer in Linux. Once the IO’s 
leave Ceph it’s a bit of a free for all and the client IO’s tend to get lost in 
large disk queues surrounded by all the snap trim IO’s.

The workaround Sam is working on will limit the amount of snap trims that are 
allowed to run, which I believe will have a similar effect to the sleep 
parameters in pre-jewel clusters, but without pausing the whole IO thread.

Ultimately the solution requires Ceph to be able to control the queuing of IO’s 
at the lower levels of the kernel. Whether this is via some sort of tagging per 
IO (currently CFQ is only per thread/process) or some other method, I don’t 
know. I was speaking to Sage and he thinks the easiest method might be to 
shrink the filestore queue so that you don’t get buffer bloat at the disk 
level. You should be able to test this out pretty easily now by changing the 
parameter, probably around a queue of 5-10 would be about right for spinning 
disks. It’s a trade off of peak throughput vs queue latency though.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 07 February 2017 17:01
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

As I look at more of these stuck ops, it looks like more of them are actually 
waiting on subops than on osdmap updates, so maybe there is still some headway 
to be made with the weighted priority queue settings. I do see OSDs waiting for 
map updates all the time, but they aren’t blocking things as much as the subops 
are. Thoughts?



Steve Taylor | Senior Software Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.


From: Steve Taylor
Sent: Tuesday, February 7, 2017 9:13 AM
To: 'ceph-users@lists.ceph.com' 
mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Sorry, I lost the previous thread on this. I apologize for the resulting 
incomplete reply.

The issue that we’re having with Jewel, as David Turner mentioned, is that we 
can’t seem to throttle snap trimming sufficiently to prevent it from blocking 
I/O requests. On further investigation, I encountered 
osd_op_pq_max_tokens_per_priority, which should be able to be used in 
conjunction with ‘osd_op_queue = wpq’ to govern the availability of queue 
positions fo

Re: [ceph-users] osd being down and out

2017-02-07 Thread David Turner
The noup and/or noin flags could be useful for this.  Depending on why you want 
to prevent it rejoining the cluster you would use one or the other, or both.
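For reference, those flags are cluster-wide and are set/cleared like this:

ceph osd set noin    # OSDs that come back up are not automatically marked "in"
ceph osd set noup    # down OSDs are not automatically marked "up"

# and to return to normal behaviour:
ceph osd unset noin
ceph osd unset noup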



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Patrick 
McGarry [pmcga...@redhat.com]
Sent: Tuesday, February 07, 2017 10:14 AM
To: nigel davies; Ceph-User
Subject: Re: [ceph-users] osd being down and out

Moving this to ceph-user where it belongs.


On Tue, Feb 7, 2017 at 8:33 AM, nigel davies  wrote:
> Hay
>
> Is there any way to set Ceph so that if an OSD goes down and comes back up,
> Ceph will not put it back into service?
>
> Thanks



--

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph pool resize

2017-02-07 Thread Vikhyat Umrao
On Tue, Feb 7, 2017 at 12:15 PM, Patrick McGarry 
wrote:

> Moving this to ceph-user
>
> On Mon, Feb 6, 2017 at 3:51 PM, nigel davies  wrote:
> > Hay
> >
> > I am helping to run a small Ceph cluster with a two-node setup.
> >
> > We have recently bought a 3rd storage node and management wants to
> > increase the replication from two to three.
> >
> > As soon as I changed the pool size from 2 to 3, the cluster goes into
> > warning.
>

Can you please attach the output of the command below in a pastebin:

$ ceph osd dump | grep -i pool

and the decompiled crush map (crushmap.txt) in a pastebin:

$ ceph osd getcrushmap -o /tmp/crushmap
$ crushtool -d /tmp/crushmap -o /tmp/crushmap.txt


> >
> >  health HEALTH_WARN
> > 512 pgs degraded
> > 512 pgs stuck unclean
> > 512 pgs undersized
> > recovery 5560/19162 objects degraded (29.016%)
> > election epoch 50, quorum 0,1
> >  osdmap e243: 20 osds: 20 up, 20 in
> > flags sortbitwise
> >   pgmap v79260: 2624 pgs, 3 pools, 26873 MB data, 6801 objects
> > 54518 MB used, 55808 GB / 55862 GB avail
> > 5560/19162 objects degraded (29.016%)
> > 2112 active+clean
> >  512 active+undersized+degraded
> >
> > The cluster is not recovering by itself; any help would be greatly appreciated.
> >
> >
>
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Workaround for XFS lockup resulting in down OSDs

2017-02-07 Thread Thorvald Natvig
Hi,

We've encountered a small "kernel feature" in XFS using Filestore. We
have a workaround, and would like to share in case others have the
same problem.

Under high load, on slow storage, with lots of dirty buffers and low
memory, there's a design choice with unfortunate side-effects if you
have multiple XFS filesystems mounted, such as often is the case when
you have a JBOD full of drives. This results in network traffic
stalling, leading to OSDs failing heartbeats.

In short, when the kernel needs to allocate memory for anything, it
first figures out how many pages it needs, then goes to each
filesystem and says "release N pages". In XFS, that's implemented as
follows:

- For each AG (8 in our case):
  - Try to lock AG
  - Release unused buffers, up to N
- If this point is reached, and we didn't manage to release at least N
pages, try again, but this time wait for the lock.

That last part is the problem; if the lock is currently held by, say,
another kernel thread that is currently flushing dirty buffers, then
the memory allocation stalls. However, we have 30 other XFS
filesystems that could release memory, and the kernel also has a lot
of non-filesystem memory that can be released.

This manifests as OSDs going offline during high load, with other OSDs
claiming that the OSD stopped responding to health checks. This is
especially prevalent during cache tier flushing and large backfills,
which can put very heavy load on the write buffers, thus increasing
the probability of one of these events.
In reality, the OSD is stuck in the kernel, trying to allocate buffers
to build a TCP packet to answer the network message. As soon as the
buffers are flushed (which can take a while), the OSD recovers, but
now has to deal with being marked down in the monitor maps.

The following systemtap changes the kernel behavior to not do the lock-waiting:

probe module("xfs").function("xfs_reclaim_inodes_ag").call {
 $flags = $flags & 2
}

Save it to a file, and run with 'stap -v -g -d kernel
--suppress-time-limits '. We've been running this for a
few weeks, and the issue is completely gone.

There was a writeup on the XFS mailing list a while ago about the same
issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ), but
unfortunately it didn't result in consensus on a patch. This problem
won't exist in BlueStore, so we consider the systemtap approach a
workaround until we're ready to deploy BlueStore.

- Thorvald
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-07 Thread Nick Fisk
Hi Steve,

 

From what I understand, the issue is not with the queueing in Ceph, which is
correctly moving client IO to the front of the queue. The problem lies below
what Ceph controls, ie the scheduler and disk layer in Linux. Once the IO’s
leave Ceph it’s a bit of a free for all and the client IO’s tend to get lost
in large disk queues surrounded by all the snap trim IO’s.

 

The workaround Sam is working on will limit the amount of snap trims that are 
allowed to run, which I believe will have a similar effect to the sleep 
parameters in pre-jewel clusters, but without pausing the whole IO thread.

 

Ultimately the solution requires Ceph to be able to control the queuing of IO’s 
at the lower levels of the kernel. Whether this is via some sort of tagging per 
IO (currently CFQ is only per thread/process) or some other method, I don’t 
know. I was speaking to Sage and he thinks the easiest method might be to 
shrink the filestore queue so that you don’t get buffer bloat at the disk 
level. You should be able to test this out pretty easily now by changing the 
parameter, probably around a queue of 5-10 would be about right for spinning 
disks. It’s a trade off of peak throughput vs queue latency though.

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steve 
Taylor
Sent: 07 February 2017 17:01
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

As I look at more of these stuck ops, it looks like more of them are actually 
waiting on subops than on osdmap updates, so maybe there is still some headway 
to be made with the weighted priority queue settings. I do see OSDs waiting for 
map updates all the time, but they aren’t blocking things as much as the subops 
are. Thoughts?

 

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |

If you are not the intended recipient of this message or received it
erroneously, please notify the sender and delete it, together with any
attachments, and be advised that any dissemination or copying of this message
is prohibited.

From: Steve Taylor 
Sent: Tuesday, February 7, 2017 9:13 AM
To: 'ceph-users@lists.ceph.com' mailto:ceph-users@lists.ceph.com> >
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

 

Sorry, I lost the previous thread on this. I apologize for the resulting 
incomplete reply.

 

The issue that we’re having with Jewel, as David Turner mentioned, is that we 
can’t seem to throttle snap trimming sufficiently to prevent it from blocking 
I/O requests. On further investigation, I encountered 
osd_op_pq_max_tokens_per_priority, which should be able to be used in 
conjunction with ‘osd_op_queue = wpq’ to govern the availability of queue 
positions for various operations using costs if I understand correctly. I’m 
testing with RBDs using 4MB objects, so in order to leave plenty of room in the 
weighted priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority 
to 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially 
reserve 32MB in the queue for client I/O operations, which are prioritized 
higher and therefore shouldn’t get blocked.

 

I still see blocked I/O requests, and when I dump in-flight ops, they show ‘op 
must wait for map.’ I assume this means that what’s blocking the I/O requests 
at this point is all of the osdmap updates caused by snap trimming, and not the 
actual snap trimming itself starving the ops of op threads. Hammer is able to 
mitigate this with osd_snap_trim_sleep by directly throttling snap trimming and 
therefore causing less frequent osdmap updates, but there doesn’t seem to be a 
good way to accomplish the same thing with Jewel.

 

First of all, am I understanding these settings correctly? If so, are there 
other settings that could potentially help here, or do we just need something 
like Sam already mentioned that can sort of reserve threads for client I/O 
requests? Even then it seems like we might have issues if we can’t also 
throttle snap trimming. We delete a LOT of RBD snapshots on a daily basis, 
which we recognize is an extreme use case. Just wondering if there’s something 
else to try or if we need to start working toward implementing something new 
ourselves to handle our use case better.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph pool resize

2017-02-07 Thread Patrick McGarry
Moving this to ceph-user

On Mon, Feb 6, 2017 at 3:51 PM, nigel davies  wrote:
> Hay
>
> I am helping to run a small Ceph cluster with a two-node setup.
>
> We have recently bought a 3rd storage node and management wants to
> increase the replication from two to three.
>
> As soon as I changed the pool size from 2 to 3, the cluster goes into
> warning.
>
>  health HEALTH_WARN
> 512 pgs degraded
> 512 pgs stuck unclean
> 512 pgs undersized
> recovery 5560/19162 objects degraded (29.016%)
> election epoch 50, quorum 0,1
>  osdmap e243: 20 osds: 20 up, 20 in
> flags sortbitwise
>   pgmap v79260: 2624 pgs, 3 pools, 26873 MB data, 6801 objects
> 54518 MB used, 55808 GB / 55862 GB avail
> 5560/19162 objects degraded (29.016%)
> 2112 active+clean
>  512 active+undersized+degraded
>
> The cluster is not recovering by itself; any help would be greatly appreciated.
>
>



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Latency between datacenters

2017-02-07 Thread Daniel Picolli Biazus
Hi Guys,

I have been planning to deploy a Ceph Cluster with the following hardware:

*OSDs:*

4 Servers Xeon D 1520 / 32 GB RAM / 5 x 6TB SAS 2 (6 OSD daemon per server)

Monitor/Rados Gateways

5 Servers Xeon D 1520 32 GB RAM / 2 x 1TB SAS 2 (5 MON daemon/ 4 rados
daemon)

Usage: Object Storage only

However I need to deploy 2 OSD and 3 MON Servers in Miami datacenter
and another 2 OSD and 2 MON Servers in Montreal Datacenter. The latency
between these datacenters is 50 milliseconds.
   Considering this scenario, should I use Federated Gateways or should I
use a single Cluster ?

Thanks in advance
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd being down and out

2017-02-07 Thread Patrick McGarry
Moving this to ceph-user where it belongs.


On Tue, Feb 7, 2017 at 8:33 AM, nigel davies  wrote:
> Hay
>
> Is there any way to set Ceph so that if an OSD goes down and comes back up,
> Ceph will not put it back into service?
>
> Thanks



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-07 Thread Steve Taylor
As I look at more of these stuck ops, it looks like more of them are actually 
waiting on subops than on osdmap updates, so maybe there is still some headway 
to be made with the weighted priority queue settings. I do see OSDs waiting for 
map updates all the time, but they aren’t blocking things as much as the subops 
are. Thoughts?




Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.


From: Steve Taylor
Sent: Tuesday, February 7, 2017 9:13 AM
To: 'ceph-users@lists.ceph.com' 
Subject: Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

Sorry, I lost the previous thread on this. I apologize for the resulting 
incomplete reply.

The issue that we’re having with Jewel, as David Turner mentioned, is that we 
can’t seem to throttle snap trimming sufficiently to prevent it from blocking 
I/O requests. On further investigation, I encountered 
osd_op_pq_max_tokens_per_priority, which should be able to be used in 
conjunction with ‘osd_op_queue = wpq’ to govern the availability of queue 
positions for various operations using costs if I understand correctly. I’m 
testing with RBDs using 4MB objects, so in order to leave plenty of room in the 
weighted priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority 
to 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially 
reserve 32MB in the queue for client I/O operations, which are prioritized 
higher and therefore shouldn’t get blocked.

I still see blocked I/O requests, and when I dump in-flight ops, they show ‘op 
must wait for map.’ I assume this means that what’s blocking the I/O requests 
at this point is all of the osdmap updates caused by snap trimming, and not the 
actual snap trimming itself starving the ops of op threads. Hammer is able to 
mitigate this with osd_snap_trim_sleep by directly throttling snap trimming and 
therefore causing less frequent osdmap updates, but there doesn’t seem to be a 
good way to accomplish the same thing with Jewel.

First of all, am I understanding these settings correctly? If so, are there 
other settings that could potentially help here, or do we just need something 
like Sam already mentioned that can sort of reserve threads for client I/O 
requests? Even then it seems like we might have issues if we can’t also 
throttle snap trimming. We delete a LOT of RBD snapshots on a daily basis, 
which we recognize is an extreme use case. Just wondering if there’s something 
else to try or if we need to start working toward implementing something new 
ourselves to handle our use case better.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_snap_trim_sleep keeps locks PG during sleep?

2017-02-07 Thread Steve Taylor
Sorry, I lost the previous thread on this. I apologize for the resulting 
incomplete reply.

The issue that we’re having with Jewel, as David Turner mentioned, is that we 
can’t seem to throttle snap trimming sufficiently to prevent it from blocking 
I/O requests. On further investigation, I encountered 
osd_op_pq_max_tokens_per_priority, which should be able to be used in 
conjunction with ‘osd_op_queue = wpq’ to govern the availability of queue 
positions for various operations using costs if I understand correctly. I’m 
testing with RBDs using 4MB objects, so in order to leave plenty of room in the 
weighted priority queue for client I/O, I set osd_op_pq_max_tokens_per_priority 
to 64MB and osd_snap_trim_cost to 32MB+1. I figured this should essentially 
reserve 32MB in the queue for client I/O operations, which are prioritized 
higher and therefore shouldn’t get blocked.
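(For concreteness, the settings described above correspond to something like
the following in ceph.conf; the byte values are simply 64MB and 32MB+1 written
out, not a recommendation.)

[osd]
osd op queue = wpq
osd op pq max tokens per priority = 67108864
osd snap trim cost = 33554433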

I still see blocked I/O requests, and when I dump in-flight ops, they show ‘op 
must wait for map.’ I assume this means that what’s blocking the I/O requests 
at this point is all of the osdmap updates caused by snap trimming, and not the 
actual snap trimming itself starving the ops of op threads. Hammer is able to 
mitigate this with osd_snap_trim_sleep by directly throttling snap trimming and 
therefore causing less frequent osdmap updates, but there doesn’t seem to be a 
good way to accomplish the same thing with Jewel.

First of all, am I understanding these settings correctly? If so, are there 
other settings that could potentially help here, or do we just need something 
like Sam already mentioned that can sort of reserve threads for client I/O 
requests? Even then it seems like we might have issues if we can’t also 
throttle snap trimming. We delete a LOT of RBD snapshots on a daily basis, 
which we recognize is an extreme use case. Just wondering if there’s something 
else to try or if we need to start working toward implementing something new 
ourselves to handle our use case better.



Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 |



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC pool migrations

2017-02-07 Thread David Turner
If you successfully get every object into the cache tier and then flush it to 
the new pool, you've copied every object in your cluster twice.  And as you 
mentioned, you can't guarantee that the flush will do what you need.  I don't 
have much experience with RGW, but would it work to write a loop through your 
objects (rados ls) to access them and copy them directly to the new pool using 
RGW?  Of course this will be problematic to getting everything if the cluster 
is allowed to receive writes during the copy, but if you can ensure that it 
would be read only during this time then it should work.
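A rough sketch of such a loop with plain rados, assuming the placeholder pool
names below and a quiesced cluster; note this copies only object data, not
xattrs or omap, so it may not be sufficient for RGW index/metadata objects:

SRC=old-pool
DST=new-pool
rados -p "$SRC" ls | while read -r obj; do
  rados -p "$SRC" get "$obj" - | rados -p "$DST" put "$obj" -
done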



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943



If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.




From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Blair 
Bethwaite [blair.bethwa...@gmail.com]
Sent: Tuesday, February 07, 2017 8:32 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] EC pool migrations

On 7 February 2017 at 23:50, Blair Bethwaite  wrote:
> 1) insert a large enough temporary replicated pool as a cache tier
> 2) somehow force promotion of every object into the cache (don't see
> any way to do that other than actually read them - but at least some
> creative scripting could do that in parallel)
> 3) once #objects in cache = #objects in old backing pool
> then stop radosgw services
> 4) remove overlay and tier remove
> 6) now we should have identical or newer data in the temporary
> replicated pool and no caching relationship
> then add the temporary replicated pool as a tier (--force-nonempty) to
> the new EC pool
> 7) finally cache-flush-evict-all and remove the temporary replicated pool

That all seemed to work right up to the final step. flushing/evicting
of course doesn't guarantee the contents of the cache get written to
the backing pool. This must be because it's properties in the cache
that decide whether to forward anything down to the backing pool and
as these objects haven't been changed the cache just throws them
away... Is there a way around this?

--
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC pool migrations

2017-02-07 Thread Blair Bethwaite
On 7 February 2017 at 23:50, Blair Bethwaite  wrote:
> 1) insert a large enough temporary replicated pool as a cache tier
> 2) somehow force promotion of every object into the cache (don't see
> any way to do that other than actually read them - but at least some
> creative scripting could do that in parallel)
> 3) once #objects in cache = #objects in old backing pool
> then stop radosgw services
> 4) remove overlay and tier remove
> 6) now we should have identical or newer data in the temporary
> replicated pool and no caching relationship
> then add the temporary replicated pool as a tier (--force-nonempty) to
> the new EC pool
> 7) finally cache-flush-evict-all and remove the temporary replicated pool

That all seemed to work right up to the final step. flushing/evicting
of course doesn't guarantee the contents of the cache get written to
the backing pool. This must be because it's properties in the cache
that decide whether to forward anything down to the backing pool and
as these objects haven't been changed the cache just throws them
away... Is there a way around this?

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon unable to reach quorum

2017-02-07 Thread lee_yiu_ch...@yahoo.com

lee_yiu_ch...@yahoo.com wrote on 18/1/2017 11:17:

Dear all,

I have a ceph installation (dev site) with two nodes, each running a mon daemon 
and osd daemon.
(Yes, I know running a cluster of two mon is bad, but I have no choice since I 
only have two nodes.)

Now the two nodes have been migrated to another datacenter, but after they boot
up the mon daemons are
unable to reach quorum. How can I proceed? (If there is no way to recover, I
can accept the loss, but
I wish to know how to avoid this happening again.)

(omitted log here)

Turns out that we did not have the proper MTU configured on our new switch after moving the servers to 
the new datacenter (we thought the switch was already configured with the proper MTU), but we only discovered 
this issue after we nuked all the nodes and reinstalled Ceph, which showed the exact same symptoms 
after reinstalling. Only then did we realize there might be a network problem, and we had a hard time 
finding the root cause.


Well, we can take this chance to test bluestore in ceph Kraken...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] EC pool migrations

2017-02-07 Thread Blair Bethwaite
Hi all,

Wondering if anyone has come up with a quick and minimal impact way of
moving data between erasure coded pools? We want to shrink an existing
EC pool (also changing the EC profile at the same time) that backs our
main RGW buckets. Thus far the only successful way I've found of
managing the switch-over uses rados cppool like:

1) stop radosgw services
2) rados cppool  
3) ceph osd pool rename  .old
4) ceph osd pool rename  
5) start radosgw services

This is slow and error prone. From what I can tell it actually
transfers the data between the pools via the client. Not good. Also
means radosgw service is offline throughout the transfer.

My next idea was to use cache tiering, but the obvious process there
(insert new pool as backing under old pool as cache) doesn't work
because the cache tier cannot be erasure coded. I'm now wondering if
it would be feasible to:

1) insert a large enough temporary replicated pool as a cache tier
2) somehow force promotion of every object into the cache (don't see
any way to do that other than actually read them - but at least some
creative scripting could do that in parallel)
3) once #objects in cache = #objects in old backing pool
then stop radosgw services
4) remove overlay and tier remove
6) now we should have identical or newer data in the temporary
replicated pool and no caching relationship
then add the temporary replicated pool as a tier (--force-nonempty) to
the new EC pool
7) finally cache-flush-evict-all and remove the temporary replicated pool
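(For reference, the tiering plumbing behind steps 1, 4, 6 and 7 above is
roughly the following; the pool names are placeholders and writeback cache
mode is assumed:)

# step 1: put the temporary replicated pool in front of the old EC pool
ceph osd tier add old-ec-pool temp-pool
ceph osd tier cache-mode temp-pool writeback
ceph osd tier set-overlay old-ec-pool temp-pool
# step 4: detach it again once everything has been promoted
ceph osd tier remove-overlay old-ec-pool
ceph osd tier remove old-ec-pool temp-pool
# step 6: attach the now-populated pool in front of the new EC pool
ceph osd tier add new-ec-pool temp-pool --force-nonempty
ceph osd tier cache-mode temp-pool writeback
ceph osd tier set-overlay new-ec-pool temp-pool
# step 7: flush/evict everything down to the new EC pool
rados -p temp-pool cache-flush-evict-all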

I'm starting to work through testing this now but would appreciate if
anyone can save me some time or suggest a better option.

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "Numerical argument out of domain" error occurs during rbd export-diff | rbd import-diff

2017-02-07 Thread Bernhard J . M . Grün
Hello,

I just created a bug report for this: http://tracker.ceph.com/issues/18844

Not being able to import-diff already exported diffs could result in a
loss of data, so I thought it would be wise to create a bug report
for it.

Best regards,

Bernhard J. M. Grün

-- 
Freundliche Grüße

Bernhard J. M. Grün, Püttlingen, Deutschland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph -s require_jewel_osds pops up and disappears

2017-02-07 Thread Bernhard J . M . Grün
Hi,

I also had that flickering indicator.
The solution for me was quite simple: I forgot to restart one of the
monitors after the upgrade (this is not done automatically on CentOS 7 at
least).
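(For anyone hitting the same thing, on CentOS 7 with systemd the restart is
just the following, where <mon-id> is the monitor's name, often the short
hostname:)

systemctl restart ceph-mon@<mon-id>
# or restart every mon instance on the host:
systemctl restart ceph-mon.target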

Hope this helps

Bernhard

Götz Reinicke  wrote on Tue, 7 Feb 2017
at 11:39:

> Hi,
>
> Ceph -s shows require_jewel_osds flashing on and off like a turn indicator.
>
> I recently did an upgrade from CentOS 7.2 to 7.3 and Ceph 10.2.3 to 10.2.5.
>
> Maybe I forgot to set an option?
>
> I thought I did a „ ceph osd set require_jewel_osds“ as described in the
> release notes https://ceph.com/geen-categorie/v10-2-4-jewel-released/
>
> Thanks for hints on what to check, and maybe switch on and off again.
>
> Regards . Götz
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 
Freundliche Grüße

Bernhard J. M. Grün, Püttlingen, Deutschland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] virt-install into rbd hangs during Anaconda package installation

2017-02-07 Thread Tracy Reed
On Tue, Feb 07, 2017 at 12:25:08AM PST, koukou73gr spake thusly:
> On 2017-02-07 10:11, Tracy Reed wrote:
> > Weird. Now the VMs that were hung in interruptable wait state have now
> > disappeared. No idea why.
> 
> Have you tried the same procedure but with local storage instead?

Yes. I have local storage and iSCSI storage and they both install just
fine.

-- 
Tracy Reed


signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph -s require_jewel_osds pops up and disappears

2017-02-07 Thread Götz Reinicke
Hi,

Ceph -s shows require_jewel_osds flashing on and off like a turn indicator.

I recently did an upgrade from CentOS 7.2 to 7.3 and Ceph 10.2.3 to 10.2.5.

Maybe I forgot to set an option?

I thought I did a „ ceph osd set require_jewel_osds“ as described in the 
release notes https://ceph.com/geen-categorie/v10-2-4-jewel-released/

Thanks for hints on what to check, and maybe switch on and off again.

Regards . Götz
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] virt-install into rbd hangs during Anaconda package installation

2017-02-07 Thread koukou73gr
On 2017-02-07 10:11, Tracy Reed wrote:
> Weird. Now the VMs that were hung in interruptable wait state have now
> disappeared. No idea why.

Have you tried the same procedure but with local storage instead?

-K.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] virt-install into rbd hangs during Anaconda package installation

2017-02-07 Thread Tracy Reed
Weird. Now the VMs that were hung in interruptable wait state have now
disappeared. No idea why.

Additional information:

ceph-mds-10.2.3-0.el7.x86_64
python-cephfs-10.2.3-0.el7.x86_64
ceph-osd-10.2.3-0.el7.x86_64
ceph-radosgw-10.2.3-0.el7.x86_64
libcephfs1-10.2.3-0.el7.x86_64
ceph-common-10.2.3-0.el7.x86_64
ceph-base-10.2.3-0.el7.x86_64
ceph-10.2.3-0.el7.x86_64
ceph-selinux-10.2.3-0.el7.x86_64
ceph-mon-10.2.3-0.el7.x86_64

cluster b2b00aae-f00d-41b4-a29b-58859aa41375
 health HEALTH_OK
 monmap e11: 3 mons at 
{ceph01=10.0.5.2:6789/0,ceph03=10.0.5.4:6789/0,ceph07=10.0.5.13:6789/0}
election epoch 76, quorum 0,1,2 ceph01,ceph03,ceph07
 osdmap e14396: 70 osds: 66 up, 66 in
flags sortbitwise,require_jewel_osds
  pgmap v7116569: 1664 pgs, 3 pools, 7876 GB data, 1969 kobjects
23648 GB used, 24310 GB / 47958 GB avail
1661 active+clean
   2 active+clean+scrubbing+deep
   1 active+clean+scrubbing
  client io 839 kB/s wr, 0 op/s rd, 159 op/s wr


On Mon, Feb 06, 2017 at 06:57:23PM PST, Tracy Reed spake thusly:
> This is what I'm doing on my CentOS 7/KVM/virtlib server:
> 
> rbd create --size 20G pool/vm.mydomain.com
> 
> rbd map pool/vm.mydomain.com --name client.admin
> 
> virt-install --name vm.mydomain.com --ram 2048 --disk 
> path=/dev/rbd/pool/vm.mydomain.com  --vcpus 1  --os-type linux --os-variant 
> rhel6 --network bridge=dmz --graphics none --console pty,target_type=serial 
> --location http://repo.mydomain.com/centos/7/os/x86_64 --extra-args 
> "ip=en0:dhcp ks=http://repo.mydomain.com/ks/ks.cfg.vm console=ttyS0  
> ksdevice=eth0 
> inst.repo=http://10.0.10.5/http://repo.mydomain.com/centos/7/os/x86_64";
> 
> And then it creates partitions, filesystems (xfs), and
> starts installing packages. 9 times out of 10 it hangs while
> installing packages. And I have no idea why. I can't kill
> the VM. 
> 
> Trying to destroy it shows:
> 
> virsh # destroy vm.mydomain.com
> error: Failed to destroy domain vm.mydomain.com
> error: Failed to terminate process 19629 with SIGKILL:
> Device or resource busy
> 
> and then virsh ls shows:
> 
> virsh ls shows:
> 
> 127   vm.mydomain.comin shutdown
> 
> The log for this vm in
> /var/log/libvirt/qemu/vm.mydomain.com contains only:
> 
> 2017-02-06 08:14:12.256+: starting up libvirt version:
> 2.0.0, package: 10.el7_3.2 (CentOS BuildSystem
> , 2016-12-06-19:53:38,
> c1bm.rdu2.centos.org), qemu version: 1.5.3
> (qemu-kvm-1.5.3-105.el7_2.7), hostname: cpu01.mydomain.com
> LC_ALL=C
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
> QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name
> secclass2.mydomain.com -S -machine
> pc-i440fx-rhel7.0.0,accel=kvm,usb=off -cpu
> SandyBridge,+vme,+f16c,+rdrand,+fsgsbase,+smep,+erms -m
> 2048 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1
> -uuid 5dadf01e-b996-411f-b95f-26ce6b790bae -nographic
> -no-user-config -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-127-secclass2.mydomain./monitor.sock,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc
> base=utc,driftfix=slew -global
> kvm-pit.lost_tick_policy=discard -no-hpet -no-reboot
> -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1
> -boot strict=on -kernel
> /var/lib/libvirt/boot/virtinst-vmlinuz.9Ax4zt -initrd
> /var/lib/libvirt/boot/virtinst-initrd.img.ALJE43 -append
> 'ip=en0:dhcp ks=http://util1.mydomain.com/ks/ks.cfg.vm.
> console=ttyS0  ksdevice=eth0
> inst.repo=http://10.0.10.5/http://util1.mydomain.com/centos/7/os/x86_64
> method=http://util1.mydomain.com/centos/7/os/x86_64'
> -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7
> -device
> ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5
> -device
> ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1
> -device
> ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2
> -device
> virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4
> -drive
> file=/dev/rbd/security-class/secclass2.mydomain.com,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> -netdev tap,fd=55,id=hostnet0,vhost=on,vhostfd=57 -device
> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:87:d2:12,bus=pci.0,addr=0x3
> -chardev pty,id=charserial0 -device
> isa-serial,chardev=charserial0,id=serial0 -chardev
> socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-127-secclass2.mydomain./org.qemu.guest_agent.0,server,nowait
> -device
> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
> -device usb-tablet,id=input0,bus=usb.0,port=1 -device
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg
> timestamp=on
> char device redirected to /dev/pts/24 (label charserial0)
> qemu: terminating on signal 15 from pid 23385
> 
> Any