Re: [ceph-users] cluster is not stable

2019-03-12 Thread huang jun
Can you check the value of the osd_beacon_report_interval option? The default
is 300; you can try setting it to 60. You could also turn on debug_ms=1 and
debug_mon=10 to get more information.
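
For example (a sketch only; adjust the OSD id and the release-specific syntax
for your cluster):

$ ceph daemon osd.1 config get osd_beacon_report_interval     # on the OSD host
$ ceph tell osd.* injectargs '--osd_beacon_report_interval 60'
$ ceph tell mon.* injectargs '--debug_ms 1 --debug_mon 10'    # extra mon logging while debugging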


Zhenshi Zhou  于2019年3月13日周三 下午1:20写道:
>
> Hi,
>
> The servers are connected to the same switch.
> I can ping from any one of the servers to the other servers
> without packet loss, and the average round-trip time
> is under 0.1 ms.
>
> Thanks
>
> Ashley Merrick  于2019年3月13日周三 下午12:06写道:
>>
>> Can you ping all your OSD servers from all your mons, and ping your mons 
>> from all your OSD servers?
>>
>> I’ve seen this where a route wasn’t working one direction, so it made OSDs 
>> flap when it used that mon to check availability:
>>
>> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou  wrote:
>>>
>>> After checking the network and syslog/dmesg, I think it's not a network or 
>>> hardware issue. Now there are some
>>> OSDs being marked down every 15 minutes.
>>>
>>> here is ceph.log:
>>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 : 
>>> cluster [INF] Cluster is now healthy
>>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 : 
>>> cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6758 : 
>>> cluster [INF] osd.2 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6759 : 
>>> cluster [INF] osd.4 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6760 : 
>>> cluster [INF] osd.6 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6761 : 
>>> cluster [INF] osd.7 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6762 : 
>>> cluster [INF] osd.10 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6763 : 
>>> cluster [INF] osd.11 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6764 : 
>>> cluster [INF] osd.12 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6765 : 
>>> cluster [INF] osd.13 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6766 : 
>>> cluster [INF] osd.14 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6767 : 
>>> cluster [INF] osd.15 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706273 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6768 : 
>>> cluster [INF] osd.16 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706312 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6769 : 
>>> cluster [INF] osd.17 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706351 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6770 : 
>>> cluster [INF] osd.18 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706385 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6771 : 
>>> cluster [INF] osd.19 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706423 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6772 : 
>>> cluster [INF] osd.20 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706503 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6773 : 
>>> cluster [INF] osd.22 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706549 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6774 : 
>>> cluster [INF] osd.23 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706587 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6775 : 
>>> cluster [INF] osd.25 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706625 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6776 : 
>>> cluster [INF] osd.26 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706665 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6777 : 
>>> cluster [INF] osd.27 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706703 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6778 : 
>>> cluster [INF] osd.28 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706741 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6779 : 
>>> cluster [INF] osd.30 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706779 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6780 : 
>>> cluster [INF] osd.31 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706817 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6781 : 
>>> cluster [INF] osd.33 marked down after no beacon for 900.067020 seconds
>>> 2019-03-13 11:21:21.706856 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6782 : 
>>> cluster [INF] osd.34 marked down aft

Re: [ceph-users] S3 data on specific storage systems

2019-03-12 Thread Konstantin Shalygin

I have a cluster with SSD and HDD storage. I wonder how to configure S3
buckets on HDD storage backends only.
Do I need to create pools on this particular storage and define radosgw
placement with those or there is a better or easier way to achieve this ?


Just assign your "crush hdd rule" to your data pool via `ceph osd 
pool set <pool> crush_rule <rule>`.
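
For example, something like this (a sketch; the rule name, pool name and
device class are placeholders for your own setup):

$ ceph osd crush rule create-replicated rgw-hdd default host hdd
$ ceph osd pool set default.rgw.buckets.data crush_rule rgw-hdd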




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Hi,

The servers are connected to the same switch.
I can ping from any one of the servers to the other servers
without packet loss, and the average round-trip time
is under 0.1 ms.

Thanks

Ashley Merrick  于2019年3月13日周三 下午12:06写道:

> Can you ping all your OSD servers from all your mons, and ping your mons
> from all your OSD servers?
>
> I’ve seen this where a route wasn’t working one direction, so it made OSDs
> flap when it used that mon to check availability:
>
> On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou  wrote:
>
>> After checking the network and syslog/dmesg, I think it's not a network
>> or hardware issue. Now there are some
>> OSDs being marked down every 15 minutes.
>>
>> here is ceph.log:
>> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 :
>> cluster [INF] Cluster is now healthy
>> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 :
>> cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6758 :
>> cluster [INF] osd.2 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6759 :
>> cluster [INF] osd.4 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6760 :
>> cluster [INF] osd.6 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6761 :
>> cluster [INF] osd.7 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6762 :
>> cluster [INF] osd.10 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6763 :
>> cluster [INF] osd.11 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6764 :
>> cluster [INF] osd.12 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6765 :
>> cluster [INF] osd.13 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6766 :
>> cluster [INF] osd.14 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6767 :
>> cluster [INF] osd.15 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706273 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6768 :
>> cluster [INF] osd.16 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706312 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6769 :
>> cluster [INF] osd.17 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706351 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6770 :
>> cluster [INF] osd.18 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706385 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6771 :
>> cluster [INF] osd.19 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706423 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6772 :
>> cluster [INF] osd.20 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706503 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6773 :
>> cluster [INF] osd.22 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706549 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6774 :
>> cluster [INF] osd.23 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706587 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6775 :
>> cluster [INF] osd.25 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706625 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6776 :
>> cluster [INF] osd.26 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706665 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6777 :
>> cluster [INF] osd.27 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706703 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6778 :
>> cluster [INF] osd.28 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706741 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6779 :
>> cluster [INF] osd.30 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706779 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6780 :
>> cluster [INF] osd.31 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706817 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6781 :
>> cluster [INF] osd.33 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706856 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6782 :
>> cluster [INF] osd.34 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706894 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6783 :
>> cluster [INF] osd.36 marked down after no beacon for 900.067020 seconds
>> 2019-03-13 11:21:21.706930 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6784 :
>> cluster [INF] osd.38 marked down after no beacon for 9

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Ashley Merrick
Can you ping all your OSD servers from all your mons, and ping your mons
from all your OSD servers?

I’ve seen this where a route wasn’t working in one direction, so it made OSDs
flap when they used that mon to check availability:

On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou  wrote:

> After checking the network and syslog/dmesg, I think it's not a network
> or hardware issue. Now there are some
> OSDs being marked down every 15 minutes.
>
> here is ceph.log:
> 2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 :
> cluster [INF] Cluster is now healthy
> 2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 :
> cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6758 :
> cluster [INF] osd.2 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6759 :
> cluster [INF] osd.4 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6760 :
> cluster [INF] osd.6 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6761 :
> cluster [INF] osd.7 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6762 :
> cluster [INF] osd.10 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6763 :
> cluster [INF] osd.11 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6764 :
> cluster [INF] osd.12 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6765 :
> cluster [INF] osd.13 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6766 :
> cluster [INF] osd.14 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6767 :
> cluster [INF] osd.15 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706273 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6768 :
> cluster [INF] osd.16 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706312 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6769 :
> cluster [INF] osd.17 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706351 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6770 :
> cluster [INF] osd.18 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706385 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6771 :
> cluster [INF] osd.19 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706423 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6772 :
> cluster [INF] osd.20 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706503 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6773 :
> cluster [INF] osd.22 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706549 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6774 :
> cluster [INF] osd.23 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706587 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6775 :
> cluster [INF] osd.25 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706625 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6776 :
> cluster [INF] osd.26 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706665 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6777 :
> cluster [INF] osd.27 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706703 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6778 :
> cluster [INF] osd.28 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706741 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6779 :
> cluster [INF] osd.30 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706779 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6780 :
> cluster [INF] osd.31 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706817 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6781 :
> cluster [INF] osd.33 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706856 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6782 :
> cluster [INF] osd.34 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706894 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6783 :
> cluster [INF] osd.36 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706930 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6784 :
> cluster [INF] osd.38 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.706974 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6785 :
> cluster [INF] osd.40 marked down after no beacon for 900.067020 seconds
> 2019-03-13 11:21:21.707013 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6786 :
> cluster [INF] osd.41 marked down after no beacon for 900.06702

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
After checking the network and syslog/dmesg, I think it's not a network or
hardware issue. Now there are some
OSDs being marked down every 15 minutes.

here is ceph.log:
2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 :
cluster [INF] Cluster is now healthy
2019-03-13 11:21:21.705787 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6757 :
cluster [INF] osd.1 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705858 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6758 :
cluster [INF] osd.2 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705920 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6759 :
cluster [INF] osd.4 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705957 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6760 :
cluster [INF] osd.6 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.705999 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6761 :
cluster [INF] osd.7 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706040 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6762 :
cluster [INF] osd.10 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706079 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6763 :
cluster [INF] osd.11 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706118 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6764 :
cluster [INF] osd.12 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706155 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6765 :
cluster [INF] osd.13 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706195 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6766 :
cluster [INF] osd.14 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706233 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6767 :
cluster [INF] osd.15 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706273 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6768 :
cluster [INF] osd.16 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706312 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6769 :
cluster [INF] osd.17 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706351 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6770 :
cluster [INF] osd.18 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706385 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6771 :
cluster [INF] osd.19 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706423 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6772 :
cluster [INF] osd.20 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706503 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6773 :
cluster [INF] osd.22 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706549 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6774 :
cluster [INF] osd.23 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706587 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6775 :
cluster [INF] osd.25 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706625 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6776 :
cluster [INF] osd.26 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706665 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6777 :
cluster [INF] osd.27 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706703 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6778 :
cluster [INF] osd.28 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706741 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6779 :
cluster [INF] osd.30 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706779 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6780 :
cluster [INF] osd.31 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706817 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6781 :
cluster [INF] osd.33 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706856 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6782 :
cluster [INF] osd.34 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706894 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6783 :
cluster [INF] osd.36 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706930 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6784 :
cluster [INF] osd.38 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.706974 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6785 :
cluster [INF] osd.40 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.707013 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6786 :
cluster [INF] osd.41 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.707051 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6787 :
cluster [INF] osd.42 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.707090 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6788 :
cluster [INF] osd.44 marked down after no beacon for 900.067020 seconds
2019-03-13 11:21:21.707128 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6789 :
cluster [INF] osd.45 marked down after no bea

[ceph-users] RBD Mirror Image Resync

2019-03-12 Thread Vikas Rana
Hi there,

 

We are replicating an RBD image from the primary to the DR site using RBD mirroring.

On Primary, we were using 10.2.10.

 

The DR site is Luminous, and we promoted the DR copy to test failover.
Everything checked out fine.

 

Now we are trying to restart the replication: we demoted the image and then
resynced it, but it has been stuck in the "starting_replay" state for the last
3 days. It's a 200TB RBD image.

 

:~# rbd --cluster cephdr mirror pool status nfs --verbose

health: WARNING

images: 1 total

1 starting_replay

 

dir_research:

  global_id:   3ad67d0c-e06b-406a-9469-4e5faedd09a4

  state:   down+unknown

  description: status not found

  last_update:

 

 

#rbd info nfs/dir_research

rbd image 'dir_research':

size 200TiB in 52428800 objects

order 22 (4MiB objects)

block_name_prefix: rbd_data.652186b8b4567

format: 2

features: layering, exclusive-lock, journaling

flags:

create_timestamp: Thu Feb  7 11:53:36 2019

journal: 652186b8b4567

mirroring state: disabling

mirroring global id: 3ad67d0c-e06b-406a-9469-4e5faedd09a4

mirroring primary: false

 

 

 

So the question is: how do we check the progress of the replay, how much of
it has already completed, and is there any ETA for when it will go back to
the OK state?
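
(For reference, a sketch of a per-image status query; once the rbd-mirror
daemon is actively replaying, its description usually includes progress hints
such as an "entries_behind_master" counter:)

$ rbd --cluster cephdr mirror image status nfs/dir_research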

 

Thanks,

-Vikas 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] S3 data on specific storage systems

2019-03-12 Thread Paul Emmerich
One pool per storage class is enough; you can share the metadata pools
across different placement policies.
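
A sketch of what an HDD-backed placement target could look like (Luminous-era
syntax; the placement id and pool names below are placeholders):

$ radosgw-admin zonegroup placement add --rgw-zonegroup default --placement-id hdd-placement
$ radosgw-admin zone placement add --rgw-zone default --placement-id hdd-placement \
      --data-pool default.rgw.hdd.buckets.data \
      --index-pool default.rgw.buckets.index \
      --data-extra-pool default.rgw.buckets.non-ec
$ radosgw-admin period update --commit    # if you use realms/periods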


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Mar 12, 2019 at 8:53 PM  wrote:
>
> Dear Ceph users,
>
> I have a cluster with SSD and HDD storage. I wonder how to configure S3
> buckets on HDD storage backends only.
> Do I need to create pools on this particular storage and define radosgw
> placement with those or there is a better or easier way to achieve this ?
>
> Regards,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount cephfs on ceph servers

2019-03-12 Thread Paul Emmerich
On Tue, Mar 12, 2019 at 8:56 PM David C  wrote:
>
> Out of curiosity, are you guys re-exporting the fs to clients over something 
> like nfs or running applications directly on the OSD nodes?

Kernel NFS + kernel CephFS can fall apart and deadlock itself in
exciting ways...

nfs-ganesha is so much better.

Paul

>
> On Tue, 12 Mar 2019, 18:28 Paul Emmerich,  wrote:
>>
>> Mounting kernel CephFS on an OSD node works fine with recent kernels
>> (4.14+) and enough RAM in the servers.
>>
>> We did encounter problems with older kernels though
>>
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>> On Tue, Mar 12, 2019 at 10:07 AM Hector Martin  wrote:
>> >
>> > It's worth noting that most containerized deployments can effectively
>> > limit RAM for containers (cgroups), and the kernel has limits on how
>> > many dirty pages it can keep around.
>> >
>> > In particular, /proc/sys/vm/dirty_ratio (default: 20) means at most 20%
>> > of your total RAM can be dirty FS pages. If you set up your containers
>> > such that the cumulative memory usage is capped below, say, 70% of RAM,
>> > then this might effectively guarantee that you will never hit this issue.
>> >
>> > On 08/03/2019 02:17, Tony Lill wrote:
>> > > AFAIR the issue is that under memory pressure, the kernel will ask
>> > > cephfs to flush pages, but that this in turn causes the osd (mds?) to
>> > > require more memory to complete the flush (for network buffers, etc). As
>> > > long as cephfs and the OSDs are feeding from the same kernel mempool,
>> > > you are susceptible. Containers don't protect you, but a full VM, like
>> > > xen or kvm? would.
>> > >
>> > > So if you don't hit the low memory situation, you will not see the
>> > > deadlock, and you can run like this for years without a problem. I have.
>> > > But you are most likely to run out of memory during recovery, so this
>> > > could compound your problems.
>> > >
>> > > On 3/7/19 3:56 AM, Marc Roos wrote:
>> > >>
>> > >>
>> > >> Container =  same kernel, problem is with processes using the same
>> > >> kernel.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> -Original Message-
>> > >> From: Daniele Riccucci [mailto:devs...@posteo.net]
>> > >> Sent: 07 March 2019 00:18
>> > >> To: ceph-users@lists.ceph.com
>> > >> Subject: Re: [ceph-users] mount cephfs on ceph servers
>> > >>
>> > >> Hello,
>> > >> is the deadlock risk still an issue in containerized deployments? For
>> > >> example with OSD daemons in containers and mounting the filesystem on
>> > >> the host machine?
>> > >> Thank you.
>> > >>
>> > >> Daniele
>> > >>
>> > >> On 06/03/19 16:40, Jake Grimmett wrote:
>> > >>> Just to add "+1" on this datapoint, based on one month usage on Mimic
>> > >>> 13.2.4 essentially "it works great for us"
>> > >>>
>> > >>> Prior to this, we had issues with the kernel driver on 12.2.2. This
>> > >>> could have been due to limited RAM on the osd nodes (128GB / 45 OSD),
>> > >>> and an older kernel.
>> > >>>
>> > >>> Upgrading the RAM to 256GB and using a RHEL 7.6 derived kernel has
>> > >>> allowed us to reliably use the kernel driver.
>> > >>>
>> > >>> We keep 30 snapshots ( one per day), have one active metadata server,
>> > >>> and change several TB daily - it's much, *much* faster than with fuse.
>> > >>>
>> > >>> Cluster has 10 OSD nodes, currently storing 2PB, using ec 8:2 coding.
>> > >>>
>> > >>> ta ta
>> > >>>
>> > >>> Jake
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> On 3/6/19 11:10 AM, Hector Martin wrote:
>> >  On 06/03/2019 12:07, Zhenshi Zhou wrote:
>> > > Hi,
>> > >
>> > > I'm gonna mount cephfs from my ceph servers for some reason,
>> > > including monitors, metadata servers and osd servers. I know it's
>> > > not a best practice. But what is the exact potential danger if I
>> > > mount cephfs from its own server?
>> > 
>> >  As a datapoint, I have been doing this on two machines (single-host
>> >  Ceph
>> >  clusters) for months with no ill effects. The FUSE client performs a
>> >  lot worse than the kernel client, so I switched to the latter, and
>> >  it's been working well with no deadlocks.
>> > 
>> > >>> ___
>> > >>> ceph-users mailing list
>> > >>> ceph-users@lists.ceph.com
>> > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >>>
>> > >> ___
>> > >> ceph-users mailing list
>> > >> ceph-users@lists.ceph.com
>> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >>
>> > >>
>> > >> ___
>> > >> ceph-users mailing list
>> > >> ceph-users@lists.ceph.com
>> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >>
>> > >
>> > >
>> > > ___
>> > > ceph-u

Re: [ceph-users] mount cephfs on ceph servers

2019-03-12 Thread Hector Martin
Both, in my case (same host; both local services and the NFS export use the
CephFS mount). I use the in-kernel NFS server (not nfs-ganesha).
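
A minimal sketch of such an export for the in-kernel NFS server (the path,
network and fsid below are illustrative only):

# /etc/exports
/mnt/cephfs  192.168.0.0/24(rw,no_subtree_check,fsid=100)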

On 13/03/2019 04.55, David C wrote:
> Out of curiosity, are you guys re-exporting the fs to clients over
> something like nfs or running applications directly on the OSD nodes? 
> 
> On Tue, 12 Mar 2019, 18:28 Paul Emmerich,  > wrote:
> 
> Mounting kernel CephFS on an OSD node works fine with recent kernels
> (4.14+) and enough RAM in the servers.
> 
> We did encounter problems with older kernels though
> 
> 
> Paul
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io 
> Tel: +49 89 1896585 90
> 
> On Tue, Mar 12, 2019 at 10:07 AM Hector Martin
> mailto:hec...@marcansoft.com>> wrote:
> >
> > It's worth noting that most containerized deployments can effectively
> > limit RAM for containers (cgroups), and the kernel has limits on how
> > many dirty pages it can keep around.
> >
> > In particular, /proc/sys/vm/dirty_ratio (default: 20) means at
> most 20%
> > of your total RAM can be dirty FS pages. If you set up your containers
> > such that the cumulative memory usage is capped below, say, 70% of
> RAM,
> > then this might effectively guarantee that you will never hit this
> issue.
> >
> > On 08/03/2019 02:17, Tony Lill wrote:
> > > AFAIR the issue is that under memory pressure, the kernel will ask
> > > cephfs to flush pages, but that this in turn causes the osd
> (mds?) to
> > > require more memory to complete the flush (for network buffers,
> etc). As
> > > long as cephfs and the OSDs are feeding from the same kernel
> mempool,
> > > you are susceptible. Containers don't protect you, but a full
> VM, like
> > > xen or kvm? would.
> > >
> > > So if you don't hit the low memory situation, you will not see the
> > > deadlock, and you can run like this for years without a problem.
> I have.
> > > But you are most likely to run out of memory during recovery, so
> this
> > > could compound your problems.
> > >
> > > On 3/7/19 3:56 AM, Marc Roos wrote:
> > >>
> > >>
> > >> Container =  same kernel, problem is with processes using the same
> > >> kernel.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -Original Message-
> > >> From: Daniele Riccucci [mailto:devs...@posteo.net
> ]
> > >> Sent: 07 March 2019 00:18
> > >> To: ceph-users@lists.ceph.com 
> > >> Subject: Re: [ceph-users] mount cephfs on ceph servers
> > >>
> > >> Hello,
> > >> is the deadlock risk still an issue in containerized
> deployments? For
> > >> example with OSD daemons in containers and mounting the
> filesystem on
> > >> the host machine?
> > >> Thank you.
> > >>
> > >> Daniele
> > >>
> > >> On 06/03/19 16:40, Jake Grimmett wrote:
> > >>> Just to add "+1" on this datapoint, based on one month usage
> on Mimic
> > >>> 13.2.4 essentially "it works great for us"
> > >>>
> > >>> Prior to this, we had issues with the kernel driver on 12.2.2.
> This
> > >>> could have been due to limited RAM on the osd nodes (128GB /
> 45 OSD),
> > >>> and an older kernel.
> > >>>
> > >>> Upgrading the RAM to 256GB and using a RHEL 7.6 derived kernel has
> > >>> allowed us to reliably use the kernel driver.
> > >>>
> > >>> We keep 30 snapshots ( one per day), have one active metadata
> server,
> > >>> and change several TB daily - it's much, *much* faster than
> with fuse.
> > >>>
> > >>> Cluster has 10 OSD nodes, currently storing 2PB, using ec 8:2
> coding.
> > >>>
> > >>> ta ta
> > >>>
> > >>> Jake
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 3/6/19 11:10 AM, Hector Martin wrote:
> >  On 06/03/2019 12:07, Zhenshi Zhou wrote:
> > > Hi,
> > >
> > > I'm gonna mount cephfs from my ceph servers for some reason,
> > > including monitors, metadata servers and osd servers. I know
> it's
> > > not a best practice. But what is the exact potential danger if I
> > > mount cephfs from its own server?
> > 
> >  As a datapoint, I have been doing this on two machines
> (single-host
> >  Ceph
> >  clusters) for months with no ill effects. The FUSE client
> performs a
> >  lot worse than the kernel client, so I switched to the
> latter, and
> >  it's been working well with no deadlocks.
> > 
> > >>> ___
> > >>> ceph-users mai

Re: [ceph-users] mount cephfs on ceph servers

2019-03-12 Thread David C
Out of curiosity, are you guys re-exporting the fs to clients over
something like nfs or running applications directly on the OSD nodes?

On Tue, 12 Mar 2019, 18:28 Paul Emmerich,  wrote:

> Mounting kernel CephFS on an OSD node works fine with recent kernels
> (4.14+) and enough RAM in the servers.
>
> We did encounter problems with older kernels though
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Tue, Mar 12, 2019 at 10:07 AM Hector Martin 
> wrote:
> >
> > It's worth noting that most containerized deployments can effectively
> > limit RAM for containers (cgroups), and the kernel has limits on how
> > many dirty pages it can keep around.
> >
> > In particular, /proc/sys/vm/dirty_ratio (default: 20) means at most 20%
> > of your total RAM can be dirty FS pages. If you set up your containers
> > such that the cumulative memory usage is capped below, say, 70% of RAM,
> > then this might effectively guarantee that you will never hit this issue.
> >
> > On 08/03/2019 02:17, Tony Lill wrote:
> > > AFAIR the issue is that under memory pressure, the kernel will ask
> > > cephfs to flush pages, but that this in turn causes the osd (mds?) to
> > > require more memory to complete the flush (for network buffers, etc).
> As
> > > long as cephfs and the OSDs are feeding from the same kernel mempool,
> > > you are susceptible. Containers don't protect you, but a full VM, like
> > > xen or kvm? would.
> > >
> > > So if you don't hit the low memory situation, you will not see the
> > > deadlock, and you can run like this for years without a problem. I
> have.
> > > But you are most likely to run out of memory during recovery, so this
> > > could compound your problems.
> > >
> > > On 3/7/19 3:56 AM, Marc Roos wrote:
> > >>
> > >>
> > >> Container =  same kernel, problem is with processes using the same
> > >> kernel.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -Original Message-
> > >> From: Daniele Riccucci [mailto:devs...@posteo.net]
> > >> Sent: 07 March 2019 00:18
> > >> To: ceph-users@lists.ceph.com
> > >> Subject: Re: [ceph-users] mount cephfs on ceph servers
> > >>
> > >> Hello,
> > >> is the deadlock risk still an issue in containerized deployments? For
> > >> example with OSD daemons in containers and mounting the filesystem on
> > >> the host machine?
> > >> Thank you.
> > >>
> > >> Daniele
> > >>
> > >> On 06/03/19 16:40, Jake Grimmett wrote:
> > >>> Just to add "+1" on this datapoint, based on one month usage on Mimic
> > >>> 13.2.4 essentially "it works great for us"
> > >>>
> > >>> Prior to this, we had issues with the kernel driver on 12.2.2. This
> > >>> could have been due to limited RAM on the osd nodes (128GB / 45 OSD),
> > >>> and an older kernel.
> > >>>
> > >>> Upgrading the RAM to 256GB and using a RHEL 7.6 derived kernel has
> > >>> allowed us to reliably use the kernel driver.
> > >>>
> > >>> We keep 30 snapshots ( one per day), have one active metadata server,
> > >>> and change several TB daily - it's much, *much* faster than with
> fuse.
> > >>>
> > >>> Cluster has 10 OSD nodes, currently storing 2PB, using ec 8:2 coding.
> > >>>
> > >>> ta ta
> > >>>
> > >>> Jake
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 3/6/19 11:10 AM, Hector Martin wrote:
> >  On 06/03/2019 12:07, Zhenshi Zhou wrote:
> > > Hi,
> > >
> > > I'm gonna mount cephfs from my ceph servers for some reason,
> > > including monitors, metadata servers and osd servers. I know it's
> > > not a best practice. But what is the exact potential danger if I
> > > mount cephfs from its own server?
> > 
> >  As a datapoint, I have been doing this on two machines (single-host
> >  Ceph
> >  clusters) for months with no ill effects. The FUSE client performs a
> >  lot worse than the kernel client, so I switched to the latter, and
> >  it's been working well with no deadlocks.
> > 
> > >>> ___
> > >>> ceph-users mailing list
> > >>> ceph-users@lists.ceph.com
> > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>
> > >>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>
> > >
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> > --
> > Hector Martin (hec...@marcansoft.com)
> > Public Key: https://mrcn.st/pub
> > ___
> > ceph-users mailing list
> > ceph-users

[ceph-users] S3 data on specific storage systems

2019-03-12 Thread Yannick.Martin
Dear Ceph users,

I have a cluster with SSD and HDD storage. I wonder how to configure S3 
buckets on HDD storage backends only.
Do I need to create pools on this particular storage and define radosgw 
placement with those, or is there a better or easier way to achieve this?

Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount cephfs on ceph servers

2019-03-12 Thread Paul Emmerich
Mounting kernel CephFS on an OSD node works fine with recent kernels
(4.14+) and enough RAM in the servers.

We did encounter problems with older kernels though


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Mar 12, 2019 at 10:07 AM Hector Martin  wrote:
>
> It's worth noting that most containerized deployments can effectively
> limit RAM for containers (cgroups), and the kernel has limits on how
> many dirty pages it can keep around.
>
> In particular, /proc/sys/vm/dirty_ratio (default: 20) means at most 20%
> of your total RAM can be dirty FS pages. If you set up your containers
> such that the cumulative memory usage is capped below, say, 70% of RAM,
> then this might effectively guarantee that you will never hit this issue.
>
> On 08/03/2019 02:17, Tony Lill wrote:
> > AFAIR the issue is that under memory pressure, the kernel will ask
> > cephfs to flush pages, but that this in turn causes the osd (mds?) to
> > require more memory to complete the flush (for network buffers, etc). As
> > long as cephfs and the OSDs are feeding from the same kernel mempool,
> > you are susceptible. Containers don't protect you, but a full VM, like
> > xen or kvm? would.
> >
> > So if you don't hit the low memory situation, you will not see the
> > deadlock, and you can run like this for years without a problem. I have.
> > But you are most likely to run out of memory during recovery, so this
> > could compound your problems.
> >
> > On 3/7/19 3:56 AM, Marc Roos wrote:
> >>
> >>
> >> Container =  same kernel, problem is with processes using the same
> >> kernel.
> >>
> >>
> >>
> >>
> >>
> >>
> >> -Original Message-
> >> From: Daniele Riccucci [mailto:devs...@posteo.net]
> >> Sent: 07 March 2019 00:18
> >> To: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] mount cephfs on ceph servers
> >>
> >> Hello,
> >> is the deadlock risk still an issue in containerized deployments? For
> >> example with OSD daemons in containers and mounting the filesystem on
> >> the host machine?
> >> Thank you.
> >>
> >> Daniele
> >>
> >> On 06/03/19 16:40, Jake Grimmett wrote:
> >>> Just to add "+1" on this datapoint, based on one month usage on Mimic
> >>> 13.2.4 essentially "it works great for us"
> >>>
> >>> Prior to this, we had issues with the kernel driver on 12.2.2. This
> >>> could have been due to limited RAM on the osd nodes (128GB / 45 OSD),
> >>> and an older kernel.
> >>>
> >>> Upgrading the RAM to 256GB and using a RHEL 7.6 derived kernel has
> >>> allowed us to reliably use the kernel driver.
> >>>
> >>> We keep 30 snapshots ( one per day), have one active metadata server,
> >>> and change several TB daily - it's much, *much* faster than with fuse.
> >>>
> >>> Cluster has 10 OSD nodes, currently storing 2PB, using ec 8:2 coding.
> >>>
> >>> ta ta
> >>>
> >>> Jake
> >>>
> >>>
> >>>
> >>>
> >>> On 3/6/19 11:10 AM, Hector Martin wrote:
>  On 06/03/2019 12:07, Zhenshi Zhou wrote:
> > Hi,
> >
> > I'm gonna mount cephfs from my ceph servers for some reason,
> > including monitors, metadata servers and osd servers. I know it's
> > not a best practice. But what is the exact potential danger if I
> > mount cephfs from its own server?
> 
>  As a datapoint, I have been doing this on two machines (single-host
>  Ceph
>  clusters) for months with no ill effects. The FUSE client performs a
>  lot worse than the kernel client, so I switched to the latter, and
>  it's been working well with no deadlocks.
> 
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Safe to remove objects from default.rgw.meta ?

2019-03-12 Thread Dan van der Ster
Answering my own question (getting help from Pavan), I see that all
the details are in this PR: https://github.com/ceph/ceph/pull/11051

So, the zone was updated to set metadata_heap: "" with

$ radosgw-admin zone get --rgw-zone=default > zone.json
[edit zone.json]
$ radosgw-admin zone set --rgw-zone=default --infile=zone.json

and now I can safely remove the default.rgw.meta pool.
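
For the removal itself, a sketch (destructive, and it requires
mon_allow_pool_delete=true, so double-check the pool name first):

$ rados -p default.rgw.meta ls | head
$ ceph osd pool delete default.rgw.meta default.rgw.meta --yes-i-really-really-mean-it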

-- Dan


On Tue, Mar 12, 2019 at 3:17 PM Dan van der Ster  wrote:
>
> Hi all,
>
> We have an S3 cluster with >10 million objects in default.rgw.meta.
>
> # radosgw-admin zone get | jq .metadata_heap
> "default.rgw.meta"
>
> In these old tickets I realized that this setting is obsolete, and
> those objects are probably useless:
>http://tracker.ceph.com/issues/17256
>http://tracker.ceph.com/issues/18174
>
> We will clear the metadata_heap setting in the zone json, but then can
> we simply `rados rm` all the objects in the default.rgw.meta pool?
>
> The objects seem to come in three flavours:
>
>.meta:user:dvanders:_KpWMw94jrX75PgAfhDymKTo:2
>.meta:bucket:atlas-eventservice:_byPmpJS9V9l7DULEVxlDC2A:1
>
> .meta:bucket.instance:atlas-eventservice:61c59385-085d-4caa-9070-63a3868dccb6.3191998.599860:_PQCKPJVTzvtwgU41Dw0Cdx6:1
>
> Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif

I bet you'd see better memstore results with my vector based object
implementation instead of bufferlists.


Where can I find it?


Nick Fisk noticed the same
thing you did.  One interesting observation he made was that disabling
CPU C/P states helped bluestore immensely in the iodepth=1 case.


This is exactly what I've done by `cpupower idle-set -D 0`. It basically 
increases iops 2-3 times.


Pipelined writes were added in rocksdb 5.5.1 back in the summer of 
2017.  That wasn't available when bluestore was being written.


In fact ... it slightly confuses me because even now bluestore IS 
writing to rocksdb from multiple threads sometimes. It's when 
bluestore_sync_submit_transaction is on and the write doesn't require 
aio (= when it's a deferred write) and when it holds several other 
conditions. It calls db->submit_transaction from the tp_osd_tp thread 
then.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Safe to remove objects from default.rgw.meta ?

2019-03-12 Thread Dan van der Ster
Hi all,

We have an S3 cluster with >10 million objects in default.rgw.meta.

# radosgw-admin zone get | jq .metadata_heap
"default.rgw.meta"

In these old tickets I realized that this setting is obsolete, and
those objects are probably useless:
   http://tracker.ceph.com/issues/17256
   http://tracker.ceph.com/issues/18174

We will clear the metadata_heap setting in the zone json, but then can
we simply `rados rm` all the objects in the default.rgw.meta pool?

The objects seem to come in three flavours:

   .meta:user:dvanders:_KpWMw94jrX75PgAfhDymKTo:2
   .meta:bucket:atlas-eventservice:_byPmpJS9V9l7DULEVxlDC2A:1
   
.meta:bucket.instance:atlas-eventservice:61c59385-085d-4caa-9070-63a3868dccb6.3191998.599860:_PQCKPJVTzvtwgU41Dw0Cdx6:1

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread Mark Nelson


On 3/12/19 8:40 AM, vita...@yourcmc.ru wrote:

One way or another we can only have a single thread sending writes to
rocksdb.  A lot of the prior optimization work on the write side was
to get as much processing out of the kv_sync_thread as possible.
That's still a worthwhile goal as it's typically what bottlenecks with
high amounts of concurrency.  What I think would be very interesting
though is if we moved more toward a model where we had lots of shards
(OSDs or shards of an OSD) with independent rocksdb instances and less
threading overhead per shard.  That's the way the seastar work is
going, and also sort of the model I've been thinking about for a very
simple single-threaded OSD.


Doesn't rocksdb have pipelined writes? Isn't it better to just use 
that builtin concurrency instead of factoring in your own?



Pipelined writes were added in rocksdb 5.5.1 back in the summer of 
2017.  That wasn't available when bluestore was being written. We may be 
able to make use of it now but I don't think anyone has taken the time 
to figure out how much work it would take or what kind of benefit we 
would get.



Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph block storage - block.db useless? [solved]

2019-03-12 Thread Benjamin Zapiec
Yeah, thank you xD

You just answered another thread where I asked about the kv_sync_thread.
Consider this done; I know what to do now.

Thank you



On 12.03.19 at 14:43, Mark Nelson wrote:
> Our default of 4 256MB WAL buffers is arguably already too big. On one
> hand we are making these buffers large to hopefully avoid short lived
> data going into the DB (pglog writes).  IE if a pglog write comes in and
> later a tombstone invalidating it comes in, we really want those to land
> in the same WAL log to avoid that write being propagated into the DB. 
> On the flip side, large buffers mean that there's more work that rocksdb
> has to perform to compare keys to get everything ordered.  This is done
> in the kv_sync_thread where we often bottleneck on small random write
> workloads:
> 
> 
>     | | |   |   |   | + 13.30% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert
> 
> So on one hand we want large buffers to avoid short lived data going
> into the DB, and on the other hand we want small buffers to avoid large
> amounts of comparisons eating CPU, especially in CPU limited environments.
> 
> 
> Mark
> 
> 
> 
> On 3/12/19 8:25 AM, Benjamin Zapiec wrote:
>> May I configure the size of WAL to increase block.db usage?
>> For example I configure 20GB I would get an usage of about 48GB on L3.
>>
>> Or should I stay with ceph defaults?
>> Is there a maximal size for WAL that makes sense?
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Mark Nelson
Our default of 4 256MB WAL buffers is arguably already too big. On one 
hand we are making these buffers large to hopefully avoid short lived 
data going into the DB (pglog writes).  IE if a pglog write comes in and 
later a tombstone invalidating it comes in, we really want those to land 
in the same WAL log to avoid that write being propagated into the DB.  
On the flip side, large buffers mean that there's more work that rocksdb 
has to perform to compare keys to get everything ordered.  This is done 
in the kv_sync_thread where we often bottleneck on small random write 
workloads:



    | | |   |   |   | + 13.30% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert


So on one hand we want large buffers to avoid short lived data going 
into the DB, and on the other hand we want small buffers to avoid large 
amounts of comparisons eating CPU, especially in CPU limited environments.
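
For reference, these buffers are controlled via bluestore's rocksdb option
string; a sketch of only the two buffer-related knobs being discussed
(ceph.conf syntax; the actual default string contains additional settings):

[osd]
bluestore_rocksdb_options = max_write_buffer_number=4,write_buffer_size=268435456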



Mark



On 3/12/19 8:25 AM, Benjamin Zapiec wrote:

May I configure the size of WAL to increase block.db usage?
For example I configure 20GB I would get an usage of about 48GB on L3.

Or should I stay with ceph defaults?
Is there a maximal size for WAL that makes sense?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Benjamin Zapiec
Sorry, I meant L2.


On 12.03.19 at 14:25, Benjamin Zapiec wrote:
> May I configure the size of WAL to increase block.db usage?
> For example I configure 20GB I would get an usage of about 48GB on L3.
> 
> Or should I stay with ceph defaults?
> Is there a maximal size for WAL that makes sense?
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Benjamin Zapiec  (System Engineer)
* GONICUS GmbH * Moehnestrasse 55 (Kaiserhaus) * D-59755 Arnsberg
* Tel.: +49 2932 916-0 * Fax: +49 2932 916-245
* http://www.GONICUS.de

* Registered office: Moehnestrasse 55 * D-59755 Arnsberg
* Managing Directors: Rainer Luelsdorf, Alfred Schroeder
* Chairman of the Advisory Board: Juergen Michels
* Arnsberg Local Court * HRB 1968



We fulfil our data-protection information obligations under Articles 13 and 14
of the GDPR by publishing the relevant information on our website at
https://www.gonicus.de/datenschutz or by sending it to you upon informal request.



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif

One way or another we can only have a single thread sending writes to
rocksdb.  A lot of the prior optimization work on the write side was
to get as much processing out of the kv_sync_thread as possible. 
That's still a worthwhile goal as it's typically what bottlenecks with
high amounts of concurrency.  What I think would be very interesting
though is if we moved more toward a model where we had lots of shards
(OSDs or shards of an OSD) with independent rocksdb instances and less
threading overhead per shard.  That's the way the seastar work is
going, and also sort of the model I've been thinking about for a very
simple single-threaded OSD.


Doesn't rocksdb have pipelined writes? Isn't it better to just use that 
builtin concurrency instead of factoring in your own?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread Mark Nelson


On 3/12/19 7:31 AM, vita...@yourcmc.ru wrote:

Decreasing the min_alloc size isn't always a win, but it can be in some
cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we
increased it to 16384 because at the time our metadata path was slow
and increasing it resulted in a pretty significant performance win
(along with increasing the WAL buffers in rocksdb to reduce write
amplification).  Since then we've improved the metadata path to the
point where at least on our test nodes performance is pretty close
between with min_alloc size = 16k and min_alloc size = 4k the last
time I looked.  It might be a good idea to drop it down to 4k now but
I think we need to be careful because there are tradeoffs.


I think it's all about your disks' latency. Deferred write is 1 
IO+sync and redirect-write is 2 IOs+syncs. So if your IO or sync is 
slow (like it is on HDDs and bad SSDs) then the deferred write is 
better in terms of latency. If your IO is fast then you're only 
bottlenecked by the OSD code itself eating a lot of CPU and then 
direct write may be better. By the way, I think OSD itself is way TOO 
slow currently (see below).



Don't disagree, bluestore's write path has gotten *really* complicated.




The idea I was talking about turned out to be viable only for HDDs/slow 
SSDs and only for low iodepths. But the gain is huge: somewhere 
between +50% and +100% iops (2x lower latency). There is a stupid 
problem in the current bluestore implementation which makes it do 2 
journal writes and fsyncs instead of one for every incoming 
transaction. The details are here: https://tracker.ceph.com/issues/38559


The unnecessary commit is BlueFS's WAL. All it's doing is 
recording the increased size of a RocksDB WAL file, which obviously 
shouldn't be required with RocksDB, as its default setting is 
"kTolerateCorruptedTailRecords". However, without this setting the WAL 
is not synced to the disk with every write because by some clever 
logic sync_file_range is called only with SYNC_FILE_RANGE_WRITE in the 
corresponding piece of code. Thus the OSD's database gets corrupted 
when you kill it with -9 and thus it's impossible to set 
`bluefs_preextend_wal_files` to true. And thus you get two writes and 
commits instead of one.


I don't know the exact idea behind doing only SYNC_FILE_RANGE_WRITE - 
as I understand there is currently no benefit in doing this. It could 
be a benefit if RocksDB was writing journal in small parts and then 
doing a single sync - but it's always flushing the newly written part 
of a journal to disk as a whole.


The simplest way to fix it is just to add SYNC_FILE_RANGE_WAIT_BEFORE 
and SYNC_FILE_RANGE_WAIT_AFTER to sync_file_range in KernelDevice.cc. 
My pull request is here: https://github.com/ceph/ceph/pull/26909 - 
I've tested this change with 13.2.4 Mimic and 14.1.0 Nautilus and yes, 
it does increase single-thread iops on HDDs two times (!). After this 
change BlueStore becomes actually better than FileStore at least on HDDs.


Another way of fixing it would be to add an explicit bdev->flush at 
the end of the kv_sync_thread, after db->submit_transaction_sync(), 
and possibly remove the redundant sync_file_range at all. But then you 
must do the same in another place in _txc_state_proc, because it's 
also sometimes doing submit_transaction_sync(). In the end I 
personally think that to add flags to sync_file_range is better 
because a function named "submit_transaction_sync" should be in fact 
SYNC! It shouldn't require additional steps from the caller to make 
the data durable.



I'm glad you are peaking under the covers here. :)  There's a lot going 
on here, and it's not immediate obvious what the intent is and the 
failure conditions are.  I suspect the intent here was to error on the 
side of caution but we really need to document this better.  To be fair 
it's not just us, there's confusion and terribleness all the way up to 
the kernel and beyond.





Also I have a small funny test result to share.

I've created one OSD on my laptop on a loop device in a tmpfs (i.e. 
RAM), created 1 RBD image inside it and tested it with `fio 
-ioengine=rbd -direct=1 -bs=4k -rw=randwrite`. Before doing the test 
I've turned off CPU power saving with `cpupower idle-set -D 0`.


The results are:
- filestore: 2200 iops with -iodepth=1 (0.454ms average latency). 8500 
iops with -iodepth=128.
- bluestore: 1800 iops with -iodepth=1 (0.555ms average latency). 9000 
iops with -iodepth=128.
- memstore: 3000 iops with -iodepth=1 (0.333ms average latency). 11000 
iops with -iodepth=128.


If we can think of memstore being a "minimal possible /dev/null" then:
- OSD overhead is 1/3000 = 0.333ms (maybe slightly less, but that 
doesn't matter).

- filestore overhead is 1/2200-1/3000 = 0.121ms
- bluestore overhead is 1/1800-1/3000 = 0.222ms

The conclusion is that bluestore is actually almost TWO TIMES slower 
than filestore in terms of pure latency, and the throughput is only 
slightly better.
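
(Just to spell out the arithmetic above - a throwaway sketch, using the 
-iodepth=1 numbers from the fio runs, that treats memstore as the baseline and 
derives the extra per-op latency each store adds:)

#include <cstdio>

int main() {
    // -iodepth=1 results from the fio runs above.
    const double memstore_iops  = 3000.0;   // baseline "minimal /dev/null" OSD
    const double filestore_iops = 2200.0;
    const double bluestore_iops = 1800.0;

    // Average per-op latency in milliseconds at queue depth 1.
    auto latency_ms = [](double iops) { return 1000.0 / iops; };

    std::printf("filestore overhead: %.3f ms\n",
                latency_ms(filestore_iops) - latency_ms(memstore_iops)); // ~0.121
    std::printf("bluestore overhead: %.3f ms\n",
                latency_ms(bluestore_iops) - latency_ms(memstore_iops)); // ~0.222
    return 0;
}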

Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Benjamin Zapiec
May I configure the size of the WAL to increase block.db usage?
For example, if I configure 20GB, would I get a usage of about 48GB on L3?

Or should I stay with the Ceph defaults?
Is there a maximum WAL size that makes sense?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Mark Nelson


On 3/12/19 7:24 AM, Benjamin Zapiec wrote:

Hello,

I was wondering why my Ceph block.db is nearly empty and started
to investigate.

The recommendation from Ceph is that block.db should be at least
4% of the size of block. So my OSD configuration looks like this:

wal.db   - not explicitly specified
block.db - 250GB of SSD storage
block    - 6TB



By default we currently use 4 256MB WAL buffers.  2GB should be enough, 
though in most cases you are better off just leaving it on block.db as 
you did below.




Since the WAL is written to block.db if no separate device is given, I didn't
configure a wal.db. With a size of 250GB we are slightly above 4%.



WAL will only use about 1GB of that FWIW




So everything should be "fine". But the block.db only contains
about 10GB of data.



If this is an RBD workload, that's quite possible as RBD tends to use 
far less metadata than RGW.





I figured out that an object in block.db gets "amplified", so
the space consumption is much higher than the object itself
would need.



Data in the DB in general will suffer space amplification and it gets 
worse the more levels in rocksdb you have as multiple levels may have 
copies of the same data at different points in time.  The bigger issue 
is that currently an entire level has to fit on the DB device.  IE if 
level 0 takes 1GB, level 1 takes 10GB, level 2 takes 100GB, and level 3 
takes 1000GB, you will only get 0, 1 and 2 on block.db with 250GB.





I'm using Ceph as the storage backend for OpenStack, and raw images
with a size of 10GB and more are common. So if I understand
this correctly, I have to consider that a 10GB image may
consume 100GB of block.db.



The DB holds metadata for the images (and some metadata for bluestore).  
This is going to be a very small fraction of the overall data size but 
is really important.  Whenever we do a write to an object we first try 
to read some metadata about it (if it exists).  Having those read 
attempts happen quickly is really important to make sure that the write 
happens quickly.





Besides the fact that the images may have a size of 100G and
are only used for initial reads until all changed
blocks get written to an SSD-only pool, I was asking myself
whether I need a block.db at all, and whether it would be better to
save the amount of SSD space used for block.db and just
create a 10GB wal.db.



See above.  Also, rocksdb periodically has to compact data and with lots 
of metadata (and as a result lots of levels) it can get pretty slow.  
Having rocksdb on fast storage helps speed that process up and avoid 
write stalls due to level0 compaction (higher level compaction can 
happen in alternate threads).





Has anyone done this before? Anyone who had sufficient SSD space
but stuck with wal.db to save SSD space?

If I'm correct, the block.db will never be used for huge images.
And even if it were used for one or two images, would this make
sense? The images are used initially to read all unchanged blocks from
them. After a while each VM should access the images pool less and
less due to the changes made in the VM.



The DB is there primarily to store metadata.  RBD doesn't use a lot of 
space but may do a lot of reads from the DB if it can't keep all of the 
bluestore onodes in its own in-memory cache (the kv cache).  RGW uses 
the DB much more heavily and in some cases you may see 40-50% space 
usage if you have tiny RGW objects (~4KB).  See this spreadsheet for 
more info:



https://drive.google.com/file/d/1Ews2WR-y5k3TMToAm0ZDsm7Gf_fwvyFw/view?usp=sharing


Mark




Any thoughts about this?


Best regards


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Chasing slow ops in mimic

2019-03-12 Thread Alex Litvak

I looked further into historic slow ops (thanks to some other posts on the 
list) and I am confused a bit with the following event

{
"description": "osd_repop(client.85322.0:86478552 7.1b e502/466 
7:d8d149b7:::rbd_data.ff7e3d1b58ba.0316:head v 502'10665506)",
"initiated_at": "2019-03-08 07:53:23.673807",
"age": 335669.547018,
"duration": 13.328475,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"events": [
{
"time": "2019-03-08 07:53:23.673807",
"event": "initiated"
},
{
"time": "2019-03-08 07:53:23.673807",
"event": "header_read"
},
{
"time": "2019-03-08 07:53:23.673808",
"event": "throttled"
},
{
"time": "2019-03-08 07:53:37.001601",
"event": "all_read"
},
{
"time": "2019-03-08 07:53:37.001643",
"event": "dispatched"
},
{
"time": "2019-03-08 07:53:37.001649",
"event": "queued_for_pg"
},
{
"time": "2019-03-08 07:53:37.001679",
"event": "reached_pg"
},
{
"time": "2019-03-08 07:53:37.001699",
"event": "started"
},
{
"time": "2019-03-08 07:53:37.002208",
"event": "commit_sent"
},
{
"time": "2019-03-08 07:53:37.002282",
"event": "done"
}
]
}
},

It just tells me "throttled", nothing else.  What does throttled mean in this case?
I see some events where an OSD is waiting for a response from its partners for a 
specific PG; while those can be attributed to a network issue, the throttled ones 
are not as clear cut.

Appreciate any clues,

On 3/11/2019 4:26 PM, Alex Litvak wrote:

Hello Cephers,

I am trying to find the cause of multiple slow ops that happened with my small 
cluster.  I have a 3-node cluster with 9 OSDs

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
128 GB RAM
Each OSD is SSD Intel DC-S3710 800GB
It runs mimic 13.2.2 in containers.

Cluster was operating normally for 4 month and then recently I had an outage 
with multiple VMs (RBD) showing

Mar  8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.243812] INFO: task 
xfsaild/vda1:404 blocked for more than 120 seconds.
Mar  8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.243957] Not tainted 
4.19.5-1.el7.elrepo.x86_64 #1
Mar  8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.244063] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  8 07:59:42 sbc12n2-chi.siptalk.com kernel: [140206.244181] xfsaild/vda1    
D    0   404  2 0x8000

After examining ceph logs, i found following entries in multiple OSDs
Mar  8 07:38:52 storage1n2-chi ceph-osd-run.sh[20939]: 2019-03-08 07:38:52.299 7fe0bdb8f700 -1 osd.13 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.148553.0:5996289 7.fe 
7:7f0ebfe2:::rbd_data.17bab2eb141f2.023d:head [stat,write 2588672~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
Mar  8 07:38:53 storage1n2-chi ceph-osd-run.sh[20939]: 2019-03-08 07:38:53.347 7fe0bdb8f700 -1 osd.13 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.148553.0:5996289 7.fe 
7:7f0ebfe2:::rbd_data.17bab2eb141f2.


Mar  8 07:43:05 storage1n2-chi ceph-osd-run.sh[28089]: 2019-03-08 07:43:05.360 7f32536bd700 -1 osd.7 502 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.152215.0:7037343 7.1e 
7:78d776e4:::rbd_data.27e662eb141f2.0436:head [stat,write 393216~16384] snapc 0=[] ondisk+write+known_if_redirected e502)
Mar  8 07:43:06 storage1n2-chi ceph-osd-run.sh[28089]: 2019-03-08 07:43:06.332 7f32536bd700 -1 osd.7 502 get_health_metrics reporting 2 slow ops, oldest is osd_op(client.152215.0:7037343 7.1e 
7:78d776e4:::rbd_data.27e662eb141f2.0436:head [stat,write 393216~16384] snapc 0=[] ondisk+write+known_if_redirected e502)


The messages were showing on all nodes and affecting several osds on each node.

The trouble started at approximately 07:30 am and ended 30 minutes later. I have not seen any slow ops since then, nor have the VMs shown kernel hangups.  Here is my ceph status.  I also want to note 
that the load on the cluster was minimal at the time.  Please let me know where I could start looking as the cluster cannot be in production wi

Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread vitalif
The amount of metadata depends on the amount of data. But RocksDB only 
puts metadata on the fast storage when it thinks all metadata on 
the same level of the DB is going to fit there. So all sizes except 4, 
30, and 286 GB are useless.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Benjamin Zapiec
Okay, so I think I don't understand the mechanism by which Ceph's RocksDB
decides whether to place data on block.db or not.

So the amount of data in block.db depends on the WAL size?
I thought it depends on the objects saved to the storage.
In that case, say we have a 1GB file, it would have a size
of 10GB in L2.

But if it depends on the WAL, I would get the same benefit using
a block.db with a size of 30GB instead of 250GB. Is that correct?


Best regards

> block.db is very unlikely to ever grow to 250GB with a 6TB data device.
>
> However, there seems to be a funny "issue" with all block.db sizes
> except 4, 30, and 286 GB being useless, because RocksDB puts the data on
> the fast storage only if it thinks the whole LSM level will fit there.
> Ceph's RocksDB options set WAL to 1GB and leave the default
> max_bytes_for_level_base unchanged so it's 256MB. Multiplier is also
> left at 10. So WAL=1GB, L1=256MB, L2=2560MB, L3=25600MB. So RocksDB will
> put L2 to the block.db only if the block.db's size exceeds
> 1GB+256MB+2560MB (which rounds up to 4GB), and it will put L3 to the
> block.db only if its size exceeds 1GB+256MB+2560MB+25600MB = almost 30GB.
>
>> Hello,
>>
>> i was wondering about ceph block.db to be nearly empty and I started
>> to investigate.
>>
>> The recommendations from ceph are that block.db should be at least
>> 4% the size of block. So my OSD configuration looks like this:
>>
>> wal.db   - not explicit specified
>> block.db - 250GB of SSD storage
>> block- 6TB
>>
>> Since wal is written to block.db if not available i didn't configured
>> wal. With the size of 250GB we are slightly above 4%.
>>
>> So everything should be "fine". But the block.db only contains
>> about 10GB of data.
>>
>> If figured out that an object in block.db gets "amplified" so
>> the space consumption is much higher than the object itself
>> would need.
>>
>> I'm using ceph as storage backend for openstack and raw images
>> with a size of 10GB and more are common. So if i understand
>> this correct i have to consider that a 10GB images may
>> consume 100GB of block.db.
>>
>> Beside the facts that the image may have a size of 100G and
>> they are only used for initial reads unitl all changed
>> blocks gets written to a SSD-only pool i was question me
>> if i need a block.db and if it would be better to
>> save the amount of SSD space used for block.db and just
>> create a 10GB wal.db?
>>
>> Has anyone done this before? Anyone who had sufficient SSD space
>> but stick with wal.db to save SSD space?
>>
>> If i'm correct the block.db will never be used for huge images.
>> And even though it may be used for one or two images does this make
>> sense? The images are used initially to read all unchanged blocks from
>> it. After a while each VM should access the images pool less and
>> less due to the changes made in the VM.
>>
>>
>> Any thoughts about this?
>>
>>
>> Best regards
>>
>> --
>> Benjamin Zapiec  (System Engineer)
>> * GONICUS GmbH * Moehnestrasse 55 (Kaiserhaus) * D-59755 Arnsberg
>> * Tel.: +49 2932 916-0 * Fax: +49 2932 916-245
>> * http://www.GONICUS.de
>>
>> * Sitz der Gesellschaft: Moehnestrasse 55 * D-59755 Arnsberg
>> * Geschaeftsfuehrer: Rainer Luelsdorf, Alfred Schroeder
>> * Vorsitzender des Beirats: Juergen Michels
>> * Amtsgericht Arnsberg * HRB 1968
>>
>>
>>
>> Wir erfüllen unsere Informationspflichten zum Datenschutz gem. der
>> Artikel 13
>>
>> und 14 DS-GVO durch Veröffentlichung auf unserer Internetseite unter:
>>
>> https://www.gonicus.de/datenschutz oder durch Zusendung auf Ihre
>> formlose Anfrage.
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Benjamin Zapiec  (System Engineer)
* GONICUS GmbH * Moehnestrasse 55 (Kaiserhaus) * D-59755 Arnsberg
* Tel.: +49 2932 916-0 * Fax: +49 2932 916-245
* http://www.GONICUS.de

* Sitz der Gesellschaft: Moehnestrasse 55 * D-59755 Arnsberg
* Geschaeftsfuehrer: Rainer Luelsdorf, Alfred Schroeder
* Vorsitzender des Beirats: Juergen Michels
* Amtsgericht Arnsberg * HRB 1968



Wir erfüllen unsere Informationspflichten zum Datenschutz gem. der
Artikel 13

und 14 DS-GVO durch Veröffentlichung auf unserer Internetseite unter:

https://www.gonicus.de/datenschutz oder durch Zusendung auf Ihre
formlose Anfrage.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread vitalif

block.db is very unlikely to ever grow to 250GB with a 6TB data device.

However, there seems to be a funny "issue" with all block.db sizes 
except 4, 30, and 286 GB being useless, because RocksDB puts the data on 
the fast storage only if it thinks the whole LSM level will fit there. 
Ceph's RocksDB options set WAL to 1GB and leave the default 
max_bytes_for_level_base unchanged so it's 256MB. Multiplier is also 
left at 10. So WAL=1GB, L1=256MB, L2=2560MB, L3=25600MB. So RocksDB will 
put L2 to the block.db only if the block.db's size exceeds 
1GB+256MB+2560MB (which rounds up to 4GB), and it will put L3 to the 
block.db only if its size exceeds 1GB+256MB+2560MB+25600MB = almost 
30GB.
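
If it helps, here is a throwaway sketch of where those thresholds come from, 
assuming the defaults quoted above (1GB WAL, 256MB max_bytes_for_level_base, 
multiplier 10); the cumulative sums are what block.db has to be able to hold 
before RocksDB will move the next level there:

#include <cstdio>

int main() {
    const double wal_mb        = 1024.0;  // RocksDB WAL as configured by Ceph
    const double level_base_mb = 256.0;   // max_bytes_for_level_base (default)
    const double multiplier    = 10.0;    // max_bytes_for_level_multiplier (default)

    double cumulative_mb = wal_mb;
    double level_mb = level_base_mb;
    for (int level = 1; level <= 4; ++level) {
        cumulative_mb += level_mb;
        // block.db must hold the WAL plus every level up to and including this one.
        std::printf("L%d = %8.0f MB -> block.db needs ~%.1f GB to hold WAL..L%d\n",
                    level, level_mb, cumulative_mb / 1024.0, level);
        level_mb *= multiplier;
    }
    return 0;
}

That prints roughly 1.3, 3.8, 28.8 and 278.8 GB, which (give or take unit 
rounding) is where the 4 / 30 / ~286 GB figures come from.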



Hello,

i was wondering about ceph block.db to be nearly empty and I started
to investigate.

The recommendations from ceph are that block.db should be at least
4% the size of block. So my OSD configuration looks like this:

wal.db   - not explicit specified
block.db - 250GB of SSD storage
block- 6TB

Since wal is written to block.db if not available i didn't configured
wal. With the size of 250GB we are slightly above 4%.

So everything should be "fine". But the block.db only contains
about 10GB of data.

If figured out that an object in block.db gets "amplified" so
the space consumption is much higher than the object itself
would need.

I'm using ceph as storage backend for openstack and raw images
with a size of 10GB and more are common. So if i understand
this correct i have to consider that a 10GB images may
consume 100GB of block.db.

Beside the facts that the image may have a size of 100G and
they are only used for initial reads unitl all changed
blocks gets written to a SSD-only pool i was question me
if i need a block.db and if it would be better to
save the amount of SSD space used for block.db and just
create a 10GB wal.db?

Has anyone done this before? Anyone who had sufficient SSD space
but stick with wal.db to save SSD space?

If i'm correct the block.db will never be used for huge images.
And even though it may be used for one or two images does this make
sense? The images are used initially to read all unchanged blocks from
it. After a while each VM should access the images pool less and
less due to the changes made in the VM.


Any thoughts about this?


Best regards

--
Benjamin Zapiec  (System Engineer)
* GONICUS GmbH * Moehnestrasse 55 (Kaiserhaus) * D-59755 Arnsberg
* Tel.: +49 2932 916-0 * Fax: +49 2932 916-245
* http://www.GONICUS.de

* Sitz der Gesellschaft: Moehnestrasse 55 * D-59755 Arnsberg
* Geschaeftsfuehrer: Rainer Luelsdorf, Alfred Schroeder
* Vorsitzender des Beirats: Juergen Michels
* Amtsgericht Arnsberg * HRB 1968



Wir erfüllen unsere Informationspflichten zum Datenschutz gem. der
Artikel 13

und 14 DS-GVO durch Veröffentlichung auf unserer Internetseite unter:

https://www.gonicus.de/datenschutz oder durch Zusendung auf Ihre
formlose Anfrage.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimize bluestore for random write i/o

2019-03-12 Thread vitalif

Decreasing the min_alloc size isn't always a win, but it can be in some
cases.  Originally bluestore_min_alloc_size_ssd was set to 4096 but we
increased it to 16384 because at the time our metadata path was slow
and increasing it resulted in a pretty significant performance win
(along with increasing the WAL buffers in rocksdb to reduce write
amplification).  Since then we've improved the metadata path to the
point where at least on our test nodes performance is pretty close
between with min_alloc size = 16k and min_alloc size = 4k the last
time I looked.  It might be a good idea to drop it down to 4k now but
I think we need to be careful because there are tradeoffs.


I think it's all about your disks' latency. Deferred write is 1 IO+sync 
and redirect-write is 2 IOs+syncs. So if your IO or sync is slow (like 
it is on HDDs and bad SSDs) then the deferred write is better in terms 
of latency. If your IO is fast then you're only bottlenecked by the OSD 
code itself eating a lot of CPU and then direct write may be better. By 
the way, I think OSD itself is way TOO slow currently (see below).


The idea I was talking about turned out to be only viable for HDD/slow 
SSDs and only for low iodepths. But the gain is huge - something between 
+50% iops to +100% iops (2x less latency). There is a stupid problem in 
current bluestore implementation which makes it do 2 journal writes and 
FSYNCs instead of one for every incoming transaction. The details are 
here: https://tracker.ceph.com/issues/38559


The unnecessary commit is the BlueFS's WAL. All it's doing is recording 
the increased size of a RocksDB WAL file. Which obviously shouldn't be 
required with RocksDB as its default setting is 
"kTolerateCorruptedTailRecords". However, without this setting the WAL 
is not synced to the disk with every write because by some clever logic 
sync_file_range is called only with SYNC_FILE_RANGE_WRITE in the 
corresponding piece of code. Thus the OSD's database gets corrupted when 
you kill it with -9 and thus it's impossible to set 
`bluefs_preextend_wal_files` to true. And thus you get two writes and 
commits instead of one.


I don't know the exact idea behind doing only SYNC_FILE_RANGE_WRITE - as 
I understand there is currently no benefit in doing this. It could be a 
benefit if RocksDB was writing journal in small parts and then doing a 
single sync - but it's always flushing the newly written part of a 
journal to disk as a whole.


The simplest way to fix it is just to add SYNC_FILE_RANGE_WAIT_BEFORE 
and SYNC_FILE_RANGE_WAIT_AFTER to sync_file_range in KernelDevice.cc. My 
pull request is here: https://github.com/ceph/ceph/pull/26909 - I've 
tested this change with 13.2.4 Mimic and 14.1.0 Nautilus and yes, it 
does increase single-thread iops on HDDs two times (!). After this 
change BlueStore becomes actually better than FileStore at least on 
HDDs.


Another way of fixing it would be to add an explicit bdev->flush at the 
end of the kv_sync_thread, after db->submit_transaction_sync(), and 
possibly remove the redundant sync_file_range at all. But then you must 
do the same in another place in _txc_state_proc, because it's also 
sometimes doing submit_transaction_sync(). In the end I personally think 
that to add flags to sync_file_range is better because a function named 
"submit_transaction_sync" should be in fact SYNC! It shouldn't require 
additional steps from the caller to make the data durable.


Also I have a small funny test result to share.

I've created one OSD on my laptop on a loop device in a tmpfs (i.e. 
RAM), created 1 RBD image inside it and tested it with `fio 
-ioengine=rbd -direct=1 -bs=4k -rw=randwrite`. Before doing the test 
I've turned off CPU power saving with `cpupower idle-set -D 0`.


The results are:
- filestore: 2200 iops with -iodepth=1 (0.454ms average latency). 8500 
iops with -iodepth=128.
- bluestore: 1800 iops with -iodepth=1 (0.555ms average latency). 9000 
iops with -iodepth=128.
- memstore: 3000 iops with -iodepth=1 (0.333ms average latency). 11000 
iops with -iodepth=128.


If we can think of memstore being a "minimal possible /dev/null" then:
- OSD overhead is 1/3000 = 0.333ms (maybe slightly less, but that doesn't 
matter).

- filestore overhead is 1/2200-1/3000 = 0.121ms
- bluestore overhead is 1/1800-1/3000 = 0.222ms

The conclusion is that bluestore is actually almost TWO TIMES slower 
than filestore in terms of pure latency, and the throughput is only 
slightly better. How could that happen? How could a newly written store 
become two times slower than the old one? That's pretty annoying...


Could it be because bluestore is doing a lot of threading? I mean could 
it be because each write operation goes through 5 threads during its 
execution? (tp_osd_tp -> aio -> kv_sync_thread -> kv_finalize_thread -> 
finisher)? Maybe just remove aio and kv threads and process all 
operations directly in tp_osd_tp then?

___
ceph-users mailin

[ceph-users] Ceph block storage - block.db useless?

2019-03-12 Thread Benjamin Zapiec
Hello,

I was wondering why my Ceph block.db is nearly empty and started
to investigate.

The recommendation from Ceph is that block.db should be at least
4% of the size of block. So my OSD configuration looks like this:

wal.db   - not explicitly specified
block.db - 250GB of SSD storage
block    - 6TB

Since the WAL is written to block.db if no separate device is given, I didn't
configure a wal.db. With a size of 250GB we are slightly above 4%.

So everything should be "fine". But the block.db only contains
about 10GB of data.

I figured out that an object in block.db gets "amplified", so
the space consumption is much higher than the object itself
would need.

I'm using Ceph as the storage backend for OpenStack, and raw images
with a size of 10GB and more are common. So if I understand
this correctly, I have to consider that a 10GB image may
consume 100GB of block.db.

Besides the fact that the images may have a size of 100G and
are only used for initial reads until all changed
blocks get written to an SSD-only pool, I was asking myself
whether I need a block.db at all, and whether it would be better to
save the amount of SSD space used for block.db and just
create a 10GB wal.db.

Has anyone done this before? Anyone who had sufficient SSD space
but stuck with wal.db to save SSD space?

If I'm correct, the block.db will never be used for huge images.
And even if it were used for one or two images, would this make
sense? The images are used initially to read all unchanged blocks from
them. After a while each VM should access the images pool less and
less due to the changes made in the VM.


Any thoughts about this?


Best regards

-- 
Benjamin Zapiec  (System Engineer)
* GONICUS GmbH * Moehnestrasse 55 (Kaiserhaus) * D-59755 Arnsberg
* Tel.: +49 2932 916-0 * Fax: +49 2932 916-245
* http://www.GONICUS.de

* Sitz der Gesellschaft: Moehnestrasse 55 * D-59755 Arnsberg
* Geschaeftsfuehrer: Rainer Luelsdorf, Alfred Schroeder
* Vorsitzender des Beirats: Juergen Michels
* Amtsgericht Arnsberg * HRB 1968



Wir erfüllen unsere Informationspflichten zum Datenschutz gem. der
Artikel 13

und 14 DS-GVO durch Veröffentlichung auf unserer Internetseite unter:

https://www.gonicus.de/datenschutz oder durch Zusendung auf Ihre
formlose Anfrage.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to attach permission policy to user?

2019-03-12 Thread Pritha Srivastava
What exact error are you seeing after adding admin caps?

I tried the following steps on master and they worked fine: (TESTER1 is
adding a user policy to TESTER)
1. radosgw-admin --uid TESTER --display-name "TestUser" --access_key TESTER
--secret test123 user create
2. radosgw-admin --uid TESTER1 --display-name "TestUser" --access_key
TESTER1 --secret test123 user create
3. radosgw-admin caps add --uid="TESTER1" --caps="user-policy=*"
4. s3curl.pl --debug --id admin -- -s -v -X POST "
http://localhost:8000/?Action=PutUserPolicy&PolicyName=Policy1&UserName=TESTER&PolicyDocument=\{\
"Version\":\"2012-10-17\",\"Statement\":\[\{\"Effect\":\"Deny\",\"Action\":\"s3:*\",\"Resource\":\[\"*\"\],\"Condition\":\{\"BoolIfExists\":\{\"sts:authentication\":\"false\"\}\}\},\{\"Effect\":\"Allow\",\"Action\":\"sts:GetSessionToken\",\"Resource\":\"*\",\"Condition\":\{\"BoolIfExists\":\{\"sts:authentication\":\"false\"\}\}\}\]\}&Version=2010-05-08"

.s3curl is as follows for me:
%awsSecretAccessKeys = (
# personal account
admin => {
id => 'TESTER1',
key => 'test123',
}
);


On Tue, Mar 12, 2019 at 11:09 AM myxingkong  wrote:

> Hi Pritha:
> I added administrator quotas to users, but they didn't seem to work.
> radosgw-admin user create --uid=ADMIN --display-name=ADMIN --admin
> --system
> radosgw-admin caps add --uid="ADMIN"
> --caps="user-policy=*;roles=*;users=*;buckets=*;metadata=*;usage=*;zone=*"
> {
> "user_id": "ADMIN",
> "display_name": "ADMIN",
> "email": "",
> "suspended": 0,
> "max_buckets": 1000,
> "subusers": [],
> "keys": [
> {
> "user": "ADMIN",
> "access_key": "HTRJ1HIKR4FB9A24ZG9C",
> "secret_key": "Dfk7t5u4jvdyFMlEf8t4MTdBLEqVlru7tag1g8PE"
> }
> ],
> "swift_keys": [],
> "caps": [
> {
> "type": "buckets",
> "perm": "*"
> },
> {
> "type": "metadata",
> "perm": "*"
> },
> {
> "type": "roles",
> "perm": "*"
> },
> {
> "type": "usage",
> "perm": "*"
> },
> {
> "type": "user-policy",
> "perm": "*"
> },
> {
> "type": "users",
> "perm": "*"
> },
> {
> "type": "zone",
> "perm": "*"
> }
> ],
> "op_mask": "read, write, delete",
> "system": "true",
> "default_placement": "",
> "default_storage_class": "",
> "placement_tags": [],
> "bucket_quota": {
> "enabled": false,
> "check_on_raw": false,
> "max_size": -1,
> "max_size_kb": 0,
> "max_objects": -1
> },
> "user_quota": {
> "enabled": false,
> "check_on_raw": false,
> "max_size": -1,
> "max_size_kb": 0,
> "max_objects": -1
> },
> "temp_url_keys": [],
> "type": "rgw",
> "mfa_ids": []
> }
> Thanks,
> myxingkong
>
> *发件人:* Pritha Srivastava 
> *发送时间:* 2019-03-12 12:23:24
> *收件人:*  myxingkong 
> *抄送:*  ceph-users@lists.ceph.com
> *主题:* Re: [ceph-users] How to attach permission policy to user?
>
> Hi Myxingkong,
>
> Did you add admin caps to the user (with access key id
> 'HTRJ1HIKR4FB9A24ZG9C'), which is trying to attach a user policy. using the
> command below:
>
> radosgw-admin caps add --uid= --caps="user-policy=*"
>
> Thanks,
> Pritha
>
> On Tue, Mar 12, 2019 at 7:19 AM myxingkong  wrote:
>
>> Hi Pritha:
>> I was unable to attach the permission policy through S3curl, which
>> returned an HTTP 403 error.
>>
>> ./s3curl.pl --id admin -- -s -v -X POST "
>> http://192.168.199.81:7480/?Action=PutUserPolicy&PolicyName=Policy1&UserName=TESTER&PolicyDocument=\{\"Version\":\"2012-10-17\",\"Statement\":\[\{\"Effect\":\"Deny\",\"Action\":\"s3:*\",\"Resource\":\[\"*\"\],\"Condition\":\{\"BoolIfExists\":\{\"sts:authentication\":\"false\"\}\}\},\{\"Effect\":\"Allow\",\"Action\":\"sts:GetSessionToken\",\"Resource\":\"*\",\"Condition\":\{\"BoolIfExists\":\{\"sts:authentication\":\"false\"\}\}\}\]\}&Version=2010-05-08
>> "
>> Request:
>> > POST
>> /?Action=PutUserPolicy&PolicyName=Policy1&UserName=TESTER&PolicyDocument={"Version":"2012-10-17","Statement":[{"Effect":"Deny","Action":"s3:*","Resource":["*"],"Condition":{"BoolIfExists":{"sts:authentication":"false"}}},{"Effect":"Allow","Action":"sts:GetSessionToken","Resource":"*","Condition":{"BoolIfExists":{"sts:authentication":"false"}}}]}&Version=2010-05-08
>> HTTP/1.1
>> > User-Agent: curl/7.29.0
>> > Host: 192.168.199.81:7480
>> > Accept: */*
>> > Date: Tue, 12 Mar 2019 01:39:55 GMT
>> > Authorization: AWS HTRJ1HIKR4FB9A24ZG9C:FTMBoc7+sJf0K+cx+nYD7Sdj2Xg=
>> Response:
>> < HTTP/1.1 403 Forbidden
>> < Content-Length: 187
>> < x-amz-request-id: tx00144-005c870deb-4a92d-default
>> < Accept-Ranges: bytes
>> < Content-Type: application/xml
>> < Date: Tue, 12 Mar 2019 01:39:55 GMT
>> <
>> * Connection #

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Hi Kevin,

I'm sure firewalld is disabled on each host.

Well, the network is not a problem. The servers are connected
to the same switch and the connection is good when the osds
are marked as down. There was no interruption or delay.

I restarted the leader monitor daemon and it seems to have returned to
a normal state.

Thanks.

Kevin Olbrich  于2019年3月12日周二 下午5:44写道:

> Are you sure that firewalld is stopped and disabled?
> Looks exactly like that when I missed one host in a test cluster.
>
> Kevin
>
>
> Am Di., 12. März 2019 um 09:31 Uhr schrieb Zhenshi Zhou <
> deader...@gmail.com>:
>
>> Hi,
>>
>> I deployed a ceph cluster with good performance. But the logs
>> indicate that the cluster is not as stable as I think it should be.
>>
>> The log shows the monitors mark some osd as down periodly:
>> [image: image.png]
>>
>> I didn't find any useful information in osd logs.
>>
>> ceph version 13.2.4 mimic (stable)
>> OS version CentOS 7.6.1810
>> kernel version 5.0.0-2.el7
>>
>> Thanks.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster is not stable

2019-03-12 Thread Kevin Olbrich
Are you sure that firewalld is stopped and disabled?
Looks exactly like that when I missed one host in a test cluster.

Kevin


Am Di., 12. März 2019 um 09:31 Uhr schrieb Zhenshi Zhou :

> Hi,
>
> I deployed a ceph cluster with good performance. But the logs
> indicate that the cluster is not as stable as I think it should be.
>
> The log shows the monitors mark some osd as down periodly:
> [image: image.png]
>
> I didn't find any useful information in osd logs.
>
> ceph version 13.2.4 mimic (stable)
> OS version CentOS 7.6.1810
> kernel version 5.0.0-2.el7
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How To Scale Ceph for Large Numbers of Clients?

2019-03-12 Thread Stefan Kooman
Quoting Zack Brenton (z...@imposium.com):
> Types of devices:
> We run our Ceph pods on 3 AWS i3.2xlarge nodes. We're running 3 OSDs, 3
> Mons, and 2 MDS pods (1 active, 1 standby-replay). Currently, each pod runs
> with the following resources:
> - osds: 2 CPU, 6Gi RAM, 1.7Ti NVMe disk
> - mds:  3 CPU, 24Gi RAM
> - mons: 500m (.5) CPU, 1Gi RAM

Hmm, 6 GiB of RAM is not a whole lot, especially if you are going to
increase the number of OSDs (partitions) like Patrick suggested. By
default each OSD will take 4 GiB ... Make sure you set the
"osd_memory_target" parameter accordingly [1].

Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/?highlight=osd%20memory%20target


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd_recovery_tool not working on Luminous 12.2.11

2019-03-12 Thread Mateusz Skała
Hi,
I have a problem starting two of my OSDs, which fail with this error:

osd.19 pg_epoch: 8887 pg[1.2b5(unlocked)] enter Initial
0> 2019-03-01 09:41:30.259485 7f303486be00 -1 
/build/ceph-12.2.11/src/osd/PGLog.h: In function 'static void 
PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const 
pg_info_t&, PGLog::IndexedLog&, missing_type&, bool, std::ostringstream&, bool, 
bool*, const DoutPrefixProvider*, std::set >*, 
bool) [with missing_type = pg_missing_set; std::ostringstream = 
std::__cxx11::basic_ostringstream]' thread 7f303486be00 time 2019-03-01 
09:41:30.257055
/build/ceph-12.2.11/src/osd/PGLog.h: 1355: FAILED assert(last_e.version.version 
< e.version.version)

I’m trying to recover 3 RBD images from this cluster using rbd_recovery_tool 
but this tool not working for my Luminous cluster. After reading code of 
rbd_recovery_tool I found that in my version of ceph-kvstore-tool I need to 
specify database type. I think that also output of this command is changed.
After running command manual I get this output:

ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-19/current/omap get 
_HOBJTOSEQ_ ...head.1.02B5

(_HOBJTOSEQ_, ...head.1.02B5)
  02 01 81 00 00 00 a9 00  00 00 00 00 00 00 00 00  ||
0010  00 00 00 00 00 00 01 00  00 00 00 00 00 00 02 00  ||
0020  01 01 12 00 00 00 01 00  00 00 00 00 00 00 00 00  ||
0030  00 00 00 ff ff ff ff ff  fe ff ff ff ff ff ff ff  ||
0040  06 03 2b 00 00 00 00 00  00 00 00 00 00 00 fe ff  |..+.|
0050  ff ff ff ff ff ff b5 02  00 00 00 00 00 00 00 01  ||
0060  00 00 00 00 00 00 00 ff  ff ff ff ff ff ff ff ff  ||
0070  00 01 01 10 00 00 00 00  00 00 00 00 00 00 00 00  ||
0080  00 00 00 00 00 00 00  |...|
0087

In src of rbd_recovery_tool I found this comment:

# ceph-kvstore-tool get result like this:
# 02 01 7e 00 00 00 12 44 00 00 00 00 00 00 00 00
# get header seq bytes: 
# 12 44 00 00 00 00 00 00 
# -> 00 00 00 00 00 00 44 12 
# echo $((16#4412)) -> 17426 == header_seq

The output is different, so rbd_recovery_tool can't parse it.
How can I fix this code? Where can I find the "header_seq" for this PG?
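
(For what it's worth, here is a throwaway sketch of just the byte-order 
conversion the comment describes - the example bytes are the ones from the 
comment, and this says nothing about where header_seq sits in the newer 
ceph-kvstore-tool output:)

#include <cstdint>
#include <cstdio>

// The header seq is stored little-endian, so byte 0 is the least
// significant byte: "12 44 00 00 00 00 00 00" decodes to 0x4412 = 17426.
static uint64_t le_bytes_to_u64(const uint8_t bytes[8]) {
    uint64_t value = 0;
    for (int i = 7; i >= 0; --i)
        value = (value << 8) | bytes[i];
    return value;
}

int main() {
    const uint8_t example[8] = {0x12, 0x44, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
    std::printf("header_seq = %llu\n",
                (unsigned long long)le_bytes_to_u64(example));  // prints 17426
    return 0;
}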

Regards
Mateusz
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount cephfs on ceph servers

2019-03-12 Thread Hector Martin
It's worth noting that most containerized deployments can effectively 
limit RAM for containers (cgroups), and the kernel has limits on how 
many dirty pages it can keep around.


In particular, /proc/sys/vm/dirty_ratio (default: 20) means at most 20% 
of your total RAM can be dirty FS pages. If you set up your containers 
such that the cumulative memory usage is capped below, say, 70% of RAM, 
then this might effectively guarantee that you will never hit this issue.


On 08/03/2019 02:17, Tony Lill wrote:

AFAIR the issue is that under memory pressure, the kernel will ask
cephfs to flush pages, but that this in turn causes the osd (mds?) to
require more memory to complete the flush (for network buffers, etc). As
long as cephfs and the OSDs are feeding from the same kernel mempool,
you are susceptible. Containers don't protect you, but a full VM, like
xen or kvm? would.

So if you don't hit the low memory situation, you will not see the
deadlock, and you can run like this for years without a problem. I have.
But you are most likely to run out of memory during recovery, so this
could compound your problems.

On 3/7/19 3:56 AM, Marc Roos wrote:
  


Container =  same kernel, problem is with processes using the same
kernel.






-Original Message-
From: Daniele Riccucci [mailto:devs...@posteo.net]
Sent: 07 March 2019 00:18
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mount cephfs on ceph servers

Hello,
is the deadlock risk still an issue in containerized deployments? For
example with OSD daemons in containers and mounting the filesystem on
the host machine?
Thank you.

Daniele

On 06/03/19 16:40, Jake Grimmett wrote:

Just to add "+1" on this datapoint, based on one month usage on Mimic
13.2.4 essentially "it works great for us"

Prior to this, we had issues with the kernel driver on 12.2.2. This
could have been due to limited RAM on the osd nodes (128GB / 45 OSD),
and an older kernel.

Upgrading the RAM to 256GB and using a RHEL 7.6 derived kernel has
allowed us to reliably use the kernel driver.

We keep 30 snapshots ( one per day), have one active metadata server,
and change several TB daily - it's much, *much* faster than with fuse.

Cluster has 10 OSD nodes, currently storing 2PB, using ec 8:2 coding.

ta ta

Jake




On 3/6/19 11:10 AM, Hector Martin wrote:

On 06/03/2019 12:07, Zhenshi Zhou wrote:

Hi,

I'm gonna mount cephfs from my ceph servers for some reason,
including monitors, metadata servers and osd servers. I know it's
not a best practice. But what is the exact potential danger if I
mount cephfs from its own server?


As a datapoint, I have been doing this on two machines (single-host
Ceph
clusters) for months with no ill effects. The FUSE client performs a
lot worse than the kernel client, so I switched to the latter, and
it's been working well with no deadlocks.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Yep, I think it may be a network issue as well. I'll check the connections.

Thanks Eugen:)

Eugen Block  于2019年3月12日周二 下午4:35写道:

> Hi,
>
> my first guess would be a network issue. Double-check your connections
> and make sure the network setup works as expected. Check syslogs,
> dmesg, switches etc. for hints that a network interruption may have
> occured.
>
> Regards,
> Eugen
>
>
> Zitat von Zhenshi Zhou :
>
> > Hi,
> >
> > I deployed a ceph cluster with good performance. But the logs
> > indicate that the cluster is not as stable as I think it should be.
> >
> > The log shows the monitors mark some osd as down periodly:
> > [image: image.png]
> >
> > I didn't find any useful information in osd logs.
> >
> > ceph version 13.2.4 mimic (stable)
> > OS version CentOS 7.6.1810
> > kernel version 5.0.0-2.el7
> >
> > Thanks.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster is not stable

2019-03-12 Thread Eugen Block

Hi,

my first guess would be a network issue. Double-check your connections  
and make sure the network setup works as expected. Check syslogs,  
dmesg, switches etc. for hints that a network interruption may have  
occurred.


Regards,
Eugen


Zitat von Zhenshi Zhou :


Hi,

I deployed a ceph cluster with good performance. But the logs
indicate that the cluster is not as stable as I think it should be.

The log shows the monitors mark some osd as down periodly:
[image: image.png]

I didn't find any useful information in osd logs.

ceph version 13.2.4 mimic (stable)
OS version CentOS 7.6.1810
kernel version 5.0.0-2.el7

Thanks.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Hi,

I deployed a ceph cluster with good performance. But the logs
indicate that the cluster is not as stable as I think it should be.

The log shows the monitors marking some OSDs as down periodically:
[image: image.png]

I didn't find any useful information in osd logs.

ceph version 13.2.4 mimic (stable)
OS version CentOS 7.6.1810
kernel version 5.0.0-2.el7

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Intel D3-S4610 performance

2019-03-12 Thread Kai Wembacher
Hi everyone,

I have an Intel D3-S4610 SSD with 1.92 TB here for testing and get some pretty 
bad numbers when running the fio benchmark suggested by Sébastien Han 
(http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/):

Intel D3-S4610 1.92 TB
--numjobs=1 write: IOPS=3860, BW=15.1MiB/s (15.8MB/s)(905MiB/60001msec)
--numjobs=2 write: IOPS=7138, BW=27.9MiB/s (29.2MB/s)(1673MiB/60001msec)
--numjobs=4 write: IOPS=12.5k, BW=48.7MiB/s (51.0MB/s)(2919MiB/60002msec)

Compared to our current Samsung SM863 SSDs the Intel one is about 6x slower.

Has someone here tested this SSD and can give me some values for comparison?

Many thanks in advance,

Kai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com