[ceph-users] OSDs stuck in preboot with log msgs about "osdmap fullness state needs update"

2019-01-29 Thread Subhachandra Chandra
Hello,

I have a bunch of OSDs stuck in the preboot stage with the log messages
below while recovering from an outage. These flags are set on the
cluster:

flags nodown,noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub

   How do we get these OSDs back to the active state? Will turning off
nodown or norecover bring them back up?
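
If clearing flags is the way forward, what I have in mind (just a sketch,
to be run once the cluster looks otherwise stable) is:

    ceph osd unset nodown       # let OSDs be marked up/down again
    ceph osd unset norecover    # allow recovery to start
    ceph osd unset nobackfill   # allow backfill once recovery settles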


2019-01-29 19:26:38.866134 7fc7e6682700 -1 osd.116 244652 osdmap fullness
state needs update

2019-01-29 19:26:40.370466 7fc7e6682700 -1 osd.116 244653 osdmap fullness
state needs update

2019-01-29 19:26:41.746553 7fc7e6682700 -1 osd.116 244654 osdmap fullness
state needs update


2019-01-29 19:26:38.934123 7fd91c6bb700 -1 osd.357 244652 osdmap fullness
state needs update

2019-01-29 19:26:40.473567 7fd91c6bb700 -1 osd.357 244653 osdmap fullness
state needs update

2019-01-29 19:26:41.776754 7fd91c6bb700 -1 osd.357 244654 osdmap fullness
state needs update



Thanks

Chandra



[ceph-users] Difference between OSD lost vs rm

2019-01-16 Thread Subhachandra Chandra
Hello,

What is the difference between marking an OSD "lost" vs removing it with
"rm" in terms of cluster recovery?

What is the next step after marking an OSD "lost" and the cluster finishes
recovering? Do you then "rm" it?
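
For context, the sequence I have in mind for a dead OSD (NN is a
placeholder id) is roughly:

    ceph osd lost NN --yes-i-really-mean-it   # give up on its data so PGs can recover elsewhere
    # ...wait for recovery to finish...
    ceph osd crush remove osd.NN
    ceph auth del osd.NN
    ceph osd rm NN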

Thanks
Chandra



Re: [ceph-users] OSDs busy reading from Bluestore partition while bringing up nodes.

2019-01-12 Thread Subhachandra Chandra
I looked at the daemon status as Paul suggested. Will set "nodown noup"
while bringing up the next set of OSDs.
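
Concretely, the plan for the next batch is something like this (a sketch;
the flags get cleared again once the OSDs have caught up):

    ceph osd set noup
    ceph osd set nodown
    # after starting the OSDs, watch them catch up via the admin socket:
    ceph daemon osd.<id> status    # shows state plus oldest_map/newest_map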

The busy OSDs seem to be doing a lot of ops notifying PGs in "unknown"
state where this OSD was part of the PG in the past. Each of these seems to
take about 5-6 minutes before moving on to the next unknown PG. Is there a
way to defer these ops until all the OSDs come up, or to reduce the amount
of time spent before giving up?

root@data5:/# ceph daemon osd.86 ops

{

"ops": [

{

"description": "pg_notify((query:213266 sent:213266 2.46ecs8( v
58049'556 (0'0,58049'556] local-lis/les=60851/60862 n=256 ec=1008/1008
lis/c 60828/60181 les/c/f 60833/60182/0 212117/212129/212129)
8->0)=([60181,212128] intervals=([60411,60413] acting
45(3),134(2),149(4),152(6),167(0),189(8),271(7),445(1),471(5)),([60822,60827]
acting
45(3),149(4),152(6),167(0),189(8),271(7),326(2),445(1),471(5)),([60828,60839]
acting
6(5),45(3),149(4),152(6),167(0),189(8),271(7),326(2),445(1)),([60851,60865]
acting
6(5),45(3),85(2),149(4),152(6),167(0),189(8),271(7),445(1)),([60880,60891]
acting
6(5),45(3),85(2),113(6),149(4),167(0),370(8),445(1),538(7)),([60933,61522]
acting
6(5),45(3),86(0),113(6),122(2),149(4),370(8),379(7),445(1)),([77024,80134]
acting
45(3),86(0),113(6),134(2),149(4),189(8),271(7),471(5)),([181819,186559]
acting 6(5),86(0),113(6),134(2),149(4),189(8),271(7))) epoch 213266)",

"initiated_at": "2019-01-12 18:14:17.922479",

"age": 301.306727,

"duration": 301.306757,

"type_data": {

"flag_point": "started",

"events": [

{

"time": "2019-01-12 18:14:17.922479",

"event": "initiated"

},

{

"time": "2019-01-12 18:19:13.521073",

"event": "started"

}

]

}

}

],

"num_ops": 1

}


root@data5:/# ceph daemon osd.86 ops

{

"ops": [

{

"description": "pg_notify((query:213291 sent:213291 2.46ecs8( v
58049'556 (0'0,58049'556] local-lis/les=60851/60862 n=256 ec=1008/1008
lis/c 60828/60181 les/c/f 60833/60182/0 212117/212129/212129)
8->0)=([60181,212128] intervals=([60411,60413] acting
45(3),134(2),149(4),152(6),167(0),189(8),271(7),445(1),471(5)),([60822,60827]
acting
45(3),149(4),152(6),167(0),189(8),271(7),326(2),445(1),471(5)),([60828,60839]
acting
6(5),45(3),149(4),152(6),167(0),189(8),271(7),326(2),445(1)),([60851,60865]
acting
6(5),45(3),85(2),149(4),152(6),167(0),189(8),271(7),445(1)),([60880,60891]
acting
6(5),45(3),85(2),113(6),149(4),167(0),370(8),445(1),538(7)),([60933,61522]
acting
6(5),45(3),86(0),113(6),122(2),149(4),370(8),379(7),445(1)),([77024,80134]
acting
45(3),86(0),113(6),134(2),149(4),189(8),271(7),471(5)),([181819,186559]
acting 6(5),86(0),113(6),134(2),149(4),189(8),271(7))) epoch 213291)",

"initiated_at": "2019-01-12 18:19:23.596664",

"age": 231.202152,

"duration": 231.202176,

"type_data": {

"flag_point": "started",

"events": [

{

"time": "2019-01-12 18:19:23.596664",

"event": "initiated"

},

{

"time": "2019-01-12 18:23:10.334700",

"event": "started"

}

]

}

}

],

"num_ops": 1

}


Thanks

Chandra

On Fri, Jan 11, 2019 at 10:24 PM Paul Emmerich 
wrote:

> This seems like a case of accumulating lots of new osd maps.
>
> What might help is also setting the noup and nodown flags and wait for
> the OSDs to start up. Use the "status" daemon command to check the
> current OSD state even if it can't come up in the cluster map due to
> noup (it also somewhere shows if it's behind on osd maps IIRC).
> Once they are all running you should be able to take them up again.
>
> This behavior got better with recent Mimic versions -- so I'd also
> recommend to upgrade *after* everything is back to healthy.
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Sat, Jan 12, 2019 at 3:56 AM Subhachandra Chandra
>  wrote:

[ceph-users] OSDs busy reading from Bluestore partition while bringing up nodes.

2019-01-11 Thread Subhachandra Chandra
Hi,

We have a cluster with 9 hosts and 540 HDDs using Bluestore and
containerized OSDs running Luminous 12.2.4. While trying to add new nodes,
the cluster collapsed as it could not keep up with establishing enough TCP
connections. We fixed sysctl settings to handle more connections and to
recycle TIME_WAIT sockets faster. Currently, as we try to restart the
cluster by bringing up a few OSDs at a time, some of the OSDs get very busy
after around 360 of them come up. iostat output shows that the busy OSDs
are constantly reading from the Bluestore partition. The number of busy
OSDs per node varies, and norecover is set with no active clients. OSD logs
don't show anything other than cephx: verify_authorizer errors, which
happen on both busy and idle OSDs and don't seem to be related to the drive
reads.

  How can we figure out why the OSDs are busy reading from the drives? If
it is some kind of recovery, is there a way to track progress? Output of
ceph -s and logs from a busy and idle OSD are copied below.

Thanks
Chandra

Uptime stats with load averages show the variance across the 9 older nodes.

 02:43:44 up 19:21,  0 users,  load average: 0.88, 1.03, 1.06

 02:43:44 up  7:58,  0 users,  load average: 16.91, 13.49, 12.43

 02:43:44 up 1 day, 14 min,  0 users,  load average: 7.67, 6.70, 6.35

 02:43:45 up  7:01,  0 users,  load average: 84.40, 84.20, 83.73

 02:43:45 up  6:40,  1 user,  load average: 17.08, 17.40, 20.05

 02:43:45 up 19:46,  0 users,  load average: 15.58, 11.93, 11.44

 02:43:45 up 20:39,  0 users,  load average: 7.88, 6.50, 5.69

 02:43:46 up 1 day,  1:20,  0 users,  load average: 5.03, 3.81, 3.49

 02:43:46 up 1 day, 58 min,  0 users,  load average: 0.62, 1.00, 1.38


Ceph Config

--

[global]

cluster network = 192.168.13.0/24

fsid = <>

mon host = 172.16.13.101,172.16.13.102,172.16.13.103

mon initial members = ctrl1,ctrl2,ctrl3

mon_max_pg_per_osd = 750

mon_osd_backfillfull_ratio = 0.92

mon_osd_down_out_interval = 900

mon_osd_full_ratio = 0.95

mon_osd_nearfull_ratio = 0.85

osd_crush_chooseleaf_type = 3

osd_heartbeat_grace = 900

mon_osd_laggy_max_interval = 900

osd_max_pg_per_osd_hard_ratio = 1.0

public network = 172.16.13.0/24


[mon]

mon_compact_on_start = true


[osd]

osd_deep_scrub_interval = 2419200

osd_deep_scrub_stride = 4194304

osd_max_backfills = 10

osd_max_object_size = 276824064

osd_max_scrubs = 1

osd_max_write_size = 264

osd_pool_erasure_code_stripe_unit = 2097152

osd_recovery_max_active = 10

osd_heartbeat_interval = 15


Data nodes Sysctl params

-

fs.aio-max-nr=1048576

kernel.pid_max=4194303

kernel.threads-max=2097152

net.core.netdev_max_backlog=65536

net.core.optmem_max=1048576

net.core.rmem_max=8388608

net.core.rmem_default=8388608

net.core.somaxconn=2048

net.core.wmem_max=8388608

net.core.wmem_default=8388608

vm.max_map_count=524288

vm.min_free_kbytes=262144


net.ipv4.tcp_tw_reuse=1

net.ipv4.tcp_max_syn_backlog=16384

net.ipv4.tcp_fin_timeout=10

net.ipv4.tcp_slow_start_after_idle=0



Ceph -s output

---

root@ctrl1:/# ceph -s

  cluster:

id: 06126476-6deb-4baa-b7ca-50f5ccfacb68

health: HEALTH_ERR

noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub
flag(s) set

704 osds down

9 hosts (540 osds) down

71 nearfull osd(s)

2 pool(s) nearfull

780664/74163111 objects misplaced (1.053%)

7724/8242239 objects unfound (0.094%)

396 PGs pending on creation

Reduced data availability: 32597 pgs inactive, 29764 pgs down,
820 pgs peering, 74 pgs incomplete, 1 pg stale

Degraded data redundancy: 679158/74163111 objects degraded
(0.916%), 1250 pgs degraded, 1106 pgs undersized

33 slow requests are blocked > 32 sec

9 stuck requests are blocked > 4096 sec

mons ctrl1,ctrl2,ctrl3 are using a lot of disk space



  services:

mon: 3 daemons, quorum ctrl1,ctrl2,ctrl3

mgr: ctrl1(active), standbys: ctrl2, ctrl3

osd: 1080 osds: 376 up, 1080 in; 1963 remapped pgs

 flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub



  data:

pools:   2 pools, 33280 pgs

objects: 8049k objects, 2073 TB

usage:   2277 TB used, 458 TB / 2736 TB avail

pgs: 3.585% pgs unknown

 94.363% pgs not active

 679158/74163111 objects degraded (0.916%)

 780664/74163111 objects misplaced (1.053%)

 7724/8242239 objects unfound (0.094%)

 29754 down

 1193  unknown

 535   peering

 496   activating+undersized+degraded+remapped

 284   remapped+peering

 258   active+undersized+degraded+remapped

 161   activating+degraded+remapped

 143   active+recovering+undersized+degraded+remapped

 89active+undersized+degraded

 76

Re: [ceph-users] pg count question

2018-08-10 Thread Subhachandra Chandra
The % should be based on how much of the storage you expect that pool to
take up out of the total available. 256 PGs with replication 3 will
distribute themselves as 256 * 3 / 14, which works out to about 55 per OSD.
For the smaller pool, 16 seems too low. You can go with 32 and 256 if you
want a lower number of PGs in the vms pool and expand later. The calculator
recommends 32 and 512 for your settings.
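
Roughly, the arithmetic behind those suggestions (assuming pgcalc's usual
target of ~100 PGs per OSD) is:

    vms:    (100 * 14 OSDs * 0.95) / 3 replicas ≈ 443  -> next power of two = 512
    images: (100 * 14 OSDs * 0.05) / 3 replicas ≈ 23   -> bumped up to 32
    256 * 3 / 14 ≈ 55 PGs per OSD if you start the vms pool at 256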

Subhachandra




On Fri, Aug 10, 2018 at 8:43 AM, Satish Patel  wrote:

> Folks,
>
>
> I used your link to calculate PGs and i did following.
>
> Total OSD: 14
> Replica: 3
> Total Pools: 2  ( Images & vms)  In %Data i gave 5% to images & 95% to
> vms (openstack)
>
> https://ceph.com/pgcalc/
>
> It gave me following result
>
> vms  -  512 PG
> images - 16 PG
>
> For safe side i set vms 256 PG is that a good idea because you can
> increase pg but you can't reduce PG so i want to start with smaller
> and later i have room to increase pg, i just don't want to commit
> bigger which cause other performance issue, Do you think my approach
> is right or i should set 512 now
>
> On Fri, Aug 10, 2018 at 9:23 AM, Satish Patel 
> wrote:
> > Re-sending it, because i found my i lost membership so wanted to make
> > sure, my email went through
> >
> > On Fri, Aug 10, 2018 at 7:07 AM, Satish Patel 
> wrote:
> >> Thanks,
> >>
> >> Can you explain about %Data field in that calculation, is this total
> data
> >> usage for specific pool or total ?
> >>
> >> For example
> >>
> >> Pool-1 is small so should I use 20%
> >> Pool-2 is bigger so I should use 80%
> >>
> >> I'm confused there so can you give me just example how to calculate that
> >> field?
> >>
> >> Sent from my iPhone
> >>
> >> On Aug 9, 2018, at 4:25 PM, Subhachandra Chandra  >
> >> wrote:
> >>
> >> I have used the calculator at https://ceph.com/pgcalc/ which looks at
> >> relative sizes of pools and makes a suggestion.
> >>
> >> Subhachandra
> >>
> >> On Thu, Aug 9, 2018 at 1:11 PM, Satish Patel 
> wrote:
> >>>
> >>> Thanks Subhachandra,
> >>>
> >>> That is good point but how do i calculate that PG based on size?
> >>>
> >>> On Thu, Aug 9, 2018 at 1:42 PM, Subhachandra Chandra
> >>>  wrote:
> >>> > If pool1 is going to be much smaller than pool2, you may want more
> PGs
> >>> > in
> >>> > pool2 for better distribution of data.
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Aug 8, 2018 at 12:40 AM, Sébastien VIGNERON
> >>> >  wrote:
> >>> >>
> >>> >> The formula seems correct for a 100 pg/OSD target.
> >>> >>
> >>> >>
> >>> >> > Le 8 août 2018 à 04:21, Satish Patel  a
> écrit :
> >>> >> >
> >>> >> > Thanks!
> >>> >> >
> >>> >> > Do you have any comments on Question: 1 ?
> >>> >> >
> >>> >> > On Tue, Aug 7, 2018 at 10:59 AM, Sébastien VIGNERON
> >>> >> >  wrote:
> >>> >> >> Question 2:
> >>> >> >>
> >>> >> >> ceph osd pool set-quota  max_objects|max_bytes 
> >>> >> >> set object or byte limit on pool
> >>> >> >>
> >>> >> >>
> >>> >> >>> Le 7 août 2018 à 16:50, Satish Patel  a
> écrit
> >>> >> >>> :
> >>> >> >>>
> >>> >> >>> Folks,
> >>> >> >>>
> >>> >> >>> I am little confused so just need clarification, I have 14 osd
> in
> >>> >> >>> my
> >>> >> >>> cluster and i want to create two pool  (pool-1 & pool-2) how do
> i
> >>> >> >>> device pg between two pool with replication 3
> >>> >> >>>
> >>> >> >>> Question: 1
> >>> >> >>>
> >>> >> >>> Is this correct formula?
> >>> >> >>>
> >>> >> >>> 14 * 100 / 3 / 2 =  233  ( power of 2 would be 256)
> >>> >> >>>
> >>> >> >>> So should i give 256 PG per pool right?
> >>> >> >>>
> >>> >> >>> pool-1 = 256 pg & pgp
> >>> >> >>> poo-2 = 256 pg & pgp
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> Question: 2
> >>> >> >>>
> >>> >> >>> How do i set limit on pool for example if i want pool-1 can only
> >>> >> >>> use
> >>> >> >>> 500GB and pool-2 can use rest of the space?
> >>> >> >>> ___
> >>> >> >>> ceph-users mailing list
> >>> >> >>> ceph-users@lists.ceph.com
> >>> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >> >>
> >>> >>
> >>> >> ___
> >>> >> ceph-users mailing list
> >>> >> ceph-users@lists.ceph.com
> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >
> >>> >
> >>
> >>
>


Re: [ceph-users] pg count question

2018-08-09 Thread Subhachandra Chandra
I have used the calculator at https://ceph.com/pgcalc/ which looks at
relative sizes of pools and makes a suggestion.

Subhachandra

On Thu, Aug 9, 2018 at 1:11 PM, Satish Patel  wrote:

> Thanks Subhachandra,
>
> That is good point but how do i calculate that PG based on size?
>
> On Thu, Aug 9, 2018 at 1:42 PM, Subhachandra Chandra
>  wrote:
> > If pool1 is going to be much smaller than pool2, you may want more PGs in
> > pool2 for better distribution of data.
> >
> >
> >
> >
> > On Wed, Aug 8, 2018 at 12:40 AM, Sébastien VIGNERON
> >  wrote:
> >>
> >> The formula seems correct for a 100 pg/OSD target.
> >>
> >>
> >> > Le 8 août 2018 à 04:21, Satish Patel  a écrit :
> >> >
> >> > Thanks!
> >> >
> >> > Do you have any comments on Question: 1 ?
> >> >
> >> > On Tue, Aug 7, 2018 at 10:59 AM, Sébastien VIGNERON
> >> >  wrote:
> >> >> Question 2:
> >> >>
> >> >> ceph osd pool set-quota  max_objects|max_bytes 
> >> >> set object or byte limit on pool
> >> >>
> >> >>
> >> >>> Le 7 août 2018 à 16:50, Satish Patel  a
> écrit :
> >> >>>
> >> >>> Folks,
> >> >>>
> >> >>> I am little confused so just need clarification, I have 14 osd in my
> >> >>> cluster and i want to create two pool  (pool-1 & pool-2) how do i
> >> >>> device pg between two pool with replication 3
> >> >>>
> >> >>> Question: 1
> >> >>>
> >> >>> Is this correct formula?
> >> >>>
> >> >>> 14 * 100 / 3 / 2 =  233  ( power of 2 would be 256)
> >> >>>
> >> >>> So should i give 256 PG per pool right?
> >> >>>
> >> >>> pool-1 = 256 pg & pgp
> >> >>> poo-2 = 256 pg & pgp
> >> >>>
> >> >>>
> >> >>> Question: 2
> >> >>>
> >> >>> How do i set limit on pool for example if i want pool-1 can only use
> >> >>> 500GB and pool-2 can use rest of the space?
> >> >>> ___
> >> >>> ceph-users mailing list
> >> >>> ceph-users@lists.ceph.com
> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>


Re: [ceph-users] pg count question

2018-08-09 Thread Subhachandra Chandra
If pool1 is going to be much smaller than pool2, you may want more PGs in
pool2 for better distribution of data.




On Wed, Aug 8, 2018 at 12:40 AM, Sébastien VIGNERON <
sebastien.vigne...@criann.fr> wrote:

> The formula seems correct for a 100 pg/OSD target.
>
>
> > Le 8 août 2018 à 04:21, Satish Patel  a écrit :
> >
> > Thanks!
> >
> > Do you have any comments on Question: 1 ?
> >
> > On Tue, Aug 7, 2018 at 10:59 AM, Sébastien VIGNERON
> >  wrote:
> >> Question 2:
> >>
> >> ceph osd pool set-quota  max_objects|max_bytes 
> set object or byte
> limit on pool
> >>
> >>
> >>> Le 7 août 2018 à 16:50, Satish Patel  a écrit :
> >>>
> >>> Folks,
> >>>
> >>> I am little confused so just need clarification, I have 14 osd in my
> >>> cluster and i want to create two pool  (pool-1 & pool-2) how do i
> >>> device pg between two pool with replication 3
> >>>
> >>> Question: 1
> >>>
> >>> Is this correct formula?
> >>>
> >>> 14 * 100 / 3 / 2 =  233  ( power of 2 would be 256)
> >>>
> >>> So should i give 256 PG per pool right?
> >>>
> >>> pool-1 = 256 pg & pgp
> >>> poo-2 = 256 pg & pgp
> >>>
> >>>
> >>> Question: 2
> >>>
> >>> How do i set limit on pool for example if i want pool-1 can only use
> >>> 500GB and pool-2 can use rest of the space?
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Jewel/Luminous Filestore/Bluestore for a new cluster

2018-06-04 Thread Subhachandra Chandra
We have been running Luminous 12.2.4 + Bluestore for about 3 months in
production. All the daemons run as Docker containers and were installed
using ceph-ansible. 540 spinning drives with journal/wal/db on the same
drive, spread across 9 hosts. We use the librados object interface directly,
with a steady 100MB/s of writes to it.

We have not observed any major issues. We have had occasional OSD daemon
crashes due to an assert which is a known bug but the cluster recovered
without any intervention each time. All the nodes have been rebooted 2-3
times due to CoreOS updates and no issues with that either.

If you have any specific questions related to the cluster, please post them
on this thread.

Subhachandra

On Wed, May 30, 2018 at 1:06 PM, Simon Ironside 
wrote:

> On 30/05/18 20:35, Jack wrote:
>
>> Why would you deploy a Jewel cluster, which is almost 3 majors versions
>> away ?
>> Bluestore is also the good answer
>> It works well, have many advantages, and is simply the future of Ceph
>>
>
> Indeed, and normally I wouldn't even ask, but as I say there's been some
> comments/threads recently that make me doubt the obvious Luminous +
> Bluestore path. A few that stand out in my memory are:
>
> * "Useless due to http://tracker.ceph.com/issues/22102; [1]
> * OSD crash with segfault Luminous 12.2.4 [2] [3] [4]
>
> There are others but those two stuck out for me. I realise that people
> will generally only report problems rather than "I installed ceph and
> everything went fine!" stories to this list but it was enough to motivate
> me to ask if Luminous/Bluestore was considered a good choice for a fresh
> install or if I should wait a bit.
>
> Thanks,
> Simon.
>
> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-
> May/026339.html
> [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-
> March/025373.html
> [3] http://tracker.ceph.com/issues/23431
> [4] http://tracker.ceph.com/issues/23352
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] amount of PGs/pools/OSDs for your openstack / Ceph

2018-04-09 Thread Subhachandra Chandra
Our use case is not Openstack but we have a cluster with similar size to
what you are looking at. Our cluster has 540 OSDs with 4PB of raw storage
spread across 9 nodes at this point.

2 pools
   - 512 PGs - 3 way redundancy
   - 32768 PGs - RS(6,3) erasure coding (99.9% of data in this pool)

The reason we chose to go with ~550 PGs/OSD currently is to reduce the
number of data moves that will happen when OSDs are added to the cluster
and the number of PGs needs to be expanded. We have enough memory on the
nodes to handle the high number of PGs: 512GB for 60 OSDs/node. For
testing the cluster, about 2.5TB of data was written to the EC pool using
"rados bench" at 2-3GB/s of sustained throughput. The cluster is being used
with librados and objects are stored directly in the pools. We did not hit
any major issues with simulated scenarios like drive replacement and
recovery.
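
For reference, the ~550 figure is just the total PG shard count divided
across the OSDs:

    (512 PGs * 3 replicas + 32768 PGs * 9 EC shards) / 540 OSDs ≈ 549 PGs per OSD

Doubling both pools (1024 and 65536, as described below) pushes that to
roughly 1100 per OSD.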

We also tested with double the number of PGs in each pool - 1024 and 65536.
The cluster started showing instability at that point. Whenever an OSD went
down, cascading failures started to occur during recovery, i.e. more OSDs
would fail during the peering process when a failed OSD tried to rejoin the
cluster.

Keeping the OSD usage balanced becomes very important as the cluster fills
up. A few OSDs that have much higher usage than the others can stop all
writes into the cluster and it is very hard to recover from it when the
usage is very close to the capacity thresholds.

Subhachandra


On Sat, Apr 7, 2018 at 7:01 PM, Christian Wuerdig <
christian.wuer...@gmail.com> wrote:

> The general recommendation is to target around 100 PG/OSD. Have you tried
> the https://ceph.com/pgcalc/ tool?
>
> On Wed, 4 Apr 2018 at 21:38, Osama Hasebou  wrote:
>
>> Hi Everyone,
>>
>> I would like to know what kind of setup had the Ceph community been using
>> for their Openstack's Ceph configuration when it comes to number of Pools &
>> OSDs and their PGs.
>>
>> Ceph documentation briefly mentions it for small cluster size, and I
>> would like to know from your experience, how much PGs have you created for
>> your openstack pools in reality for a ceph cluster ranging from 1-2 PB
>> capacity or 400-600 number of OSDs that performs well without issues.
>>
>> Hope to hear from you!
>>
>> Thanks.
>>
>> Regards,
>> Ossi
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] Random individual OSD failures with "connection refused reported by" another OSD?

2018-03-28 Thread Subhachandra Chandra
We have seen similar behavior when there are network issues. AFAIK, the OSD
is being reported down by an OSD that cannot reach it. But either another
OSD that can reach it or the heartbeat between the OSD and the monitor
declares it up. The OSD "boot" message does not seem to indicate an actual
OSD restart.
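
If the reporter itself is the one with the network problem, two knobs that
can help while investigating (a sketch, not specific to this cluster) are:

    ceph osd set nodown                 # temporarily stop down-marking
    mon_osd_min_down_reporters = <n>    # in ceph.conf on the mons: require more peers to agree an OSD is down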

Subhachandra

On Wed, Mar 28, 2018 at 10:30 AM, Andre Goree  wrote:

> Hello,
>
> I've recently had a minor issue come up where random individual OSDs are
> failed due to a connection refused on another OSD.  I say minor, bc it's
> not a node-wide issue, and appears to be random nodes -- and besides that,
> the OSD comes up within less than a second, as if the OSD is sent a
> "restart," or something.
>
> On the MON I see this (notice the entire time is ~0.77s):
>
> 2018-03-26 22:39:36.821247 mon.mon-01 [INF] osd.77 failed
> (root=default,host=osd-05) (connection refused reported by osd.55)
> 2018-03-26 22:39:36.935445 mon.mon-01 [WRN] Health check failed: 1 osds
> down (OSD_DOWN)
> 2018-03-26 22:39:39.959496 mon.mon-01 [WRN] Health check failed: Reduced
> data availability: 6 pgs peering (PG_AVAILABILITY)
> 2018-03-26 22:39:41.969578 mon.mon-01 [WRN] Health check failed: Degraded
> data redundancy: 6528/1742880 objects degraded (0.375%), 46 pgs degraded
> (PG_DEGRADED)
> 2018-03-26 22:39:43.978429 mon.mon-01 [INF] Health check cleared:
> PG_AVAILABILITY (was: Reduced data availability: 6 pgs peering)
> 2018-03-26 22:39:48.913112 mon.mon-01 [WRN] Health check update: Degraded
> data redundancy: 19411/1742880 objects degraded (1.114%), 136 pgs degraded
> (PG_DEGRADED)
> 2018-03-26 22:40:06.288138 mon.mon-01 [INF] Health check cleared: OSD_DOWN
> (was: 1 osds down)
> 2018-03-26 22:40:06.301955 mon.mon-01 [INF] osd.77
> 172.16.238.18:6818/57264 boot
> 2018-03-26 22:40:07.298884 mon.mon-01 [WRN] Health check update: Degraded
> data redundancy: 17109/1742880 objects degraded (0.982%), 120 pgs degraded
> (PG_DEGRADED)
> 2018-03-26 22:40:13.330362 mon.mon-01 [INF] Health check cleared:
> PG_DEGRADED (was: Degraded data redundancy: 5605/1742880 objects degraded
> (0.322%), 39 pgs degraded)
> 2018-03-26 22:40:13.330409 mon.mon-01 [INF] Cluster is now healthy
>
>
> On host osd-05 (which hosts osd.77) there appears to be normal heartbeat
> traffic before the OSD spontaneously reboots (truncated for brevity):
>
> 2018-03-26 22:33:00.773897 7efcaf20f700  0 -- 172.16.239.18:6818/7788 >>
> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 73 vs existing csq=73
> existing_state=STATE_STANDBY
> 2018-03-26 22:33:00.774124 7efcaf20f700  0 -- 172.16.239.18:6818/7788 >>
> 172.16.239.21:6818/8675 conn(0x55c5ea2d9000 :6818
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 74 vs existing csq=73
> existing_state=STATE_STANDBY
> 2018-03-26 22:39:56.832556 7f80ceea8e00  0 set uid:gid to 64045:64045
> (ceph:ceph)
> 2018-03-26 22:39:56.832576 7f80ceea8e00  0 ceph version 12.2.4
> (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process
> (unknown), pid 57264
> 2018-03-26 22:39:56.849487 7f80ceea8e00  0 pidfile_write: ignore empty
> --pid-file
> 2018-03-26 22:39:56.859045 7f80ceea8e00  0 load: jerasure load: lrc load:
> isa
> 2018-03-26 22:39:56.859122 7f80ceea8e00  1 bdev create path
> /var/lib/ceph/osd/ceph-77/block type kernel
> 2018-03-26 22:39:56.859135 7f80ceea8e00  1 bdev(0x5568ccae4d80
> /var/lib/ceph/osd/ceph-77/block) open path /var/lib/ceph/osd/ceph-77/block
> 2018-03-26 22:39:56.859398 7f80ceea8e00  1 bdev(0x5568ccae4d80
> /var/lib/ceph/osd/ceph-77/block) open size 8001559724032 (0x7470220,
> 7452 GB) block_size 4096 (4096 B) rotational
> 2018-03-26 22:39:56.859711 7f80ceea8e00  1 
> bluestore(/var/lib/ceph/osd/ceph-77)
> _set_cache_sizes max 0.5 < ratio 0.99
> 2018-03-26 22:39:56.859733 7f80ceea8e00  1 
> bluestore(/var/lib/ceph/osd/ceph-77)
> _set_cache_sizes cache_size 1073741824 meta 0.5 kv 0.5 data 0
> 2018-03-26 22:39:56.859738 7f80ceea8e00  1 bdev(0x5568ccae4d80
> /var/lib/ceph/osd/ceph-77/block) close
> 2018-03-26 22:39:57.132071 7f80ceea8e00  1 
> bluestore(/var/lib/ceph/osd/ceph-77)
> _mount path /var/lib/ceph/osd/ceph-77
> 2018-03-26 22:39:57.132534 7f80ceea8e00  1 bdev create path
> /var/lib/ceph/osd/ceph-77/block type kernel
> ...truncated...
>
>
> Same on host osd-07 (which hosts osd.55, the one that reported connection
> refused), you'll see normal heartbeat traffic, followed by something
> interesting, before normal heartbeat traffic returns:
>
> 2018-03-26 22:33:20.598576 7f495c52e700  0 -- 172.16.239.20:6810/7206 >>
> 172.16.239.21:6816/8668 conn(0x5619e70c9800 :6810
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=0).handle_connect_msg accept connect_seq 42 vs existing csq=41
> existing_state=STATE_STANDBY
> 2018-03-26 22:39:36.974204 7f495cd2f700  0 -- 172.16.239.20:6810/7206 >>
> 172.16.239.17:6805/6991 

[ceph-users] How to persist configuration about enabled mgr plugins in Luminous 12.2.4

2018-03-23 Thread Subhachandra Chandra
Hi,

   We used ceph-ansible to install/update our Ceph cluster config, where
all the Ceph daemons run as containers. In mgr.yml I have the following
config:

###

# MODULES #

###

# Ceph mgr modules to enable, current modules available are:
status,dashboard,localpool,restful,zabbix,prometheus,influx

ceph_mgr_modules: [status,dashboard,prometheus]

In Luminous 12.2.2, when the MGR container restarted, the mgr daemon used
to reload the plugins. Since I upgraded to 12.2.4, the mgr daemon has
stopped reloading the plugins and needs me to run "ceph mgr module enable
<module>" to load them. What changed between the two versions in how the
manager is configured? Is there a config file that can be used to specify
the plugins to load at mgr start time? Looking at ceph-ansible, it looks
like during installation it just runs the "module enable" commands, and
somehow that used to work.
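
For now the workaround is to re-enable the modules by hand after the mgr
comes up, e.g.:

    ceph mgr module enable dashboard
    ceph mgr module enable prometheus
    ceph mgr module ls          # check what the cluster thinks is enabled

There is also an mgr_initial_modules option, but as far as I can tell it
only applies when the cluster is first bootstrapped, so it would not
explain the change in behavior (that part is an assumption on my side).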

Thanks
Subhachandra


Re: [ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Subhachandra Chandra
Looking at the latency numbers in this thread, it seems to be a cut-through
switch.

Subhachandra

On Wed, Mar 21, 2018 at 12:58 PM, Subhachandra Chandra <
schan...@grailbio.com> wrote:

> Latency is a concern if your application is sending one packet at a time
> and waiting for a reply. If you are streaming large blocks of data, the
> first packet is delayed by the network latency but after that you will
> receive a 10Gbps stream continuously. The latency for jumbo frames vs 1500
> byte frames depends upon the switch type. On a cut-through switch there is
> very little difference but on a store-and-forward switch it will be
> proportional to packet size. Most modern switching ASICs are capable of
> cut-through operation.
>
> Subhachandra
>
> On Wed, Mar 21, 2018 at 7:15 AM, Willem Jan Withagen <w...@digiware.nl>
> wrote:
>
>> On 21-3-2018 13:47, Paul Emmerich wrote:
>> > Hi,
>> >
>> > 2.3µs is a typical delay for a 10GBASE-T connection. But fiber or SFP+
>> > DAC connections should be faster: switches are typically in the range of
>> > ~500ns to 1µs.
>> >
>> >
>> > But you'll find that this small difference in latency induced by the
>> > switch will be quite irrelevant in the grand scheme of things when using
>> > the Linux network stack...
>>
>> But I think it does when people start to worry about selecting High
>> clock speed CPUS versus packages with more cores...
>>
>> 900ns is quite a lot if you have that mindset.
>> And probably 1800ns at that, because the delay will be a both ends.
>> Or perhaps even 3600ns because the delay is added at every ethernet
>> connector???
>>
>> But I'm inclined to believe you that the network stack could take quite
>> some time...
>>
>>
>> --WjW
>>
>>
>> > Paul
>> >
>> > 2018-03-21 12:16 GMT+01:00 Willem Jan Withagen <w...@digiware.nl
>> > <mailto:w...@digiware.nl>>:
>> >
>> > Hi,
>> >
>> > I just ran into this table for a 10G Netgear switch we use:
>> >
>> > Fiber delays:
>> > 10 Gbps fiber delay (64-byte packets): 1.827 µs
>> > 10 Gbps fiber delay (512-byte packets): 1.919 µs
>> > 10 Gbps fiber delay (1024-byte packets): 1.971 µs
>> > 10 Gbps fiber delay (1518-byte packets): 1.905 µs
>> >
>> > Copper delays:
>> > 10 Gbps copper delay (64-byte packets): 2.728 µs
>> > 10 Gbps copper delay (512-byte packets): 2.85 µs
>> > 10 Gbps copper delay (1024-byte packets): 2.904 µs
>> > 10 Gbps copper delay (1518-byte packets): 2.841 µs
>> >
>> > Fiber delays:
>> > 1 Gbps fiber delay (64-byte packets): 2.289 µs
>> > 1 Gbps fiber delay (512-byte packets): 2.393 µs
>> > 1 Gbps fiber delay (1024-byte packets): 2.423 µs
>> > 1 Gbps fiber delay (1518-byte packets): 2.379 µs
>> >
>> > Copper delays:
>> > 1 Gbps copper delay (64-byte packets): 2.707 µs
>> > 1 Gbps copper delay (512-byte packets): 2.821 µs
>> > 1 Gbps copper delay (1024-byte packets): 2.866 µs
>> > 1 Gbps copper delay (1518-byte packets): 2.826 µs
>> >
>> > So the difference is serious: 900ns on a total of 1900ns for a 10G
>> > pakket.
>> > Other strange thing is that 1K packets are slower than 1518 bytes.
>> >
>> > So that might warrant connecting boxes preferably with optics
>> > instead of CAT cableing if you are trying to squeeze the max out of
>> > a setup.
>> >
>> > Sad thing is that they do not report for jumbo frames, and doing
>> these
>> > measurements your self is not easy...
>> >
>> > --WjW
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>> >
>> >
>> >
>> >
>> > --
>> > --
>> > Paul Emmerich
>> >
>> > croit GmbH
>> > Freseniusstr. 31h
>> > 81247 München
>> > www.croit.io <http://www.croit.io>
>> > Tel: +49 89 1896585 90
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>


Re: [ceph-users] Difference in speed on Copper of Fiber ports on switches

2018-03-21 Thread Subhachandra Chandra
Latency is a concern if your application is sending one packet at a time
and waiting for a reply. If you are streaming large blocks of data, the
first packet is delayed by the network latency but after that you will
receive a 10Gbps stream continuously. The latency for jumbo frames vs 1500
byte frames depends upon the switch type. On a cut-through switch there is
very little difference but on a store-and-forward switch it will be
proportional to packet size. Most modern switching ASICs are capable of
cut-through operation.
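
As a rough sanity check on the store-and-forward penalty (plain
serialization arithmetic, not vendor numbers): the switch has to receive
the whole frame before it can start forwarding, so the added delay per hop
is frame size divided by line rate:

    1518 bytes * 8 / 10 Gbit/s ≈ 1.2 µs
    9000 bytes * 8 / 10 Gbit/s ≈ 7.2 µs   (jumbo frame)
      64 bytes * 8 / 10 Gbit/s ≈ 0.05 µs

which is why jumbo frames only look expensive on store-and-forward
hardware.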

Subhachandra

On Wed, Mar 21, 2018 at 7:15 AM, Willem Jan Withagen 
wrote:

> On 21-3-2018 13:47, Paul Emmerich wrote:
> > Hi,
> >
> > 2.3µs is a typical delay for a 10GBASE-T connection. But fiber or SFP+
> > DAC connections should be faster: switches are typically in the range of
> > ~500ns to 1µs.
> >
> >
> > But you'll find that this small difference in latency induced by the
> > switch will be quite irrelevant in the grand scheme of things when using
> > the Linux network stack...
>
> But I think it does when people start to worry about selecting High
> clock speed CPUS versus packages with more cores...
>
> 900ns is quite a lot if you have that mindset.
> And probably 1800ns at that, because the delay will be a both ends.
> Or perhaps even 3600ns because the delay is added at every ethernet
> connector???
>
> But I'm inclined to believe you that the network stack could take quite
> some time...
>
>
> --WjW
>
>
> > Paul
> >
> > 2018-03-21 12:16 GMT+01:00 Willem Jan Withagen  > >:
> >
> > Hi,
> >
> > I just ran into this table for a 10G Netgear switch we use:
> >
> > Fiber delays:
> > 10 Gbps fiber delay (64-byte packets): 1.827 µs
> > 10 Gbps fiber delay (512-byte packets): 1.919 µs
> > 10 Gbps fiber delay (1024-byte packets): 1.971 µs
> > 10 Gbps fiber delay (1518-byte packets): 1.905 µs
> >
> > Copper delays:
> > 10 Gbps copper delay (64-byte packets): 2.728 µs
> > 10 Gbps copper delay (512-byte packets): 2.85 µs
> > 10 Gbps copper delay (1024-byte packets): 2.904 µs
> > 10 Gbps copper delay (1518-byte packets): 2.841 µs
> >
> > Fiber delays:
> > 1 Gbps fiber delay (64-byte packets): 2.289 µs
> > 1 Gbps fiber delay (512-byte packets): 2.393 µs
> > 1 Gbps fiber delay (1024-byte packets): 2.423 µs
> > 1 Gbps fiber delay (1518-byte packets): 2.379 µs
> >
> > Copper delays:
> > 1 Gbps copper delay (64-byte packets): 2.707 µs
> > 1 Gbps copper delay (512-byte packets): 2.821 µs
> > 1 Gbps copper delay (1024-byte packets): 2.866 µs
> > 1 Gbps copper delay (1518-byte packets): 2.826 µs
> >
> > So the difference is serious: 900ns on a total of 1900ns for a 10G
> > pakket.
> > Other strange thing is that 1K packets are slower than 1518 bytes.
> >
> > So that might warrant connecting boxes preferably with optics
> > instead of CAT cableing if you are trying to squeeze the max out of
> > a setup.
> >
> > Sad thing is that they do not report for jumbo frames, and doing
> these
> > measurements your self is not easy...
> >
> > --WjW
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> >
> >
> >
> >
> > --
> > --
> > Paul Emmerich
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io 
> > Tel: +49 89 1896585 90
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] OSD crash with segfault Luminous 12.2.4

2018-03-08 Thread Subhachandra Chandra
I noticed a similar crash too. Unfortunately, I did not get much info in
the logs.

 *** Caught signal (Segmentation fault) **

Mar 07 17:58:26 data7 ceph-osd-run.sh[796380]:  in thread 7f63a0a97700
thread_name:safe_timer

Mar 07 17:58:28 data7 ceph-osd-run.sh[796380]: docker_exec.sh: line 56:
797138 Segmentation fault  (core dumped) "$@"


Thanks

Subhachandra


On Thu, Mar 8, 2018 at 6:00 AM, Dietmar Rieder 
wrote:

> Hi,
>
> I noticed in my client (using cephfs) logs that an osd was unexpectedly
> going down.
> While checking the osd logs for the affected OSD I found that the osd
> was seg faulting:
>
> []
> 2018-03-07 06:01:28.873049 7fd9af370700 -1 *** Caught signal
> (Segmentation fault) **
>  in thread 7fd9af370700 thread_name:safe_timer
>
>   ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
> luminous (stable)
>1: (()+0xa3c611) [0x564585904611]
> 2: (()+0xf5e0) [0x7fd9b66305e0]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> [...]
>
> Should I open a ticket for this? What additional information is needed?
>
>
> I put the relevant log entries for download under [1], so maybe someone
> with more
> experience can find some useful information therein.
>
> Thanks
>   Dietmar
>
>
> [1] https://expirebox.com/download/6473c34c80e8142e22032469a59df555.html
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-01 Thread Subhachandra Chandra
Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives
filled to around 90%. One thing that does increase memory usage is the
number of clients simultaneously sending write requests to a particular
primary OSD if the write sizes are large.

Subhachandra

On Thu, Mar 1, 2018 at 6:18 AM, David Turner  wrote:

> With default memory settings, the general rule is 1GB ram/1TB OSD.  If you
> have a 4TB OSD, you should plan to have at least 4GB ram.  This was the
> recommendation for filestore OSDs, but it was a bit much memory for the
> OSDs.  From what I've seen, this rule is a little more appropriate with
> bluestore now and should still be observed.
>
> Please note that memory usage in a HEALTH_OK cluster is not the same
> amount of memory that the daemons will use during recovery.  I have seen
> deployments with 4x memory usage during recovery.
>
> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman  wrote:
>
>> Quoting Caspar Smit (caspars...@supernas.eu):
>> > Stefan,
>> >
>> > How many OSD's and how much RAM are in each server?
>>
>> Currently 7 OSDs, 128 GB RAM. Max wil be 10 OSDs in these servers. 12
>> cores (at least one core per OSD).
>>
>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM
>> right?
>>
>> Apparently. Sure they will use more RAM than just cache to function
>> correctly. I figured 3 GB per OSD would be enough ...
>>
>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of
>> total
>> > RAM. The cache is a part of the memory usage by bluestore OSD's.
>>
>> A factor 4 is quite high, isn't it? Where is all this RAM used for
>> besides cache? RocksDB?
>>
>> So how should I size the amount of RAM in a OSD server for 10 bluestore
>> SSDs in a
>> replicated setup?
>>
>> Thanks,
>>
>> Stefan
>>
>> --
>> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
>> | GPG: 0xD14839C6   +31 318 648 688
>> <+31%20318%20648%20688> / i...@bit.nl
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] Trying to increase number of PGs throws "Error E2BIG" though PGs/OSD < mon_max_pg_per_osd

2018-01-12 Thread Subhachandra Chandra
Thank you for the explanation, Brad. I will change that setting and see how
it goes.

Subhachandra
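
To double-check the limit Brad describes below against our numbers:

    new PGs requested:  65536 - 32768 = 32768
    allowed by default: mon_osd_max_split_count (32) * ~540 OSDs = 17280
    32768 > 17280  ->  Error E2BIG

so raising mon_osd_max_split_count (or splitting in smaller steps) should
get past it.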

On Thu, Jan 11, 2018 at 10:38 PM, Brad Hubbard <bhubb...@redhat.com> wrote:

> On Fri, Jan 12, 2018 at 11:27 AM, Subhachandra Chandra
> <schan...@grailbio.com> wrote:
> > Hello,
> >
> >  We are running experiments on a Ceph cluster before we move data on
> it.
> > While trying to increase the number of PGs on one of the pools it threw
> the
> > following error
> >
> > root@ctrl1:/# ceph osd pool set data pg_num 65536
> > Error E2BIG: specified pg_num 65536 is too large (creating 32768 new PGs
> on
> > ~540 OSDs exceeds per-OSD max of 32)
>
> That comes from here:
>
> https://github.com/ceph/ceph/blob/5d7813f612aea59239c8375aaa0091
> 9ae32f952f/src/mon/OSDMonitor.cc#L6027
>
> So the warning is triggered because new_pgs (65536) >
> g_conf->mon_osd_max_split_count (32) * expected_osds (540)
>
> >
> > There are 2 pools named "data" and "metadata". "data" is an erasure coded
> > pool (6,3) and "metadata" is a replicated pool with a replication factor
> of
> > 3.
> >
> > root@ctrl1:/# ceph osd lspools
> > 1 metadata,2 data,
> > root@ctrl1:/# ceph osd pool get metadata pg_num
> > pg_num: 512
> > root@ctrl1:/# ceph osd pool get data pg_num
> > pg_num: 32768
> >
> > osd: 540 osds: 540 up, 540 in
> >  flags noout,noscrub,nodeep-scrub
> >
> >   data:
> > pools:   2 pools, 33280 pgs
> > objects: 7090k objects, 1662 TB
> > usage:   2501 TB used, 1428 TB / 3929 TB avail
> > pgs: 33280 active+clean
> >
> > The current PG/OSD ratio according to my calculation should be 549
> >>>> (32768 * 9 + 512 * 3 ) / 540.0
> > 548.97778
> >
> > Increasing the number of PGs in the "data" pool should increase the
> PG/OSD
> > ratio to about 1095
> >>>> (65536 * 9 + 512 * 3 ) / 540.0
> > 1095.
> >
> > In the config, settings related to PG/OSD ratio look like
> > mon_max_pg_per_osd = 1500
> > osd_max_pg_per_osd_hard_ratio = 1.0
> >
> > Trying to increase the number of PGs to 65536 throws the previously
> > mentioned error. The new PG/OSD ratio is still under the configured
> limit.
> > Why do we see the error? Further, there seems to be a bug in the error
> > message where it says "exceeds per-OSD max of 32" in terms of where does
> > "32" comes from?
>
> Maybe the wording could be better. Perhaps "exceeds per-OSD max with
> mon_osd_max_split_count of 32". I'll submit this and see how it goes.
>
> >
> > P.S. I understand that the PG/OSD ratio configured on this cluster far
> > exceeds the recommended values. The experiment is to find scaling limits
> and
> > try out expansion scenarios.
> >
> > Thanks
> > Subhachandra
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Cheers,
> Brad
>


[ceph-users] Trying to increase number of PGs throws "Error E2BIG" though PGs/OSD < mon_max_pg_per_osd

2018-01-11 Thread Subhachandra Chandra
Hello,

 We are running experiments on a Ceph cluster before we move data on
it. While trying to increase the number of PGs on one of the pools it threw
the following error

root@ctrl1:/# ceph osd pool set data pg_num 65536
Error E2BIG: specified pg_num 65536 is too large (creating 32768 new PGs on
~540 OSDs exceeds per-OSD max of 32)

There are 2 pools named "data" and "metadata". "data" is an erasure coded
pool (6,3) and "metadata" is a replicated pool with a replication factor of
3.

root@ctrl1:/# ceph osd lspools
1 metadata,2 data,
root@ctrl1:/# ceph osd pool get metadata pg_num
pg_num: 512
root@ctrl1:/# ceph osd pool get data pg_num
pg_num: 32768

osd: 540 osds: 540 up, 540 in
 flags noout,noscrub,nodeep-scrub

  data:
pools:   2 pools, 33280 pgs
objects: 7090k objects, 1662 TB
usage:   2501 TB used, 1428 TB / 3929 TB avail
pgs: 33280 active+clean

The current PG/OSD ratio according to my calculation should be 549
>>> (32768 * 9 + 512 * 3 ) / 540.0
548.97778

Increasing the number of PGs in the "data" pool should increase the PG/OSD
ratio to about 1095
>>> (65536 * 9 + 512 * 3 ) / 540.0
1095.

In the config, settings related to PG/OSD ratio look like
mon_max_pg_per_osd = 1500
osd_max_pg_per_osd_hard_ratio = 1.0

Trying to increase the number of PGs to 65536 throws the previously
mentioned error. The new PG/OSD ratio is still under the configured limit.
Why do we see the error? Further, the error message itself is confusing:
it says "exceeds per-OSD max of 32", but where does the "32" come from?

P.S. I understand that the PG/OSD ratio configured on this cluster far
exceeds the recommended values. The experiment is to find scaling limits
and try out expansion scenarios.

Thanks
Subhachandra


Re: [ceph-users] The way to minimize osd memory usage?

2017-12-11 Thread Subhachandra Chandra
I ran an experiment with 1GB memory per OSD using Bluestore. 12.2.2 made a
big difference.

In addition, you should have a look at your max object size. It looks like
you will see a jump in memory usage if a particular OSD happens to be the
primary for a number of objects being written in parallel. In our case
reducing the number of clients reduced memory requirements. Reducing max
object size should also reduce memory requirements on the OSD daemon.
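
A couple of things worth watching while experimenting with this (a sketch,
via the OSD admin socket):

    ceph daemon osd.<id> config get osd_max_object_size
    ceph daemon osd.<id> dump_mempools    # watch buffer_anon growth under client write load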

Subhachandra




Re: [ceph-users] Memory leak in OSDs running 12.2.1 beyond the buffer_anon mempool leak

2017-12-07 Thread Subhachandra Chandra
It looks like 12.2.2 fixed the memory leak with bluestore. I don't see the
fast leak anymore. Will monitor for any slow leaks.

Thanks
Subhachandra



On Tue, Dec 5, 2017 at 9:37 AM, Subhachandra Chandra <schan...@grailbio.com>
wrote:

> That is what I will try today. I tried "filestore" with 12.2.1 and did not
> see any issues. Will repeat the experiment with "bluestore" and 12.2.2.
>
> Thanks
> Subhachandra
>
> On Tue, Dec 5, 2017 at 5:14 AM, Konstantin Shalygin <k0...@k0ste.ru>
> wrote:
>
>>   We are trying out Ceph on a small cluster and are observing memory
>>> leakage in the OSD processes.
>>>
>> Try new 12.2.2 - this release should fix memory issues with Bluestore.
>>
>>
>


Re: [ceph-users] Memory leak in OSDs running 12.2.1 beyond the buffer_anon mempool leak

2017-12-05 Thread Subhachandra Chandra
That is what I will try today. I tried "filestore" with 12.2.1 and did not
see any issues. Will repeat the experiment with "bluestore" and 12.2.2.

Thanks
Subhachandra

On Tue, Dec 5, 2017 at 5:14 AM, Konstantin Shalygin  wrote:

>   We are trying out Ceph on a small cluster and are observing memory
>> leakage in the OSD processes.
>>
> Try new 12.2.2 - this release should fix memory issues with Bluestore.
>
>


[ceph-users] Memory leak in OSDs running 12.2.1 beyond the buffer_anon mempool leak

2017-11-29 Thread Subhachandra Chandra
Hello,

   We are trying out Ceph on a small cluster and are observing memory
leakage in the OSD processes. The leak seems to be in addition to the known
leak related to the "buffer_anon" pool and is high enough for the processes
to run against their memory limits in a few hours.

The following table gives a snapshot of the increase in memory used by
one of the OSD processes over an hour (t+63 indicates 63 minutes after the
first snapshot). Full mempool dumps and output of top are at the bottom.
Over an hour the OSDs went from RSS in the range 469-704MB to 735-844MB.
The containers are restricted to 1GB of memory, which causes them to
restart after a few hours.

              t+00       t+11       t+63
            683980     706980     786324    VmRSS (KB)
              5803      10457      32308    buffer_anon (KB, dump_mempools)
            437369     444945     458688    total (KB, dump_mempools)
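
The buffer_anon and total rows come from the per-OSD mempool dump,
collected along these lines:

    ceph daemon osd.<id> dump_mempools | grep -A 2 buffer_anon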



Our setup is as follows:
* 3 nodes each with 30 OSDs for a total of 90 OSDs.
* Running Luminous (12.2.1)  official docker images on top of CoreOS
* The OSDs use Bluestore with all the db.* partitions on the same drive
* The nodes have 32GB of RAM and 8 cores. The test cluster nodes do have
less than the recommended amount of RAM per OSD to constrain them and find
problems
* The cluster currently has 501 PGs/OSD (Again higher than recommended for
testing)
* The pools are setup for RGW usage with replication_factor of 3 on all the
pools (2752 PGs) except default.rgw.buckets.data (4096 PGs) which is setup
with 6+3 erasure coding.
* The clients use the python rados library to push 128MB files directly
into the default.rgw.buckets.data pool. There are 3 clients running in
parallel on VMs and are pushing  about 350-400MB/s in aggregate.

The conf file with non-default settings looks like
[global]
mon_max_pg_per_osd = 750
mon_osd_down_out_interval = 900
mon_pg_warn_max_per_osd = 600
osd_crush_chooseleaf_type = 0
osd_map_message_max = 10
osd_max_pg_per_osd_hard_ratio = 1.2

[mon]
mon_max_pgmap_epochs = 100
mon_min_osdmap_epochs = 100

[osd]
bluestore_cache_kv_ratio = .95
bluestore_cache_kv_max = 67108864
bluestore_cache_meta_ratio = .05
bluestore_cache_size = 268435456
osd_map_cache_size = 50
osd_map_max_advance = 25
osd_map_share_max_epochs = 25
osd_max_object_size = 1073741824
osd_max_write_size = 256
osd_pg_epoch_persisted_max_stale = 25
osd_pool_erasure_code_stripe_unit = 4194304


top - 20:46:18 up 1 day,  2:21,  2 users,  load average: 3.13, 1.78, 1.25
Tasks: 567 total,   1 running, 566 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.7 us,  5.9 sy,  0.0 ni, 73.5 id, 11.1 wa,  0.3 hi,  2.5 si,
0.0 st
KiB Mem:  32981660 total, 24427392 used,  8554268 free,   351396 buffers
KiB Swap:0 total,0 used,0 free.  2803348 cached Mem

PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
COMMAND
 376901 64045 20   0 150 704896  29056 S  14.4  2.1   1:26.72
ceph-osd
 370611 64045 20   0 1527432 698080  29476 S   2.0  2.1   1:29.03
ceph-osd
 396886 64045 20   0 1486584 696480  29060 S   2.0  2.1   1:22.93
ceph-osd
 382254 64045 20   0 1516968 690196  28984 S   3.0  2.1   1:27.15
ceph-osd
 359523 64045 20   0 1516888 686728  29332 S   3.5  2.1   1:28.67
ceph-osd
 366478 64045 20   0 1560912 683980  29076 S   1.5  2.1   1:28.59
ceph-osd
 382255 64045 20   0 1493116 669616  29276 S   1.5  2.0   1:29.46
ceph-osd
 360152 64045 20   0 1529896 28  29268 S   0.5  2.0   1:27.96
ceph-osd
 372155 64045 20   0 1523640 662492  29416 S  17.4  2.0   1:29.79
ceph-osd
 358800 64045 20   0 1513640 662224  29184 S  13.9  2.0   1:29.80
ceph-osd
 360142 64045 20   0 1517992 661868  29328 S   0.5  2.0   1:31.69
ceph-osd
 398310 64045 20   0 1504552 658216  28796 S   1.0  2.0   1:20.62
ceph-osd
 368705 64045 20   0 1505544 657776  29292 S   1.0  2.0   1:27.32
ceph-osd
 386044 64045 20   0 1501488 655960  29536 S   3.0  2.0   1:24.87
ceph-osd
 386940 64045 20   0 1503056 652552  29152 S   4.5  2.0   1:28.22
ceph-osd
 386050 64045 20   0 1489996 650628  28800 S   1.0  2.0   1:28.46
ceph-osd
 402086 64045 20   0 1504528 646672  29344 S   2.5  2.0   1:26.96
ceph-osd
 400590 64045 20   0 1487424 642288  29348 S   3.5  1.9   1:21.55
ceph-osd
 387860 64045 20   0 1504520 641296  29316 S   4.0  1.9   1:19.98
ceph-osd
 392900 64045 20   0 1493492 637572  29156 S   1.5  1.9   1:26.86
ceph-osd
 375314 64045 20   0 1520448 629272  29412 S   1.0  1.9   1:32.04
ceph-osd
 372038 64045 20   0 1497992 627176  29300 S   1.0  1.9   1:30.36
ceph-osd
 385149 64045 20   0 1514284 624428  28960 S   0.5  1.9   1:28.56
ceph-osd
 382236 64045 20   0 1512248 616256  29568 S   2.0  1.9   1:24.03
ceph-osd
 374703 64045 20   0 1511740 571404  29628 S   2.5  1.7   1:27.88
ceph-osd
 367873 64045 20   0 1394740 564488  29012 S   2.5  1.7   1:31.64
ceph-osd
 360104 64045 20   0 1373880 532880  29132 S   2.5  1.6   1:32.11
ceph-osd
 376002 64045 20   0 1391576 516256  29132 S   0.5  1.6   

[ceph-users] ceph-disk should wait for device file /dev/sdXX file to be created before trying to run mkfs

2017-11-13 Thread Subhachandra Chandra
Hi,

I am using ceph-ansible to deploy ceph to run as a container on VMs
running on my laptop. The VMs run CoreOS and the docker image being
installed has the tag "tag-build-master-luminous-ubuntu-16.04". The backend
is "bluestore".

  While running the "ceph-osd-prepare" stage, the installation fails while
trying to create an XFS file system on /dev/sdX1. The issue seems to be
that the device file /dev/sdX1 is not visible inside the container when
this command is run; the file becomes visible/created after a small delay.
I verified that the file did exist shortly after the command failed by
looking both inside the container and on the host.

When creating an OSD node with two data drives, the command sometimes
succeeds on one or more of the drives while the others fail. When run
enough times, it succeeds on all the drives on that node.

populate_data_path_device: Creating xfs fs on /dev/sdb1
command_check_call: Running command: /sbin/mkfs -t xfs -f -i size=2048 --
/dev/sdb1
/dev/sdb1: No such file or directory

It looks like ceph-disk should wait for the device file to exist before
running populate_data_path_device. Does this seem correct, or am I hitting
some other issue?
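
As a stopgap I am thinking of making the prepare step wait for the
partition node before running mkfs, something along these lines (assuming
udev is what creates the node inside the container):

    udevadm settle --timeout=30
    while [ ! -b /dev/sdb1 ]; do sleep 1; done
    /sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdb1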

Thanks
Chandra