Re: [ceph-users] bluestore OSD did not start at system-boot

2018-04-05 Thread Nico Schottelius

Hey Ansgar,

we have a similar "problem": in our case all servers are wiped on
reboot, as they boot their operating system from the network into
initramfs.

While the OS configuration is done with cdist [0], we consider ceph osds
to be more dynamic data and simply re-initialise all osds on boot using
the ungleich-tools [1] suite, which we created mostly for working with
ceph clusters.

Especially [2] might be of interest for you.

HTH,

Nico

[0] https://www.nico.schottelius.org/software/cdist/
[1] https://github.com/ungleich/ungleich-tools
[2] https://github.com/ungleich/ungleich-tools/blob/master/ceph-osd-activate-all
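If your ceph-volume is recent enough, a single command might also do what
your loop below does (untested on your setup, so please double check):

  # activate every OSD that ceph-volume lvm knows about in one go;
  # this is roughly what the ceph-volume systemd units do at boot
  ceph-volume lvm activate --all

If that works when run manually, the real question is probably why the
corresponding systemd units are not being triggered at boot on your nodes.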



Ansgar Jazdzewski  writes:

> hi folks,
>
> i just figured out that my OSDs did not start because the filesystem
> is not mounted.
>
> So i wrote a script to hack my way around it:
> #
> #! /usr/bin/env bash
>
> DATA=( $(ceph-volume lvm list | grep -e 'osd id\|osd fsid' | awk '{print $3}' | tr '\n' ' ') )
>
> OSDS=$(( ${#DATA[@]}/2 ))
>
> for OSD in $(seq 0 $(($OSDS-1))); do
>  ceph-volume lvm activate "${DATA[( $OSD*2 )]}" "${DATA[( $OSD*2+1 )]}"
> done
> #
>
> i'm sure that this is not the way it should be!? So any help is
> welcome to figure out why my BlueStore OSDs are not mounted at
> boot-time.
>
> Thanks,
> Ansgar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stuck in creating+activating

2018-03-17 Thread Nico Schottelius

You hit the nail on the head! Thanks a lot!

If you are ever around in Switzerland, there is a free beer [tm] in it
for you.
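(For the archives: as far as I understand, injectargs only changes the
running daemons, so to make the setting below permanent it presumably also
needs to go into ceph.conf, e.g.

  [global]
  mon_max_pg_per_osd = 500

on the mon/osd hosts.)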


Vladimir Prokofev <v...@prokofev.me> writes:

> My first guess would be PG overdose protection kicked in [1][2]
> You can try fixing it by increasing allowed number of PG per OSD with
> ceph tell mon.* injectargs '--mon_max_pg_per_osd 500'
> ceph tell osd.* injectargs '--mon_max_pg_per_osd 500'
> and then triggering CRUSH algorithm update by restarting an OSD for example.
>
> [1] https://ceph.com/community/new-luminous-pg-overdose-protection/
> [2]
> https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/
>
> 2018-03-17 12:15 GMT+03:00 Nico Schottelius <nico.schottel...@ungleich.ch>:
>
>>
>> Good morning,
>>
>> some days ago we created a new pool with 512 pgs, and originally 5 osds.
>> We use the device class "ssd" and a crush rule that maps all data for
>> the pool "ssd" to the ssd device class osds.
>>
>> While creating, one of the ssds failed and we are left with 4 osds:
>>
>> [10:00:22] server2.place6:/var/log/ceph# ceph osd tree
>> ID CLASS WEIGHTTYPE NAMESTATUS REWEIGHT PRI-AFF
>> -1   135.12505 root default
>> -751.36911 host server2
>> 15   hdd-big   9.09511 osd.15   up  1.0 1.0
>> 20   hdd-big   9.09511 osd.20   up  1.0 1.0
>> 21   hdd-big   9.09511 osd.21   up  1.0 1.0
>>  7 hdd-small   4.54776 osd.7up  1.0 1.0
>>  8 hdd-small   4.54776 osd.8up  1.0 1.0
>> 10 hdd-small   4.54776 osd.10   up  1.0 1.0
>> 26 hdd-small   4.54776 osd.26   up  1.0 1.0
>> 14  notinuse   5.45741 osd.14   up  1.0 1.0
>> 12   ssd   0.21767 osd.12   up  1.0 1.0
>> 24   ssd   0.21767 osd.24   up  1.0 1.0
>> -542.50967 host server3
>>  9   hdd-big   9.09511 osd.9up  1.0 1.0
>> 16   hdd-big   9.09511 osd.16   up  1.0 1.0
>> 19   hdd-big   9.09511 osd.19   up  1.0 1.0
>>  3 hdd-small   4.54776 osd.3up  1.0 1.0
>>  5 hdd-small   4.54776 osd.5up  1.0 1.0
>>  6 hdd-small   4.54776 osd.6up  1.0 1.0
>> 11  notinuse   0.45424 osd.11   up  1.0 1.0
>> 13  notinuse   0.90907 osd.13   up  1.0 1.0
>> 25   ssd   0.21776 osd.25   up  1.0 1.0
>> -241.24626 host server4
>>  2   hdd-big   9.09511 osd.2up  1.0 1.0
>> 17   hdd-big   9.09511 osd.17   up  1.0 1.0
>> 18   hdd-big   9.09511 osd.18   up  1.0 1.0
>>  0 hdd-small   4.54776 osd.0up  1.0 1.0
>>  1 hdd-small   4.54776 osd.1up  1.0 1.0
>> 22 hdd-small   4.54776 osd.22   up  1.0 1.0
>>  4  notinuse   0.0 osd.4up  1.0 1.0
>> 23   ssd   0.21767 osd.23   up  1.0 1.0
>> [10:04:27] server2.place6:/var/log/ceph#
>>
>> We first had about 160 pgs stuck in creating+activating. After
>> restarting all osds in the ssd class one by one, it shifted to
>> 100 activating and 60  creating+activating:
>>
>>
>> [10:00:18] server2.place6:/var/log/ceph# ceph -s
>>   cluster:
>> id: 1ccd84f6-e362-4c50-9ffe-59436745e445
>> health: HEALTH_ERR
>> 1803200/13770981 objects misplaced (13.094%)
>> Reduced data availability: 175 pgs inactive
>> Degraded data redundancy: 857547/13770981 objects degraded
>> (6.227%), 197 pgs degraded, 123 pgs undersized
>> 39 slow requests are blocked > 32 sec
>> 40 stuck requests are blocked > 4096 sec
>>
>>   services:
>> mon: 3 daemons, quorum black1,black2,black3
>> mgr: black3(active), standbys: black2, black1
>> osd: 27 osds: 27 up, 27 in; 156 remapped pgs
>>
>>   data:
>> pools:   2 pools, 1024 pgs
>> objects: 4482k objects, 17725 GB
>> usage:   55542 GB used, 83188 GB / 135 TB avail
>> pgs: 17.090% pgs not active
>>  857547/13770981 objects degraded (6.227%)
>>  1803200/13770981 objects misplaced (13.094%)
>>  640 active+clean
>>  105 active+undersized+degraded+remapped+backfill_wait
>>  100 activating
>>   

[ceph-users] Stuck in creating+activating

2018-03-17 Thread Nico Schottelius

Good morning,

some days ago we created a new pool with 512 pgs and originally 5 osds.
We use the device class "ssd" and a crush rule that maps all data for
the pool "ssd" to the osds in the ssd device class.

While the pool was being created, one of the ssds failed, and we are left
with 4 osds:

[10:00:22] server2.place6:/var/log/ceph# ceph osd tree
ID CLASS WEIGHTTYPE NAMESTATUS REWEIGHT PRI-AFF
-1   135.12505 root default
-751.36911 host server2
15   hdd-big   9.09511 osd.15   up  1.0 1.0
20   hdd-big   9.09511 osd.20   up  1.0 1.0
21   hdd-big   9.09511 osd.21   up  1.0 1.0
 7 hdd-small   4.54776 osd.7up  1.0 1.0
 8 hdd-small   4.54776 osd.8up  1.0 1.0
10 hdd-small   4.54776 osd.10   up  1.0 1.0
26 hdd-small   4.54776 osd.26   up  1.0 1.0
14  notinuse   5.45741 osd.14   up  1.0 1.0
12   ssd   0.21767 osd.12   up  1.0 1.0
24   ssd   0.21767 osd.24   up  1.0 1.0
-542.50967 host server3
 9   hdd-big   9.09511 osd.9up  1.0 1.0
16   hdd-big   9.09511 osd.16   up  1.0 1.0
19   hdd-big   9.09511 osd.19   up  1.0 1.0
 3 hdd-small   4.54776 osd.3up  1.0 1.0
 5 hdd-small   4.54776 osd.5up  1.0 1.0
 6 hdd-small   4.54776 osd.6up  1.0 1.0
11  notinuse   0.45424 osd.11   up  1.0 1.0
13  notinuse   0.90907 osd.13   up  1.0 1.0
25   ssd   0.21776 osd.25   up  1.0 1.0
-241.24626 host server4
 2   hdd-big   9.09511 osd.2up  1.0 1.0
17   hdd-big   9.09511 osd.17   up  1.0 1.0
18   hdd-big   9.09511 osd.18   up  1.0 1.0
 0 hdd-small   4.54776 osd.0up  1.0 1.0
 1 hdd-small   4.54776 osd.1up  1.0 1.0
22 hdd-small   4.54776 osd.22   up  1.0 1.0
 4  notinuse   0.0 osd.4up  1.0 1.0
23   ssd   0.21767 osd.23   up  1.0 1.0
[10:04:27] server2.place6:/var/log/ceph#

We first had about 160 pgs stuck in creating+activating. After
restarting all osds in the ssd class one by one, it shifted to
100 activating and 60  creating+activating:


[10:00:18] server2.place6:/var/log/ceph# ceph -s
  cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_ERR
1803200/13770981 objects misplaced (13.094%)
Reduced data availability: 175 pgs inactive
Degraded data redundancy: 857547/13770981 objects degraded 
(6.227%), 197 pgs degraded, 123 pgs undersized
39 slow requests are blocked > 32 sec
40 stuck requests are blocked > 4096 sec

  services:
mon: 3 daemons, quorum black1,black2,black3
mgr: black3(active), standbys: black2, black1
osd: 27 osds: 27 up, 27 in; 156 remapped pgs

  data:
pools:   2 pools, 1024 pgs
objects: 4482k objects, 17725 GB
usage:   55542 GB used, 83188 GB / 135 TB avail
pgs: 17.090% pgs not active
 857547/13770981 objects degraded (6.227%)
 1803200/13770981 objects misplaced (13.094%)
 640 active+clean
 105 active+undersized+degraded+remapped+backfill_wait
 100 activating
 60  creating+activating
 50  active+recovery_wait+degraded
 21  active+remapped+backfill_wait
 16  active+recovery_wait+undersized+degraded+remapped
 15  activating+degraded
 9   active+recovery_wait+degraded+remapped
 3   active+recovery_wait+remapped
 3   active+recovery_wait
 2   active+undersized+degraded+remapped+backfilling

  io:
client:   519 kB/s rd, 38025 kB/s wr, 4 op/s rd, 20 op/s wr
recovery: 1694 kB/s, 0 objects/s

I looked into the archives, but did not find anything that directly
related to our situation. We are using ceph 12.2.4.

An excerpt from our ceph health detail looks like this:

HEALTH_ERR 1803116/13770981 objects misplaced (13.094%); Reduced data 
availability: 175 pgs inactive; Degraded data redundancy: 856881/13770981 
objects degraded (6.222%), 197 pgs degraded, 123 pgs undersized; 53 slow 
requests are blocked > 32 sec; 40 stuck requests are blocked > 4096 sec
OBJECT_MISPLACED 1803116/13770981 objects misplaced (13.094%)
PG_AVAILABILITY Reduced data availability: 175 pgs inactive
pg 7.118 is stuck inactive for 183000.110669, current state 
creating+activating, last acting [12,23,25]
pg 7.11a is stuck inactive for 38143.679989, current state activating, last 
acting [25,24,23]
pg 7.121 is stuck inactive for 38143.670149, current state activating, last 
acting [25,23,12]
pg 7.123 is stuck 

Re: [ceph-users] Ceph iSCSI is a prank?

2018-02-28 Thread Nico Schottelius

Max,

I understand your frustration.
However, last time I checked, ceph was open source.

Some of you might not remember, but one major reason why open source is
great is that YOU CAN DO your own modifications.

If you need a change like iSCSI support and it isn't there,
it is probably best if you implement it yourself.

Even if a lot of people voluntarily contribute to open source,
and even if there is a company behind ceph as a product, there
is no right to a feature.

Best,

Nico

p.s.: If your answer is "I don't have the experience to implement it", then
my answer will be "hire somebody", and if your answer is "I don't have the
money", my answer is "then you don't have the resources for that feature".
(from: the book of reality)

Max Cuttins  writes:

> Sorry for being rude Ross,
>
> I have been following Ceph since 2014, waiting for iSCSI support in order
> to use it with Xen.
> When it finally seemed to be implemented, the OS requirements turned out
> to be unrealistic.
> Seems like a bad prank. 4 years of waiting for this... and still no true
> support yet.

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-02-23 Thread Nico Schottelius

A very interesting question, and I would add a follow-up question:

Is there an easy way to add an external DB/WAL device to an existing
OSD?

I suspect that it might be something along the lines of:

- stop osd
- create a link in ...ceph/osd/ceph-XX/block.db to the target device
- (maybe run some kind of osd mkfs ?)
- start osd

Has anyone done this so far, or does anyone have recommendations on how to do it?
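Spelled out, my guess above would look roughly like this (completely
untested, device path and osd id made up):

  systemctl stop ceph-osd@12
  # point the osd at the new db device
  ln -s /dev/disk/by-partuuid/<uuid-of-new-db-partition> \
      /var/lib/ceph/osd/ceph-12/block.db
  chown -h ceph:ceph /var/lib/ceph/osd/ceph-12/block.db
  # presumably some step is missing here to initialise the db on the
  # new device - this is exactly the part I am unsure about
  systemctl start ceph-osd@12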

Which also makes me wonder: what is actually the format of WAL and
BlockDB in bluestore? Is there any documentation available about it?

Best,

Nico


Caspar Smit  writes:

> Hi All,
>
> What would be the proper way to preventively replace a DB/WAL SSD (when it
> is nearing its DWPD/TBW limit and has not failed yet).
>
> It hosts DB partitions for 5 OSD's
>
> Maybe something like:
>
> 1) ceph osd reweight 0 the 5 OSD's
> 2) let backfilling complete
> 3) destroy/remove the 5 OSD's
> 4) replace SSD
> 5) create 5 new OSD's with separate DB partition on new SSD
>
> When these 5 OSD's are big HDD's (8TB) a LOT of data has to be moved so i
> thought maybe the following would work:
>
> 1) ceph osd set noout
> 2) stop the 5 OSD's (systemctl stop)
> 3) 'dd' the old SSD to a new SSD of same or bigger size
> 4) remove the old SSD
> 5) start the 5 OSD's (systemctl start)
> 6) let backfilling/recovery complete (only delta data between OSD stop and
> now)
> 6) ceph osd unset noout
>
> Would this be a viable method to replace a DB SSD? Any udev/serial nr/uuid
> stuff preventing this to work?
>
> Or is there another 'less hacky' way to replace a DB SSD without moving too
> much data?
>
> Kind regards,
> Caspar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restoring keyring capabilities

2018-02-16 Thread Nico Schottelius

It seems your monitor capabilities are different from mine:

root@server3:/opt/ungleich-tools# ceph -k 
/var/lib/ceph/mon/ceph-server3/keyring -n mon. auth list
2018-02-16 20:34:59.257529 7fe0d5c6b700  0 librados: mon. authentication error 
(13) Permission denied
[errno 13] error connecting to the cluster
root@server3:/opt/ungleich-tools# cat /var/lib/ceph/mon/ceph-server3/keyring
[mon.]
key = AQCp9IVa2GmYARAAVvCGfNpXfxOoUf119KAq1g==

Where you have

> root@ceph-mon1:/# cat /var/lib/ceph/mon/ceph-ceph-mon1/keyring
> [mon.]
> key = AQD1y3RapVDCNxAAmInc8D3OPZKuTVeUcNsPug==
> caps mon = "allow *"

Which probably explains why it works for you, but not for me.
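My next guess would be to add the missing caps line to the monitor
keyring on each mon host and restart the monitors, i.e. make the file
look like yours:

  [mon.]
  key = AQCp9IVa2GmYARAAVvCGfNpXfxOoUf119KAq1g==
  caps mon = "allow *"

followed by e.g. "systemctl restart ceph-mon@server3" - completely
untested though, so take it as a guess.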

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Restoring keyring capabilities

2018-02-16 Thread Nico Schottelius

Saw that too; however, it does not work:

root@server3:/var/lib/ceph/mon/ceph-server3# ceph -n mon. --keyring keyring  
auth caps client.admin mds 'allow *' osd 'allow *' mon 'allow *'
2018-02-16 17:23:38.154282 7f7e257e3700  0 librados: mon. authentication error 
(13) Permission denied
[errno 13] error connecting to the cluster

... which kind of makes sense, as the mon. key does not have
capabilities for it. Then again, I wonder how monitors actually talk to
each other...

Michel Raabe <rmic...@devnu11.net> writes:

> On 02/16/18 @ 18:21, Nico Schottelius wrote:
>> on a test cluster I issued a few seconds ago:
>>
>>   ceph auth caps client.admin mgr 'allow *'
>>
>> instead of what I really wanted to do
>>
>>   ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
>>   mds allow
>>
>> Now any access to the cluster using client.admin correctly results in
>> client.admin authentication error (13) Permission denied.
>>
>> Is there any way to modify the keyring capabilities "from behind",
>> i.e. by modifying the rocksdb of the monitors or similar?
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015474.html
>
> Not verified.
>
> Regards,
> Michel


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Restoring keyring capabilities

2018-02-16 Thread Nico Schottelius

Hello,

on a test cluster I issued a few seconds ago:

  ceph auth caps client.admin mgr 'allow *'

instead of what I really wanted to do

  ceph auth caps client.admin mgr 'allow *' mon 'allow *' osd 'allow *' \
  mds allow

Now any access to the cluster using client.admin correctly results in
client.admin authentication error (13) Permission denied.

Is there any way to modify the keyring capabilities "from behind",
i.e. by modifying the rocksdb of the monitors or similar?

If the answer is no, it's not a big problem, as we can easily destroy
the cluster, but if the answer is yes, it would be interesting to know
how to get out of this.

Best,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is there a "set pool readonly" command?

2018-02-11 Thread Nico Schottelius

Hello,

we have one pool in which about 10 disks failed last week (fortunately
mostly sequentially), and which now has some pgs that are only left on
one disk.

Is there a command to set one pool into "read-only" mode or even into a
"recovery-io-only" mode, so that the only thing ceph is doing is
recovering, and no client i/o will disturb that process?
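The closest things I am aware of are rather blunt: "ceph osd set pause"
stops all client reads and writes cluster-wide, and abusing pool quotas
blocks writes to a single pool, e.g.

  ceph osd pool set-quota <pool> max_bytes 1   # pool reports full, writes block
  ceph osd pool set-quota <pool> max_bytes 0   # back to unlimited

but neither is a real per-pool read-only or recovery-only mode, hence the
question.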

Best,

Nico



--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-disk vs. ceph-volume: both error prone

2018-02-09 Thread Nico Schottelius

Dear list,

for a few days we have been dissecting ceph-disk and ceph-volume to find
out what the appropriate way of creating partitions for ceph is.

For years already I have found ceph-disk (and especially ceph-deploy) very
error prone, and we at ungleich are considering rewriting both into a
ceph-block-do-what-I-want tool.

Considering only bluestore, I see that ceph-disk creates two partitions:

Device  StartEndSectors   Size Type
/dev/sde12048 206847 204800   100M Ceph OSD
/dev/sde2  206848 2049966046 2049759199 977.4G unknown

Does somebody know what exactly belongs on the xfs-formatted first
partition, and how the data/wal/db device sde2 is formatted?

What I would really like to know is how we can best extract this
information, so that we no longer depend on ceph-{disk,volume}.

Any pointer to the on-disk format would be much appreciated!
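For the small xfs partition, simply mounting it already shows what
ceph-disk puts there (file list below is from memory, treat it as
approximate):

  mkdir -p /mnt/osd-meta
  mount -o ro /dev/sde1 /mnt/osd-meta
  ls -l /mnt/osd-meta
  # typically: block (symlink to the data partition), block_uuid,
  # ceph_fsid, fsid, keyring, magic, type ("bluestore"), whoami, ...
  umount /mnt/osd-meta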

Best,

Nico




--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Inactive PGs rebuild is not priorized

2018-02-03 Thread Nico Schottelius

Good morning,

after another disk failure, we currently have 7 inactive pgs [1], which
are stalling IO from the affected VMs.

It seems that ceph, when rebuilding, does not focus on repairing
the inactive PGs first, which surprised us quite a lot:

It does not repair the inactive ones first, but mixes the inactive PGs
with active+undersized+degraded+remapped+backfill_wait ones.

Is this a misconfiguration on our side or a design aspect of ceph?

I have attached ceph -s from three times while rebuilding below.

First the number of active+undersized+degraded+remapped+backfill_wait
PGs decreases, and only much later does the number of
undersized+degraded+remapped+backfill_wait+peered PGs decrease.

If anyone could comment on this, I would be very thankful to know how to
progress here, as we had 6 disk failures this week and each time we had
inactive pgs that stalled the VM i/o.
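The only related knobs I have found so far are the force-recovery /
force-backfill commands (if our luminous build supports them), which we
have not tried yet:

  ceph pg dump_stuck inactive
  ceph pg force-recovery <pgid> [<pgid> ...]
  ceph pg force-backfill <pgid> [<pgid> ...]

No idea yet whether that actually reorders the recovery queue in 12.2.x.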

Best,

Nico


[1]
  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
108752/3920931 objects misplaced (2.774%)
Reduced data availability: 7 pgs inactive
Degraded data redundancy: 419786/3920931 objects degraded 
(10.706%), 147 pgs unclean, 140 pgs degraded, 140 pgs und
ersized

  services:
mon: 3 daemons, quorum server5,server3,server2
mgr: server5(active), standbys: server3, server2
osd: 53 osds: 52 up, 52 in; 147 remapped pgs

  data:
pools:   2 pools, 1280 pgs
objects: 1276k objects, 4997 GB
usage:   13481 GB used, 26853 GB / 40334 GB avail
pgs: 0.547% pgs not active
 419786/3920931 objects degraded (10.706%)
 108752/3920931 objects misplaced (2.774%)
 1133 active+clean
 108  active+undersized+degraded+remapped+backfill_wait
 25   active+undersized+degraded+remapped+backfilling
 7active+remapped+backfill_wait
 6undersized+degraded+remapped+backfilling+peered
 1undersized+degraded+remapped+backfill_wait+peered

  io:
client:   29980 B/s rd,  kB/s wr, 17 op/s rd, 74 op/s wr
recovery: 71727 kB/s, 17 objects/s

[2]

[11:20:15] server3:~# ceph -s
  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
103908/3920967 objects misplaced (2.650%)
Reduced data availability: 7 pgs inactive
Degraded data redundancy: 380860/3920967 objects degraded (9.713%), 
144 pgs unclean, 137 pgs degraded, 137 pgs undersized

  services:
mon: 3 daemons, quorum server5,server3,server2
mgr: server5(active), standbys: server3, server2
osd: 53 osds: 52 up, 52 in; 144 remapped pgs

  data:
pools:   2 pools, 1280 pgs
objects: 1276k objects, 4997 GB
usage:   13630 GB used, 26704 GB / 40334 GB avail
pgs: 0.547% pgs not active
 380860/3920967 objects degraded (9.713%)
 103908/3920967 objects misplaced (2.650%)
 1136 active+clean
 105  active+undersized+degraded+remapped+backfill_wait
 25   active+undersized+degraded+remapped+backfilling
 7active+remapped+backfill_wait
 6undersized+degraded+remapped+backfilling+peered
 1undersized+degraded+remapped+backfill_wait+peered

  io:
client:   40201 B/s rd, 1189 kB/s wr, 16 op/s rd, 74 op/s wr
recovery: 54519 kB/s, 13 objects/s


[3]


  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
88382/3921066 objects misplaced (2.254%)
Reduced data availability: 4 pgs inactive
Degraded data redundancy: 285528/3921066 objects degraded (7.282%), 
127 pgs unclean
, 121 pgs degraded, 115 pgs undersized
14 slow requests are blocked > 32 sec

  services:
mon: 3 daemons, quorum server5,server3,server2
mgr: server5(active), standbys: server3, server2
osd: 53 osds: 52 up, 52 in; 121 remapped pgs

  data:
pools:   2 pools, 1280 pgs
objects: 1276k objects, 4997 GB
usage:   14014 GB used, 26320 GB / 40334 GB avail
pgs: 0.313% pgs not active
 285528/3921066 objects degraded (7.282%)
 88382/3921066 objects misplaced (2.254%)
 1153 active+clean
 78   active+undersized+degraded+remapped+backfill_wait
 33   active+undersized+degraded+remapped+backfilling
 6active+recovery_wait+degraded
 6active+remapped+backfill_wait
 2undersized+degraded+remapped+backfill_wait+peered
 2undersized+degraded+remapped+backfilling+peered

  io:
client:   56370 B/s rd, 5304 kB/s wr, 11 op/s rd, 44 op/s wr
recovery: 37838 kB/s, 9 objects/s


And our tree:

[12:53:57] server4:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
-1   39.84532 root default
-67.28383 host server1
25   hdd  4.5 osd.25   up  1.0 1.0
48   ssd  0.22198 osd.48   up  1.0 1.0
49   ssd  0.22198  

Re: [ceph-users] [Best practise] Adding new data center

2018-01-29 Thread Nico Schottelius

Hey Wido,

> [...]
> Like I said, latency, latency, latency. That's what matters. Bandwidth
> usually isn't a real problem.

I imagined that.

> What latency do you have with a 8k ping between hosts?

As the link will be setup this week, I cannot tell yet.

However, we are currently on a 65 km link with ~2 ms latency.
Within our data center, we currently see ~0.4 ms latency
(both measured with 8k pings).
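(An "8k ping" here simply means a ping with an 8192 byte payload,
roughly:

  ping6 -c 100 -s 8192 otherhost

with "otherhost" being a placeholder for the remote ceph node.)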

Do you see similar latencies in your setup?

Best,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Best practise] Adding new data center

2018-01-29 Thread Nico Schottelius

Good evening list,

we are soon expanding our data center [0] to a new location [1].

We are mainly offering VPS / VM Hosting, so rbd is our main interest.
We have a low-latency 10 Gbit/s link to our other location [2], and
we are wondering what the best practice for expanding is.

Naturally we are thinking about creating a new ceph cluster that is
independent from the first location, since connection interruptions
(unlikely) or separate power outages (more likely) are a concern.

Given that we would be running two different ceph clusters, we are
thinking about rbd mirroring, so that we can (partially) mirror one side
to the other or vice versa.

However, with this approach we lose the possibility of having very big
rbd images (big as in tens to hundreds of TBs), as the storage is divided.

My question to the list is, how have you handled this situation so far?

Would you also recommend splitting, or have you expanded ceph clusters
over several kilometers so far? And with what experience?

I am very curious to hear your answers!

Best,

Nico



[0] https://datacenterlight.ch
[1] Linthal, in pretty Glarus
https://www.google.ch/maps/place/Linthal,+8783+Glarus+S%C3%BCd/
[2] Schwanden, also pretty
https://www.google.ch/maps/place/Schwanden,+8762+Glarus+S%C3%BCd/

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Nico Schottelius

Hey Burkhard,

we did actually restart osd.61, which led to the current status.

Best,

Nico


Burkhard Linke <burkhard.li...@computational.bio.uni-giessen.de> writes:
>
> On 01/23/2018 08:54 AM, Nico Schottelius wrote:
>> Good morning,
>>
>> the osd.61 actually just crashed and the disk is still intact. However,
>> after 8 hours of rebuilding, the unfound objects are still missing:
>
> *snipsnap*
>>
>>
>> Is there any chance to recover those pgs or did we actually lose data
>> with a 2 disk failure?
>>
>> And is there any way out  of this besides going with
>>
>>  ceph pg {pg-id} mark_unfound_lost revert|delete
>>
>> ?
>
> Just my 2 cents:
>
> If the disk is still intact and the data is still readable, you can try
> to export the pg content with ceph-objectstore-tool, and import it into
> another OSD.
>
> On the other hand: if the disk is still intact, just restart the OSD?
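For the archive, my understanding of the export/import path Burkhard
describes, as an untested sketch (osd ids, pg id and paths are examples;
both OSDs must be stopped while the tool runs):

  # on the host with the intact disk
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-61 \
      --pgid 4.fa --op export --file /tmp/pg-4.fa.export

  # on the host of the target OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
      --op import --file /tmp/pg-4.fa.export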

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Nico Schottelius

... while trying to locate which VMs are potentially affected by a
revert/delete, we noticed that

root@server1:~# rados -p one-hdd ls

hangs. Where does ceph store the index of block devices found in a pool?
And is it possible that this information is in one of the damaged pgs?

Nico


Nico Schottelius <nico.schottel...@ungleich.ch> writes:

> Good morning,
>
> the osd.61 actually just crashed and the disk is still intact. However,
> after 8 hours of rebuilding, the unfound objects are still missing:
>
> root@server1:~# ceph -s
>   cluster:
> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 111436/3017766 objects misplaced (3.693%)
> 9377/1005922 objects unfound (0.932%)
> Reduced data availability: 84 pgs inactive
> Degraded data redundancy: 277034/3017766 objects degraded 
> (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
> mon server2 is low on available space
>
>   services:
> mon: 3 daemons, quorum server5,server3,server2
> mgr: server5(active), standbys: server2, 2, 0, server3
> osd: 54 osds: 54 up, 54 in; 84 remapped pgs
>  flags noscrub,nodeep-scrub
>
>   data:
> pools:   3 pools, 1344 pgs
> objects: 982k objects, 3837 GB
> usage:   10618 GB used, 39030 GB / 49648 GB avail
> pgs: 6.250% pgs not active
>  277034/3017766 objects degraded (9.180%)
>  111436/3017766 objects misplaced (3.693%)
>  9377/1005922 objects unfound (0.932%)
>  1260 active+clean
>  84   recovery_wait+undersized+degraded+remapped+peered
>
>   io:
> client:   68960 B/s rd, 20722 kB/s wr, 12 op/s rd, 77 op/s wr
>
> We tried restarting osd.61, but ceph health detail does not change
> anymore:
>
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 111436/3017886 objects 
> misplaced (3.69
> 3%); 9377/1005962 objects unfound (0.932%); Reduced data availability: 84 pgs 
> inacti
> ve; Degraded data redundancy: 277034/3017886 objects degraded (9.180%), 84 
> pgs uncle
> an, 84 pgs degraded, 84 pgs undersized; mon server2 is low on available space
> OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
> OBJECT_MISPLACED 111436/3017886 objects misplaced (3.693%)
> OBJECT_UNFOUND 9377/1005962 objects unfound (0.932%)
> pg 4.fa has 117 unfound objects
> pg 4.ff has 107 unfound objects
> pg 4.fd has 113 unfound objects
> ...
> pg 4.2a has 108 unfound objects
>
> PG_AVAILABILITY Reduced data availability: 84 pgs inactive
> pg 4.2a is stuck inactive for 64117.189552, current state 
> recovery_wait+undersiz
> ed+degraded+remapped+peered, last acting [61]
> pg 4.31 is stuck inactive for 64117.147636, current state 
> recovery_wait+undersiz
> ed+degraded+remapped+peered, last acting [61]
> pg 4.32 is stuck inactive for 64117.178461, current state 
> recovery_wait+undersiz
> ed+degraded+remapped+peered, last acting [61]
> pg 4.34 is stuck inactive for 64117.150475, current state 
> recovery_wait+undersiz
> ed+degraded+remapped+peered, last acting [61]
> ...
>
>
> PG_DEGRADED Degraded data redundancy: 277034/3017886 objects degraded 
> (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
> pg 4.2a is stuck unclean for 131612.984555, current state 
> recovery_wait+undersized+degraded+remapped+peered, last acting [61]
> pg 4.31 is stuck undersized for 221.568468, current state 
> recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>
>
> Is there any chance to recover those pgs or did we actually lose data
> with a 2 disk failure?
>
> And is there any way out  of this besides going with
>
> ceph pg {pg-id} mark_unfound_lost revert|delete
>
> ?
>
> Best,
>
> Nico
>
> p.s.: the ceph 4.2a query:
>
> {
> "state": "recovery_wait+undersized+degraded+remapped+peered",
> "snap_trimq": "[]",
> "epoch": 17879,
> "up": [
> 17,
> 13,
> 25
> ],
> "acting": [
> 61
> ],
> "backfill_targets": [
> "13",
> "17",
> "25"
> ],
> "actingbackfill": [
> "13",
> "17",
> "25",
> "61"
> ],
> "info": {
> "pgid": "4.2a",
> "last_update": "17529'53875",
> "last_complete": "17217'45447",
> "l

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread Nico Schottelius
  "log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 0,
"last_epoch_clean": 0,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "0.00",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "0.00",
"last_clean_scrub_stamp": "0.00",
"log_size": 0,
"ondisk_log_size": 0,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
    "num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
17,
13,
25
],
"acting": [
61
],
"blocked_by": [],
"up_primary": 17,
"acting_primary": 61
},
"empty": 0,
"dne": 0,
"incomplete": 1,
"last_epoch_started": 17137,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
{
"peer": "25",
"pgid": "4.2a",
"last_update": "17529'53875",
"last_complete": "17529'53875",
"log_tail": "17090'43812",
"last_user_version": 53875,
"last_backfill": "MIN",
"last_backfill_bitwise": 1,
"purged_snaps": [
{
"start": "1",
"length": "3"
},
{
"start": "6",
"length": "8"
},
{
"start": "10",
"length": "2"
}
],
"history": {
"epoch_created": 9134,
"epoch_pool_created": 9134,
"last_epoch_started": 17528,
"last_interval_started": 17527,
"last_epoch_clean": 17079,
"last_interval_clean": 17078,
"last_epoc

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread Nico Schottelius

While writing, yet another disk (osd.61 now) died and now we have
172 pgs down:

[19:32:35] server2:~# ceph -s
  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
21033/2263701 objects misplaced (0.929%)
Reduced data availability: 186 pgs inactive, 172 pgs down
Degraded data redundancy: 67370/2263701 objects degraded (2.976%), 
219 pgs unclean, 46 pgs degraded, 46 pgs undersized
mon server2 is low on available space

  services:
mon: 3 daemons, quorum server5,server3,server2
mgr: server5(active), standbys: server2, 2, 0, server3
osd: 54 osds: 53 up, 53 in; 47 remapped pgs
 flags noscrub,nodeep-scrub

  data:
pools:   3 pools, 1344 pgs
objects: 736k objects, 2889 GB
usage:   8517 GB used, 36474 GB / 44991 GB avail
pgs: 13.839% pgs not active
 67370/2263701 objects degraded (2.976%)
 21033/2263701 objects misplaced (0.929%)
 1125 active+clean
 172  down
 26   active+undersized+degraded+remapped+backfilling
 14   undersized+degraded+remapped+backfilling+peered
 6active+undersized+degraded+remapped+backfill_wait
 1active+remapped+backfill_wait

  io:
client:   835 kB/s rd, 262 kB/s wr, 16 op/s rd, 25 op/s wr
recovery: 102 MB/s, 26 objects/s

What is the most sensible way to get out of this situation?





David Turner <drakonst...@gmail.com> writes:

> I do remember seeing that exactly. As the number of recovery_wait pgs
> decreased, the number of unfound objects decreased until they were all
> found.  Unfortunately it blocked some IO from happening during the
> recovery, but in the long run we ended up with full data integrity again.
>
> On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Hey David,
>>
>> thanks for the fast answer. All our pools are running with size=3,
>> min_size=2 and the two disks were in 2 different hosts.
>>
>> What I am a bit worried about is the output of "ceph pg 4.fa query" (see
>> below) that indicates that ceph already queried all other hosts and did
>> not find the data anywhere.
>>
>> Do you remember having seen something similar?
>>
>> Best,
>>
>> Nico
>>
>> David Turner <drakonst...@gmail.com> writes:
>>
>> > I have had the same problem before with unfound objects that happened
>> while
>> > backfilling after losing a drive. We didn't lose drives outside of the
>> > failure domains and ultimately didn't lose any data, but we did have to
>> > wait until after all of the PGs in recovery_wait state were caught up.
>> So
>> > if the 2 disks you lost were in the same host and your CRUSH rules are
>> set
>> > so that you can lose a host without losing data, then the cluster will
>> > likely find all of the objects by the time it's done backfilling.  With
>> > only losing 2 disks, I wouldn't worry about the missing objects not
>> > becoming found unless you're pool size=2.
>> >
>> > On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
>> > nico.schottel...@ungleich.ch> wrote:
>> >
>> >>
>> >> Hello,
>> >>
>> >> we added about 7 new disks yesterday/today and our cluster became very
>> >> slow. While the rebalancing took place, 2 of the 7 new added disks
>> >> died.
>> >>
>> >> Our cluster is still recovering, however we spotted that there are a lot
>> >> of unfound objects.
>> >>
>> >> We lost osd.63 and osd.64, which seem not to be involved into the sample
>> >> pg that has unfound objects.
>> >>
>> >> We were wondering why there are unfound objects, where they are coming
>> >> from and if there is a way to recover them?
>> >>
>> >> Any help appreciated,
>> >>
>> >> Best,
>> >>
>> >> Nico
>> >>
>> >>
>> >> Our status is:
>> >>
>> >>   cluster:
>> >> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>> >> health: HEALTH_WARN
>> >> 261953/3006663 objects misplaced (8.712%)
>> >> 9377/1002221 objects unfound (0.936%)
>> >> Reduced data availability: 176 pgs inactive
>> >> Degraded data redundancy: 609338/3006663 objects degraded
>> >> (20.266%), 243 pgs unclea
>> >> n, 222 pgs degraded, 213 pgs undersized
>> &g

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread Nico Schottelius

Hey David,

thanks for the fast answer. All our pools are running with size=3,
min_size=2 and the two disks were in 2 different hosts.

What I am a bit worried about is the output of "ceph pg 4.fa query" (see
below) that indicates that ceph already queried all other hosts and did
not find the data anywhere.

Do you remember having seen something similar?

Best,

Nico

David Turner <drakonst...@gmail.com> writes:

> I have had the same problem before with unfound objects that happened while
> backfilling after losing a drive. We didn't lose drives outside of the
> failure domains and ultimately didn't lose any data, but we did have to
> wait until after all of the PGs in recovery_wait state were caught up.  So
> if the 2 disks you lost were in the same host and your CRUSH rules are set
> so that you can lose a host without losing data, then the cluster will
> likely find all of the objects by the time it's done backfilling.  With
> only losing 2 disks, I wouldn't worry about the missing objects not
> becoming found unless you're pool size=2.
>
> On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Hello,
>>
>> we added about 7 new disks yesterday/today and our cluster became very
>> slow. While the rebalancing took place, 2 of the 7 new added disks
>> died.
>>
>> Our cluster is still recovering, however we spotted that there are a lot
>> of unfound objects.
>>
>> We lost osd.63 and osd.64, which seem not to be involved into the sample
>> pg that has unfound objects.
>>
>> We were wondering why there are unfound objects, where they are coming
>> from and if there is a way to recover them?
>>
>> Any help appreciated,
>>
>> Best,
>>
>> Nico
>>
>>
>> Our status is:
>>
>>   cluster:
>> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>> health: HEALTH_WARN
>> 261953/3006663 objects misplaced (8.712%)
>> 9377/1002221 objects unfound (0.936%)
>> Reduced data availability: 176 pgs inactive
>> Degraded data redundancy: 609338/3006663 objects degraded
>> (20.266%), 243 pgs unclea
>> n, 222 pgs degraded, 213 pgs undersized
>> mon server2 is low on available space
>>
>>   services:
>> mon: 3 daemons, quorum server5,server3,server2
>> mgr: server5(active), standbys: 2, server2, 0, server3
>> osd: 54 osds: 54 up, 54 in; 234 remapped pgs
>>
>>   data:
>> pools:   3 pools, 1344 pgs
>> objects: 978k objects, 3823 GB
>> usage:   9350 GB used, 40298 GB / 49648 GB avail
>> pgs: 13.095% pgs not active
>>  609338/3006663 objects degraded (20.266%)
>>  261953/3006663 objects misplaced (8.712%)
>>  9377/1002221 objects unfound (0.936%)
>>  1101 active+clean
>>  84   recovery_wait+undersized+degraded+remapped+peered
>>  82   undersized+degraded+remapped+backfill_wait+peered
>>  23   active+undersized+degraded+remapped+backfill_wait
>>  18   active+remapped+backfill_wait
>>  14   active+undersized+degraded+remapped+backfilling
>>  10   undersized+degraded+remapped+backfilling+peered
>>  9active+recovery_wait+degraded
>>  3active+remapped+backfilling
>>
>>   io:
>> client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
>> recovery: 90148 kB/s, 22 objects/s
>>
>> Looking at the unfound objects:
>>
>> [17:32:17] server1:~# ceph health detail
>> HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221
>> objects unfound (0.936%); Reduced data availability: 176 pgs inactive;
>> Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244
>> pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on
>> available space
>> OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
>> OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
>> pg 4.fa has 117 unfound objects
>> pg 4.ff has 107 unfound objects
>> pg 4.fd has 113 unfound objects
>> pg 4.f0 has 120 unfound objects
>> 
>>
>>
>> Output from ceph pg 4.fa query:
>>
>> {
>> "state": "recovery_wait+undersized+degraded+remapped+peered",
>> "snap_trimq": "[]",
>> "epoch": 17561,
>> "up": [
>> 8,
>> 

[ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread Nico Schottelius

Hello,

we added about 7 new disks yesterday/today and our cluster became very
slow. While the rebalancing took place, 2 of the 7 newly added disks
died.

Our cluster is still recovering, however we spotted that there are a lot
of unfound objects.

We lost osd.63 and osd.64, which seem not to be involved into the sample
pg that has unfound objects.

We were wondering why there are unfound objects, where they come
from, and whether there is a way to recover them.

Any help appreciated,

Best,

Nico


Our status is:

  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
261953/3006663 objects misplaced (8.712%)
9377/1002221 objects unfound (0.936%)
Reduced data availability: 176 pgs inactive
Degraded data redundancy: 609338/3006663 objects degraded 
(20.266%), 243 pgs unclea
n, 222 pgs degraded, 213 pgs undersized
mon server2 is low on available space

  services:
mon: 3 daemons, quorum server5,server3,server2
mgr: server5(active), standbys: 2, server2, 0, server3
osd: 54 osds: 54 up, 54 in; 234 remapped pgs

  data:
pools:   3 pools, 1344 pgs
objects: 978k objects, 3823 GB
usage:   9350 GB used, 40298 GB / 49648 GB avail
pgs: 13.095% pgs not active
 609338/3006663 objects degraded (20.266%)
 261953/3006663 objects misplaced (8.712%)
 9377/1002221 objects unfound (0.936%)
 1101 active+clean
 84   recovery_wait+undersized+degraded+remapped+peered
 82   undersized+degraded+remapped+backfill_wait+peered
 23   active+undersized+degraded+remapped+backfill_wait
 18   active+remapped+backfill_wait
 14   active+undersized+degraded+remapped+backfilling
 10   undersized+degraded+remapped+backfilling+peered
 9active+recovery_wait+degraded
 3active+remapped+backfilling

  io:
client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
recovery: 90148 kB/s, 22 objects/s

Looking at the unfound objects:

[17:32:17] server1:~# ceph health detail
HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221 objects 
unfound (0.936%); Reduced data availability: 176 pgs inactive; Degraded data 
redundancy: 612398/3006663 objects degraded (20.368%), 244 pgs unclean, 223 pgs 
degraded, 214 pgs undersized; mon server2 is low on available space
OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
pg 4.fa has 117 unfound objects
pg 4.ff has 107 unfound objects
pg 4.fd has 113 unfound objects
pg 4.f0 has 120 unfound objects



Output from ceph pg 4.fa query:

{
"state": "recovery_wait+undersized+degraded+remapped+peered",
"snap_trimq": "[]",
"epoch": 17561,
"up": [
8,
17,
25
],
"acting": [
61
],
"backfill_targets": [
"8",
"17",
"25"
],
"actingbackfill": [
"8",
"17",
"25",
"61"
],
"info": {
"pgid": "4.fa",
"last_update": "17529'85051",
"last_complete": "17217'77468",
"log_tail": "17091'75034",
"last_user_version": 85051,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "3"
},
{
"start": "6",
"length": "8"
},
{
"start": "10",
"length": "2"
}
],
"history": {
"epoch_created": 9134,
"epoch_pool_created": 9134,
"last_epoch_started": 17528,
"last_interval_started": 17527,
"last_epoch_clean": 17079,
"last_interval_clean": 17078,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 17143,
"same_interval_since": 17530,
"same_primary_since": 17515,
"last_scrub": "17090'57357",
"last_scrub_stamp": "2018-01-20 20:45:32.616142",
"last_deep_scrub": "17082'54734",
"last_deep_scrub_stamp": "2018-01-15 21:09:34.121488",
"last_clean_scrub_stamp": "2018-01-20 20:45:32.616142"
},
"stats": {
"version": "17529'85051",
"reported_seq": "218453",
"reported_epoch": "17561",
"state": "recovery_wait+undersized+degraded+remapped+peered",
"last_fresh": "2018-01-22 17:42:28.196701",
"last_change": "2018-01-22 15:00:46.507189",
"last_active": "2018-01-22 15:00:44.635399",
"last_peered": "2018-01-22 17:42:28.196701",
"last_clean": "2018-01-21 20:15:48.267209",
"last_became_active": "2018-01-22 14:53:07.918893",

[ceph-users] Adding Monitor ceph freeze, monitor 100% cpu usage

2018-01-06 Thread Nico Schottelius

Hello,

our problems with ceph monitors continue in version 12.2.2:

Adding a specific monitor causes all monitors to hang and not respond to
ceph -s or similar anymore.

Interestingly, when this monitor is running (mon.server2), the other two
monitors (mon.server3, mon.server5) randomly begin to consume 100% cpu
time until we restart them, after which the procedure repeats.

The monitor mon.server2 interestingly has a different view of the
cluster: while the other two are electing, it is in the state synchronising.

We recently noticed that the MTU of the bond0 device we use was set
to 9200, and that the vlan-tagged device bond0.2, which we use for
ceph, also had an MTU of 9200. We raised the underlying devices and
bond0 to 9204 and restarted the monitors, but the problem persists.
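To double check the path MTU between the monitor hosts, we can use
don't-fragment pings (host name is an example; 9156 = 9204 - 40 bytes
IPv6 header - 8 bytes ICMPv6 header):

  ping6 -c 3 -M do -s 9156 server3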

Does anyone have a hint on how to further debug this problem?

I have added the logs from the time when we tried to restart the monitor
on server2.

Best,

Nico



ceph-mon.server2.log.bz2
Description: BZip2 compressed data


ceph-mon.server5.log.bz2
Description: BZip2 compressed data



--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to enable jumbo frames on IPv6 only cluster?

2017-10-27 Thread Nico Schottelius

Hello,

we are running everything IPv6-only. You just need to set up the MTU on
your devices (nics, switches) correctly; nothing ceph- or IPv6-specific
is required.

If you are using SLAAC (like we do), you can also announce the MTU via
RA.
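With radvd that is roughly (interface name and prefix are placeholders,
adjust to your network):

  interface eno1
  {
      AdvSendAdvert on;
      AdvLinkMTU 9000;
      prefix 2001:db8::/64
      {
      };
  };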

Best,

Nico



Jack  writes:

> Or maybe you reach that ipv4 directly, and that ipv6 via a router, somehow
>
> Check your routing table and neighbor table
>
> On 27/10/2017 16:02, Wido den Hollander wrote:
>>
>>> Op 27 oktober 2017 om 14:22 schreef Félix Barbeira :
>>>
>>>
>>> Hi,
>>>
>>> I'm trying to configure a ceph cluster using IPv6 only but I can't enable
>>> jumbo frames. I made the definition on the
>>> 'interfaces' file and it seems like the value is applied but when I test it
>>> looks like only works on IPv4, not IPv6.
>>>
>>> It works on IPv4:
>>>
>>> root@ceph-node01:~# ping -c 3 -M do -s 8972 ceph-node02
>>>
>>> PING ceph-node02 (x.x.x.x) 8972(9000) bytes of data.
>>> 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=1 ttl=64 time=0.474 ms
>>> 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=2 ttl=64 time=0.254 ms
>>> 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=3 ttl=64 time=0.288 ms
>>>
>>
>> Verify with Wireshark/tcpdump if it really sends 9k packets. I doubt it.
>>
>>> --- ceph-node02 ping statistics ---
>>> 3 packets transmitted, 3 received, 0% packet loss, time 2000ms
>>> rtt min/avg/max/mdev = 0.254/0.338/0.474/0.099 ms
>>>
>>> root@ceph-node01:~#
>>>
>>> But *not* in IPv6:
>>>
>>> root@ceph-node01:~# ping6 -c 3 -M do -s 8972 ceph-node02
>>> PING ceph-node02(x:x:x:x:x:x:x:x) 8972 data bytes
>>> ping: local error: Message too long, mtu=1500
>>> ping: local error: Message too long, mtu=1500
>>> ping: local error: Message too long, mtu=1500
>>>
>>
>> Like Ronny already mentioned, check the switches and the receiver. There is 
>> a 1500 MTU somewhere configured.
>>
>> Wido
>>
>>> --- ceph-node02 ping statistics ---
>>> 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3024ms
>>>
>>> root@ceph-node01:~#
>>>
>>>
>>>
>>> root@ceph-node01:~# ifconfig
>>> eno1  Link encap:Ethernet  HWaddr 24:6e:96:05:55:f8
>>>   inet6 addr: 2a02:x:x:x:x:x:x:x/64 Scope:Global
>>>   inet6 addr: fe80::266e:96ff:fe05:55f8/64 Scope:Link
>>>   UP BROADCAST RUNNING MULTICAST  *MTU:9000*  Metric:1
>>>   RX packets:633318 errors:0 dropped:0 overruns:0 frame:0
>>>   TX packets:649607 errors:0 dropped:0 overruns:0 carrier:0
>>>   collisions:0 txqueuelen:1000
>>>   RX bytes:463355602 (463.3 MB)  TX bytes:498891771 (498.8 MB)
>>>
>>> loLink encap:Local Loopback
>>>   inet addr:127.0.0.1  Mask:255.0.0.0
>>>   inet6 addr: ::1/128 Scope:Host
>>>   UP LOOPBACK RUNNING  MTU:65536  Metric:1
>>>   RX packets:127420 errors:0 dropped:0 overruns:0 frame:0
>>>   TX packets:127420 errors:0 dropped:0 overruns:0 carrier:0
>>>   collisions:0 txqueuelen:1
>>>   RX bytes:179470326 (179.4 MB)  TX bytes:179470326 (179.4 MB)
>>>
>>> root@ceph-node01:~#
>>>
>>> root@ceph-node01:~# cat /etc/network/interfaces
>>> # This file describes network interfaces avaiulable on your system
>>> # and how to activate them. For more information, see interfaces(5).
>>>
>>> source /etc/network/interfaces.d/*
>>>
>>> # The loopback network interface
>>> auto lo
>>> iface lo inet loopback
>>>
>>> # The primary network interface
>>> auto eno1
>>> iface eno1 inet6 auto
>>>post-up ifconfig eno1 mtu 9000
>>> root@ceph-node01:#
>>>
>>>
>>> Please help!
>>>
>>> --
>>> Félix Barbeira.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor

2017-10-18 Thread Nico Schottelius

Hey Joao,

thanks for the pointer! Do you have a timeline for the release of
v12.2.2?

Best,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor

2017-10-18 Thread Nico Schottelius

Hello Joao,

thanks for coming back!

I copied the log of the crashing monitor to 
http://www.nico.schottelius.org/cephmonlog-2017-10-08-v2.xz

Can I somehow get access to the logs of the other monitors, without
restarting them?

I would like to not stop them, as currently we are running with 2/3
monitors and adding a new one seems not easily be possible, because
adding a new one means losing the quorum and then being unable to remove
the new one, because the quorum is lost with 2/4 nodes.

(this is what actually happened about a week ago in our cluster)

Best,

Nico


Joao Eduardo Luis <j...@suse.de> writes:

> Hi Nico,
>
> I'm sorry I forgot about your issue. Crazy few weeks.
>
> I checked the log you initially sent to the list, but it only contains
> the log from one of the monitors, and it's from the one
> synchronizing. This monitor is not stuck however - synchronizing is
> progressing, albeit slowly.
>
> Can you please share the logs of the other monitors, especially of
> those crashing?
>
>   -Joao
>
> On 10/18/2017 06:58 AM, Nico Schottelius wrote:
>>
>> Hello everyone,
>>
>> is there any solution in sight for this problem? Currently our cluster
>> is stuck with a 2 monitor configuration, as everytime we restart the one
>> server2, it crashes after some minutes (and in between the cluster is stuck).
>>
>> Should we consider downgrading to kraken to fix that problem?
>>
>> Best,
>>
>> Nico
>>
>>
>> --
>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor

2017-10-17 Thread Nico Schottelius

Hello everyone,

is there any solution in sight for this problem? Currently our cluster
is stuck with a 2-monitor configuration, as every time we restart the one
on server2, it crashes after some minutes (and in the meantime the cluster
is stuck).

Should we consider downgrading to kraken to fix that problem?

Best,

Nico


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor

2017-10-09 Thread Nico Schottelius

Good morning Joao,

thanks for your feedback! We do actually have three managers running:

  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
1/3 mons down, quorum server5,server3

  services:
mon: 3 daemons, quorum server5,server3, out of quorum: server2
mgr: 0(active), standbys: 1, 2
osd: 57 osds: 57 up, 57 in

  data:
pools:   3 pools, 1344 pgs
objects: 580k objects, 2256 GB
usage:   6778 GB used, 30276 GB / 37054 GB avail
pgs: 1344 active+clean

  io:
client:   17705 B/s rd, 14586 kB/s wr, 21 op/s rd, 70 op/s wr



Joao Eduardo Luis  writes:

> This looks a lot like a bug I fixed a week or so ago, but for which I 
> currently don't recall the ticket off the top of my head. It was basically a 
> crash each time a "ceph osd df" was called, if a mgr was not available after 
> having set the
> luminous osd require flag. I will check the log in the morning to figure out 
> whether you need to upgrade to a newer version or if this is a corner case 
> the fix missed. In the mean time, check if you have ceph-mgr running, because 
> that's
> the easy work around (assuming it's the same bug).
>
> -Joao


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [CLUSTER STUCK] Luminous cluster stuck when adding monitor

2017-10-08 Thread Nico Schottelius

After spending some hours on debugging packets on the wire, without
seeing a good reason for things not to work, the monitor on server2
eventually joined the quorum.

We were happy for some time, and then our alerting sent a message that the
quorum was lost. And indeed, the monitor on server2 had died, and now comes
the not-so-funny part: restarting the monitor makes the cluster hang again.

I will post another debug log in the next hours, now from the monitor on
server2.



Nico Schottelius <nico.schottel...@ungleich.ch> writes:

> Not sure if I mentioned before: adding a new monitor also puts the whole
> cluster into stuck state.
>
> Some minutes ago I did:
>
> root@server1:~# ceph mon add server2 2a0a:e5c0::92e2:baff:fe4e:6614
> port defaulted to 6789; adding mon.server2 at 
> [2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0
>
> And then started the daemon on server2:
>
> ceph-mon -i server2 --pid-file  /var/lib/ceph/run/mon.server2.pid -c 
> /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -d 2>&1 | 
> tee ~/cephmonlog-2017-10-08-2
>
> And now the cluster hangs (as in ceph -s does not return).
>
> Looking at mon_status of server5, shows that server5 thinks it is time
> for electing [0].
>
> When stopping the monitor on server2 and trying to remove server2 again,
> the removal command also gets stuck and never returns:
>
> root@server1:~# ceph mon rm server2
>
> As our cluster is now severely degraded, I was wondering if anyone has a
> quick hint on how to get ceph -s back working and/or remove server2
> and/or how to readd server1?
>
> Best,
>
> Nico
>
>
> [0]
>
> [10:50:38] server5:~# ceph daemon mon.server5 mon_status
> {
> "name": "server5",
> "rank": 0,
> "state": "electing",
> "election_epoch": 6087,
> "quorum": [],
> "features": {
> "required_con": "153140804152475648",
> "required_mon": [
> "kraken",
> "luminous"
> ],
> "quorum_con": "2305244844532236283",
> "quorum_mon": [
> "kraken",
> "luminous"
> ]
> },
> "outside_quorum": [],
> "extra_probe_peers": [
> "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
> ],
> "sync_provider": [],
> "monmap": {
> "epoch": 11,
> "fsid": "26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab",
> "modified": "2017-10-08 10:43:49.667986",
> "created": "2017-05-16 22:33:04.500528",
> "features": {
> "persistent": [
> "kraken",
> "luminous"
> ],
> "optional": []
> },
> "mons": [
> {
> "rank": 0,
> "name": "server5",
> "addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0",
> "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0"
> },
> {
> "rank": 1,
> "name": "server3",
> "addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0",
> "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0"
> },
> {
> "rank": 2,
> "name": "server2",
> "addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0",
> "public_addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
> },
> {
> "rank": 3,
> "name": "server1",
> "addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0",
> "public_addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0"
> }
> ]
> },
> "feature_map": {
> "mon": {
> "group": {
> "features": "0x1ffddff8eea4fffb",
> "release": "luminous",
> "num": 1
> }
> },
> "client": {
> "group": {
> "features": "0x1

Re: [ceph-users] [CLUSTER STUCK] Luminous cluster stuck when adding monitor

2017-10-08 Thread Nico Schottelius

Not sure if I mentioned before: adding a new monitor also puts the whole
cluster into stuck state.

Some minutes ago I did:

root@server1:~# ceph mon add server2 2a0a:e5c0::92e2:baff:fe4e:6614
port defaulted to 6789; adding mon.server2 at 
[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0

And then started the daemon on server2:

ceph-mon -i server2 --pid-file  /var/lib/ceph/run/mon.server2.pid -c 
/etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -d 2>&1 | tee 
~/cephmonlog-2017-10-08-2

And now the cluster hangs (as in ceph -s does not return).

Looking at mon_status of server5, shows that server5 thinks it is time
for electing [0].

When stopping the monitor on server2 and trying to remove server2 again,
the removal command also gets stuck and never returns:

root@server1:~# ceph mon rm server2

As our cluster is now severely degraded, I was wondering if anyone has a
quick hint on how to get ceph -s back working and/or remove server2
and/or how to readd server1?

Best,

Nico


[0]

[10:50:38] server5:~# ceph daemon mon.server5 mon_status
{
"name": "server5",
"rank": 0,
"state": "electing",
"election_epoch": 6087,
"quorum": [],
"features": {
"required_con": "153140804152475648",
"required_mon": [
"kraken",
"luminous"
],
"quorum_con": "2305244844532236283",
"quorum_mon": [
"kraken",
"luminous"
]
},
"outside_quorum": [],
"extra_probe_peers": [
"[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
],
"sync_provider": [],
"monmap": {
"epoch": 11,
"fsid": "26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab",
"modified": "2017-10-08 10:43:49.667986",
"created": "2017-05-16 22:33:04.500528",
"features": {
"persistent": [
"kraken",
"luminous"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "server5",
"addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0",
"public_addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0"
},
{
"rank": 1,
"name": "server3",
"addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0",
"public_addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0"
},
{
"rank": 2,
"name": "server2",
"addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0",
"public_addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
},
{
"rank": 3,
"name": "server1",
"addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0",
"public_addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0"
}
]
},
"feature_map": {
"mon": {
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 1
}
},
"client": {
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 4
}
}
}
}
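
Regarding the stuck "ceph mon rm server2" above: if it never returns, the
usual fallback is to remove the monitor from the monmap offline on one of
the surviving mons. A rough sketch, assuming default data paths, and only
after stopping the mon that is edited:

# on server5, with mon.server5 stopped
ceph-mon -i server5 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm server2 /tmp/monmap
ceph-mon -i server5 --inject-monmap /tmp/monmap
# then start mon.server5 again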




Nico Schottelius <nico.schottel...@ungleich.ch> writes:

> Good evening Joao,
>
> we double checked our MTUs, they are all 9200 on the servers and 9212 on
> the switches. And we have no problems transferring big files in general
> (as opennebula copies around images for importing, we do this quite a
> lot).
>
> So if you could have a look, it would be much appreciated.
>
> If we should collect other logs, just let us know.
>
> Best,
>
> Nico
>
> Joao Eduardo Luis <j...@suse.de> writes:
>
>> On 10/04/2017 09:19 PM, Gregory Farnum wrote:
>>> Oh, hmm, you're right. I see synchronization starts but it seems to
>>> progress very slowly, and it certainly doesn't complete in that 2.5
>>> minute logging window. I don't see any clear reason why it's so
>>> slow; it might be more clear if you could provide l

Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-07 Thread Nico Schottelius

Good evening Joao,

we double checked our MTUs, they are all 9200 on the servers and 9212 on
the switches. And we have no problems transferring big files in general
(as opennebula copies around images for importing, we do this quite a
lot).
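
For the record, a quick way to double check this end to end is to send
non-fragmenting pings just below the interface MTU between the mon hosts.
A small sketch over IPv6; the payload size is an assumption for a 9200 byte
MTU (IPv6 + ICMPv6 headers take 48 bytes):

ping -6 -M do -c 3 -s 9152 server3
ping -6 -M do -c 3 -s 9152 server5
# on older iputils, use ping6 instead of ping -6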

So if you could have a look, it would be much appreciated.

If we should collect other logs, just let us know.

Best,

Nico

Joao Eduardo Luis <j...@suse.de> writes:

> On 10/04/2017 09:19 PM, Gregory Farnum wrote:
>> Oh, hmm, you're right. I see synchronization starts but it seems to
>> progress very slowly, and it certainly doesn't complete in that 2.5
>> minute logging window. I don't see any clear reason why it's so
>> slow; it might be more clear if you could provide logs of the other
>> logs at the same time (especially since you now say they are getting
>> stuck in the electing state during that period). Perhaps Kefu or
>> Joao will have some clearer idea what the problem is.
>> -Greg
>
> I haven't gone through logs yet (maybe Friday, it's late today and
> it's a holiday tomorrow), but not so long ago I seem to recall someone
> having a similar issue with the monitors that was solely related to a
> switch's MTU being too small.
>
> Maybe that could be the case? If not, I'll take a look at the logs as
> soon as possible.
>
>   -Joao
>
>>
>> On Wed, Oct 4, 2017 at 1:04 PM Nico Schottelius
>> <nico.schottel...@ungleich.ch <mailto:nico.schottel...@ungleich.ch>>
>> wrote:
>>
>>
>> Some more detail:
>>
>> when restarting the monitor on server1, it stays in synchronizing state
>> forever.
>>
>> However the other two monitors change into electing state.
>>
>> I have double checked that there are not (host) firewalls active and
>> that the times are within 1 second different of the hosts (they all have
>> ntpd running).
>>
>> We are running everything on IPv6, but this should not be a problem,
>> should it?
>>
>> Best,
>>
>> Nico
>>
>>
>> Nico Schottelius <nico.schottel...@ungleich.ch
>> <mailto:nico.schottel...@ungleich.ch>> writes:
>>
>>  > Hello Gregory,
>>  >
>>  > the logfile I produced has already debug mon = 20 set:
>>  >
>>  > [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
>>  > debug mon = 20
>>  >
>>  > It is clear that server1 is out of quorum, however how do we make it
>>  > being part of the quorum again?
>>  >
>>  > I expected that the quorum finding process is triggered automatically
>>  > after restarting the monitor, or is that incorrect?
>>  >
>>  > Best,
>>  >
>>  > Nico
>>  >
>>  >
>>  > Gregory Farnum <gfar...@redhat.com <mailto:gfar...@redhat.com>>
>> writes:
>>  >
>>  >> You'll need to change the config so that it's running "debug mon
>> = 20" for
>>  >> the log to be very useful here. It does say that it's dropping
>> client
>>  >> connections because it's been out of quorum for too long, which
>> is the
>>  >> correct behavior in general. I'd imagine that you've got clients
>> trying to
>>  >> connect to the new monitor instead of the ones already in the
>> quorum and
>>  >> not passing around correctly; this is all configurable.
>>  >>
>>  >> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
>>  >> nico.schottel...@ungleich.ch
>> <mailto:nico.schottel...@ungleich.ch>> wrote:
>>  >>
>>  >>>
>>  >>> Good morning,
>>  >>>
>>  >>> we have recently upgraded our kraken cluster to luminous and
>> since then
>>  >>> noticed an odd behaviour: we cannot add a monitor anymore.
>>  >>>
>>  >>> As soon as we start a new monitor (server2), ceph -s and ceph
>> -w start to
>>  >>> hang.
>>  >>>
>>  >>> The situation became worse, since one of our staff stopped an
>> existing
>>  >>> monitor (server1), as restarting that monitor results in the same
>>  >>> situation, ceph -s hangs until we stop the monitor again.
>>  >>>
>>  >>> We kept the monitor running for some minutes, but the situation

Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius

Some more detail:

when restarting the monitor on server1, it stays in synchronizing state
forever.

However the other two monitors change into electing state.

I have double checked that there are not (host) firewalls active and
that the times are within 1 second different of the hosts (they all have
ntpd running).

We are running everything on IPv6, but this should not be a problem,
should it?

Best,

Nico


Nico Schottelius <nico.schottel...@ungleich.ch> writes:

> Hello Gregory,
>
> the logfile I produced has already debug mon = 20 set:
>
> [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
> debug mon = 20
>
> It is clear that server1 is out of quorum, however how do we make it
> being part of the quorum again?
>
> I expected that the quorum finding process is triggered automatically
> after restarting the monitor, or is that incorrect?
>
> Best,
>
> Nico
>
>
> Gregory Farnum <gfar...@redhat.com> writes:
>
>> You'll need to change the config so that it's running "debug mon = 20" for
>> the log to be very useful here. It does say that it's dropping client
>> connections because it's been out of quorum for too long, which is the
>> correct behavior in general. I'd imagine that you've got clients trying to
>> connect to the new monitor instead of the ones already in the quorum and
>> not passing around correctly; this is all configurable.
>>
>> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
>> nico.schottel...@ungleich.ch> wrote:
>>
>>>
>>> Good morning,
>>>
>>> we have recently upgraded our kraken cluster to luminous and since then
>>> noticed an odd behaviour: we cannot add a monitor anymore.
>>>
>>> As soon as we start a new monitor (server2), ceph -s and ceph -w start to
>>> hang.
>>>
>>> The situation became worse, since one of our staff stopped an existing
>>> monitor (server1), as restarting that monitor results in the same
>>> situation, ceph -s hangs until we stop the monitor again.
>>>
>>> We kept the monitor running for some minutes, but the situation never
>>> cleares up.
>>>
>>> The network does not have any firewall in between the nodes and there
>>> are no host firewalls.
>>>
>>> I have attached the output of the monitor on server1, running in
>>> foreground using
>>>
>>> root@server1:~# ceph-mon -i server1 --pid-file
>>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
>>>
>>> Does anyone see any obvious problem in the attached log?
>>>
>>> Any input or hint would be appreciated!
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>>
>>> --
>>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius

Hello Gregory,

the logfile I produced has already debug mon = 20 set:

[21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
debug mon = 20
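
For completeness, the same levels can also be raised at runtime via the
admin socket, without touching ceph.conf (assuming the default admin socket
path):

ceph daemon mon.server1 config set debug_mon 20/20
ceph daemon mon.server1 config set debug_ms 1/1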

It is clear that server1 is out of quorum, however how do we make it
being part of the quorum again?

I expected that the quorum finding process is triggered automatically
after restarting the monitor, or is that incorrect?

Best,

Nico


Gregory Farnum <gfar...@redhat.com> writes:

> You'll need to change the config so that it's running "debug mon = 20" for
> the log to be very useful here. It does say that it's dropping client
> connections because it's been out of quorum for too long, which is the
> correct behavior in general. I'd imagine that you've got clients trying to
> connect to the new monitor instead of the ones already in the quorum and
> not passing around correctly; this is all configurable.
>
> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Good morning,
>>
>> we have recently upgraded our kraken cluster to luminous and since then
>> noticed an odd behaviour: we cannot add a monitor anymore.
>>
>> As soon as we start a new monitor (server2), ceph -s and ceph -w start to
>> hang.
>>
>> The situation became worse, since one of our staff stopped an existing
>> monitor (server1), as restarting that monitor results in the same
>> situation, ceph -s hangs until we stop the monitor again.
>>
>> We kept the monitor running for some minutes, but the situation never
>> cleares up.
>>
>> The network does not have any firewall in between the nodes and there
>> are no host firewalls.
>>
>> I have attached the output of the monitor on server1, running in
>> foreground using
>>
>> root@server1:~# ceph-mon -i server1 --pid-file
>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
>>
>> Does anyone see any obvious problem in the attached log?
>>
>> Any input or hint would be appreciated!
>>
>> Best,
>>
>> Nico
>>
>>
>>
>> --
>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius

Good morning,

we have recently upgraded our kraken cluster to luminous and since then
noticed an odd behaviour: we cannot add a monitor anymore.

As soon as we start a new monitor (server2), ceph -s and ceph -w start to hang.

The situation became worse, since one of our staff stopped an existing
monitor (server1), as restarting that monitor results in the same
situation, ceph -s hangs until we stop the monitor again.

We kept the monitor running for some minutes, but the situation never
cleares up.

The network does not have any firewall in between the nodes and there
are no host firewalls.

I have attached the output of the monitor on server1, running in
foreground using

root@server1:~# ceph-mon -i server1 --pid-file  
/var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph 
--setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog

Does anyone see any obvious problem in the attached log?

Any input or hint would be appreciated!

Best,

Nico



cephmonlog.bz2
Description: BZip2 compressed data


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-12 Thread Nico Schottelius

Well, we basically needed to fix it, that's why we did it :-)


Blair Bethwaite <blair.bethwa...@gmail.com> writes:

> Great to see this issue sorted.
>
> I have to say I am quite surprised anyone would implement the
> export/import workaround mentioned here without *first* racing to this
> ML or IRC and crying out for help. This is a valuable resource, made
> more so by people sharing issues.
>
> Cheers,
>
> On 12 September 2017 at 07:22, Jason Dillaman <jdill...@redhat.com> wrote:
>> Yes -- the upgrade documentation definitely needs to be updated to add
>> a pre-monitor upgrade step to verify your caps before proceeding -- I
>> will take care of that under this ticket [1]. I believe the OpenStack
>> documentation has been updated [2], but let me know if you find other
>> places.
>>
>> [1] http://tracker.ceph.com/issues/21353
>> [2] 
>> http://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication
>>
>> On Mon, Sep 11, 2017 at 5:16 PM, Nico Schottelius
>> <nico.schottel...@ungleich.ch> wrote:
>>>
>>> That indeed worked! Thanks a lot!
>>>
>>> The remaining question from my side: did we do anything wrong in the
>>> upgrade process and if not, should it be documented somewhere how to
>>> setup the permissions correctly on upgrade?
>>>
>>> Or should the documentation on the side of the cloud infrastructure
>>> software be updated?
>>>
>>>
>>>
>>> Jason Dillaman <jdill...@redhat.com> writes:
>>>
>>>> Since you have already upgraded to Luminous, the fastest and probably
>>>> easiest way to fix this is to run "ceph auth caps client.libvirt mon
>>>> 'profile rbd' osd 'profile rbd pool=one'" [1]. Luminous provides
>>>> simplified RBD caps via named profiles which ensure all the correct
>>>> permissions are enabled.
>>>>
>>>> [1] 
>>>> http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities
>>>
>>> --
>>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>
>>
>>
>> --
>> Jason
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-11 Thread Nico Schottelius

For opennebula this would be

http://docs.opennebula.org/5.4/deployment/open_cloud_storage_setup/ceph_ds.html

(added opennebula in CC)
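
For anyone else on OpenNebula hitting this, the check and fix boiled down
to something like the following; the client name and pool are the ones from
our setup, adjust them to yours:

ceph auth get client.libvirt    # inspect the current caps first
ceph auth caps client.libvirt mon 'profile rbd' osd 'profile rbd pool=one'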

Jason Dillaman <jdill...@redhat.com> writes:

> Yes -- the upgrade documentation definitely needs to be updated to add
> a pre-monitor upgrade step to verify your caps before proceeding -- I
> will take care of that under this ticket [1]. I believe the OpenStack
> documentation has been updated [2], but let me know if you find other
> places.
>
> [1] http://tracker.ceph.com/issues/21353
> [2] 
> http://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication
>
> On Mon, Sep 11, 2017 at 5:16 PM, Nico Schottelius
> <nico.schottel...@ungleich.ch> wrote:
>>
>> That indeed worked! Thanks a lot!
>>
>> The remaining question from my side: did we do anything wrong in the
>> upgrade process and if not, should it be documented somewhere how to
>> setup the permissions correctly on upgrade?
>>
>> Or should the documentation on the side of the cloud infrastructure
>> software be updated?
>>
>>
>>
>> Jason Dillaman <jdill...@redhat.com> writes:
>>
>>> Since you have already upgraded to Luminous, the fastest and probably
>>> easiest way to fix this is to run "ceph auth caps client.libvirt mon
>>> 'profile rbd' osd 'profile rbd pool=one'" [1]. Luminous provides
>>> simplified RBD caps via named profiles which ensure all the correct
>>> permissions are enabled.
>>>
>>> [1] 
>>> http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities
>>
>> --
>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-11 Thread Nico Schottelius

That indeed worked! Thanks a lot!

The remaining question from my side: did we do anything wrong in the
upgrade process and, if not, should it be documented somewhere how to
set up the permissions correctly on upgrade?

Or should the documentation on the side of the cloud infrastructure
software be updated?



Jason Dillaman  writes:

> Since you have already upgraded to Luminous, the fastest and probably
> easiest way to fix this is to run "ceph auth caps client.libvirt mon
> 'profile rbd' osd 'profile rbd pool=one'" [1]. Luminous provides
> simplified RBD caps via named profiles which ensure all the correct
> permissions are enabled.
>
> [1] 
> http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-11 Thread Nico Schottelius

The only error message I see is from dmesg when trying to access the
XFS filesystem (see attached image).

Let me know if you need any more logs - luckily I can spin up this VM in
a broken state as often as you want to :-)
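
In case it helps with pinpointing the corrupt sector: a rough sketch of
mapping a sector number from the guest dmesg to the underlying RADOS
object (the sector value below is made up, and the default 4 MiB object
size is assumed):

sector=123456789                       # hypothetical value from the guest dmesg
prefix=$(rbd info one/one-29-17031-0 | awk '/block_name_prefix/ {print $2}')
obj_index=$(( sector * 512 / (4 * 1024 * 1024) ))
printf -v obj '%s.%016x' "$prefix" "$obj_index"
rados -p one stat "$obj"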


Jason Dillaman <jdill...@redhat.com> writes:

> ... also, do have any logs from the OS associated w/ this log file? I
> am specifically looking for anything to indicate which sector was
> considered corrupt.
>
> On Mon, Sep 11, 2017 at 4:41 PM, Jason Dillaman <jdill...@redhat.com> wrote:
>> Thanks -- I'll take a look to see if anything else stands out. That
>> "Exec format error" isn't actually an issue -- but now that I know
>> about it, we can prevent it from happening in the future [1]
>>
>> [1] http://tracker.ceph.com/issues/21360
>>
>> On Mon, Sep 11, 2017 at 4:32 PM, Nico Schottelius
>> <nico.schottel...@ungleich.ch> wrote:
>>>
>>>
>>> Thanks a lot for the great ceph.conf pointer, Mykola!
>>>
>>> I found something interesting:
>>>
>>> 2017-09-11 22:26:23.418796 7efd7d479700 10 client.1039597.objecter 
>>> ms_dispatch 0x55b55ab8f950 osd_op_reply(4 rbd_header.df7343d1b58ba [call] 
>>> v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8
>>> 2017-09-11 22:26:23.439501 7efd7dc7a700 10 client.1039597.objecter
>>> ms_dispatch 0x55b55ab8f950 osd_op_reply(14 rbd_header.2b0c02ae8944a
>>> [call] v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8
>>>
>>> Not sure if those are the ones causing the problem, but at least some
>>> error.
>>>
>>> I have attached the bzip'd log file for reference (1.7MiB, hope it makes
>>> it to the list) and wonder if anyone sees the real reason for the I/O 
>>> errors?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Mykola Golub <mgo...@mirantis.com> writes:
>>>
>>>> On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote:
>>>>>
>>>>> Just tried and there is not much more log in ceph -w (see below) neither
>>>>> from the qemu process.
>>>>>
>>>>> [15:52:43] server4:~$  /usr/bin/qemu-system-x86_64 -name one-17031 -S
>>>>> -machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off
>>>>> -smp 6,sockets=6,cores=1,threads=1 -uuid
>>>>> 79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults
>>>>> -chardev
>>>>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait
>>>>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
>>>>> -no-shutdown -boot strict=on -device
>>>>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
>>>>> file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none'
>>>>>  -device 
>>>>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
>>>>>  -drive 
>>>>> file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw
>>>>>  -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -vnc 
>>>>> [::]:21131 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
>>>>> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on 2>&1 
>>>>> | tee kvmlogwithdebug
>>>>>
>>>>> -> no output
>>>>
>>>> Try to find where the qemu process writes the ceph log, e.g. with the
>>>> help of lsof utility. Or add something like below
>>>>
>>>>  log file = /tmp/ceph.$name.$pid.log
>>>>
>>>> to ceph.conf before starting qemu and look for /tmp/ceph.*.log
>>>
>>>
>>> --
>>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>>
>>
>>
>>
>> --
>> Jason


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-11 Thread Nico Schottelius


Thanks a lot for the great ceph.conf pointer, Mykola!

I found something interesting:

2017-09-11 22:26:23.418796 7efd7d479700 10 client.1039597.objecter ms_dispatch 
0x55b55ab8f950 osd_op_reply(4 rbd_header.df7343d1b58ba [call] v0'0 uv0 ondisk = 
-8 ((8) Exec format error)) v8
2017-09-11 22:26:23.439501 7efd7dc7a700 10 client.1039597.objecter
ms_dispatch 0x55b55ab8f950 osd_op_reply(14 rbd_header.2b0c02ae8944a
[call] v0'0 uv0 ondisk = -8 ((8) Exec format error)) v8

Not sure if those are the ones causing the problem, but at least some
error.

I have uploaded the log at
http://www.nico.schottelius.org/ceph.client.libvirt.41670.log.bz2

I wonder if anyone sees the real reason for the I/O errors in the log?
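
For reference, the client-side settings used to produce this log were along
these lines in ceph.conf on the VM host (levels and log path as suggested
earlier in the thread):

[client]
    debug rbd = 20
    debug objecter = 20
    log file = /tmp/ceph.$name.$pid.log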

Best,

Nico

> Mykola Golub <mgo...@mirantis.com> writes:
>
>> On Sun, Sep 10, 2017 at 03:56:21PM +0200, Nico Schottelius wrote:
>>>
>>> Just tried and there is not much more log in ceph -w (see below) neither
>>> from the qemu process.
>>>
>>> [15:52:43] server4:~$  /usr/bin/qemu-system-x86_64 -name one-17031 -S
>>> -machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off
>>> -smp 6,sockets=6,cores=1,threads=1 -uuid
>>> 79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults
>>> -chardev
>>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait
>>> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
>>> -no-shutdown -boot strict=on -device
>>> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
>>> file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none'
>>>  -device 
>>> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
>>>  -drive 
>>> file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw
>>>  -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -vnc 
>>> [::]:21131 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
>>> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on 2>&1 | 
>>> tee kvmlogwithdebug
>>>
>>> -> no output
>>
>> Try to find where the qemu process writes the ceph log, e.g. with the
>> help of lsof utility. Or add something like below
>>
>>  log file = /tmp/ceph.$name.$pid.log
>>
>> to ceph.conf before starting qemu and look for /tmp/ceph.*.log


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-11 Thread Nico Schottelius

Sarunas,

may I ask when this happened?

And did you move OSDs or mons after that export/import procedure?

I really wonder what the reason for this behaviour is, and whether we are
likely to experience it again.

Best,

Nico

Sarunas Burdulis <saru...@math.dartmouth.edu> writes:

> On 2017-09-10 08:23, Nico Schottelius wrote:
>>
>> Good morning,
>>
>> yesterday we had an unpleasant surprise that I would like to discuss:
>>
>> Many (not all!) of our VMs were suddenly
>> dying (qemu process exiting) and when trying to restart them, inside the
>> qemu process we saw i/o errors on the disks and the OS was not able to
>> start (i.e. stopped in initramfs).
>
> We experienced the same after upgrade from kraken to luminous, i.e. all
> VM with their system images in Ceph pool failed to boot due to
> filesystem errors, ending in initramfs. fsck wasn't able to fix them.
>
>> When we exported the image from rbd and loop mounted it, there were
>> however no I/O errors and the filesystem could be cleanly mounted [-1].
>
> Same here.
>
> We ended up rbd-exporting images from Ceph rbd pool to local filesystem
> and re-exporting them back. That "fixed" them without the need for fsck.


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-11 Thread Nico Schottelius

Good morning Lionel,

it's great to hear that it's not only us being affected!

I am not sure what you refer to by "glance" images, but what we see is
that we can spawn a new VM based on an existing image and that one runs.

Can I invite you (and anyone else who has problems w/ Luminous upgrade)
to join our chat at https://brandnewchat.ungleich.ch/ so that we can
discuss online the real world problems?

For us it is currently very unclear how to proceed, whether it is even safe
to rejoin the host into the cluster, or whether a downgrade would make
sense.

Best,

Nico

p.s.: This cluster was installed with kraken, so no old jewel clients or
osds have existed at all.

Beard Lionel (BOSTON-STORAGE) <lbe...@cls.fr> writes:

> Hi,
>
> We also have the same issue with Openstack instances (QEMU/libvirt) after 
> upgrading from kraken to luminous, and just after starting osd migration from 
> btrfs to bluestore.
> We were able to restart failed VMs by mounting all disks from a linux box 
> with rbd map, and run fsck on them.
> QEMU hosts are running Ubuntu with kernel > 4.4.
> We have noticed that one of our QEMU hosts was still running jewel ceph 
> client (error during installation...) , and issue doesn't happen on this one.
>
> Don't you have issues with some glance images?
> Because we do (unable to spawn an instance from them), and it was fixed by 
> following this ticket: http://tracker.ceph.com/issues/19413
>
> Regards,
> Lionel
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Nico Schottelius
>> Sent: dimanche 10 septembre 2017 14:23
>> To: ceph-users <ceph-us...@ceph.com>
>> Cc: kamila.souck...@ungleich.ch
>> Subject: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd
>> change]
>>
>>
>> Good morning,
>>
>> yesterday we had an unpleasant surprise that I would like to discuss:
>>
>> Many (not all!) of our VMs were suddenly dying (qemu process exiting) and
>> when trying to restart them, inside the qemu process we saw i/o errors on
>> the disks and the OS was not able to start (i.e. stopped in initramfs).
>>
>> When we exported the image from rbd and loop mounted it, there were
>> however no I/O errors and the filesystem could be cleanly mounted [-1].
>>
>> We are running Devuan with kernel 3.16.0-4-amd64 and saw that there are
>> some problems reported with kernels < 3.16.39 and thus we upgraded one
>> host that serves as VM host + runs ceph osds to Devuan ascii using 4.9.0-3-
>> amd64.
>>
>> Trying to start the VM again on this host however resulted in the same I/O
>> problem.
>>
>> We then did the "stupid" approach of exporting an image and importing it
>> again as the same name [0]. Surprisingly, this solved our problem
>> reproducible for all affected VMs and allowed us to go back online.
>>
>> We intentionally left one broken VM in our system (a test VM) so that we
>> have the chance of debugging further what happened and how we can
>> prevent it from happening again.
>>
>> As you might have guessed, there have been some event prior this:
>>
>> - Some weeks before we upgraded our cluster from kraken to luminous (in
>>   the right order of mon's first, adding mgrs)
>>
>> - About a week ago we added the first hdd to our cluster and modified the
>>   crushmap so that it the "one" pool (from opennebula) still selects
>>   only ssds
>>
>> - Some hours before we took out one of the 5 hosts of the ceph cluster,
>>   as we intended to replace the filesystem based OSDs with bluestore
>>   (roughly 3 hours prior to the event)
>>
>> - Short time before the event we readded an osd, but did not "up" it
>>
>> To our understanding, none of these actions should have triggered this
>> behaviour, however we are aware that with the upgrade to luminous also
>> the client libraries were updated and not all qemu processes were restarted.
>> [1]
>>
>> After this long story, I was wondering about the following things:
>>
>> - Why did this happen at all?
>>   And what is different after we reimported the image?
>>   Can it be related to disconnected the image from the parent
>>   (i.e. opennebula creates clones prior to starting a VM)
>>
>> - We have one broken VM left - is there a way to get it back running
>>   without doing the export/import dance?
>>
>> - How / or is http://tracker.ceph.com/issues/18807 related to our issue?
>>   How is the kernel involved into running VMs that use librbd?
>>   rbd showmapped does not show any mappe

Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-10 Thread Nico Schottelius

Just tried it, and there is not much more log in ceph -w (see below), nor
from the qemu process.

[15:52:43] server4:~$  /usr/bin/qemu-system-x86_64 -name one-17031 -S
-machine pc-i440fx-2.1,accel=kvm,usb=off -m 8192 -realtime mlock=off
-smp 6,sockets=6,cores=1,threads=1 -uuid
79845fca-9b26-4072-bcb3-7f5206c2a531 -no-user-config -nodefaults
-chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-17031.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file='rbd:one/one-29-17031-0:id=libvirt:key=DELETEME:auth_supported=cephx\;none:mon_host=server1\:6789\;server3\:6789\;server5\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=none'
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/var/lib/one//datastores/100/17031/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw
 -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -vnc 
[::]:21131 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on 2>&1 | tee 
kvmlogwithdebug

-> no output

The command line of qemu is copied out of what opennebula usually
spawns, minus the networking part.


[15:41:54] server4:~# ceph -w
2017-09-10 15:44:32.873281 7f59f17fa700 10 client.?.objecter ms_handle_connect 
0x7f59f4150e90
2017-09-10 15:44:32.873315 7f59f17fa700 10 client.?.objecter resend_mon_ops
2017-09-10 15:44:32.873327 7f59f17fa700 10 client.?.objecter ms_handle_connect 
0x7f59f41544d0
2017-09-10 15:44:32.873329 7f59f17fa700 10 client.?.objecter resend_mon_ops
2017-09-10 15:44:32.876248 7f59f9a63700 10 client.1021613.objecter 
_maybe_request_map subscribing (onetime) to next osd map
2017-09-10 15:44:32.876710 7f59f17fa700 10 client.1021613.objecter ms_dispatch 
0x7f59f4000fe0 osd_map(9059..9059 src has 8530..9059) v3
2017-09-10 15:44:32.876722 7f59f17fa700  3 client.1021613.objecter 
handle_osd_map got epochs [9059,9059] > 0
2017-09-10 15:44:32.876726 7f59f17fa700  3 client.1021613.objecter 
handle_osd_map decoding full epoch 9059
2017-09-10 15:44:32.877099 7f59f17fa700 20 client.1021613.objecter dump_active 
.. 0 homeless
2017-09-10 15:44:32.877423 7f59f17fa700 10 client.1021613.objecter 
ms_handle_connect 0x7f59dc00c9c0
  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_OK

  services:
mon: 3 daemons, quorum server5,server3,server1
mgr: 1(active), standbys: 2, 0
osd: 50 osds: 49 up, 49 in

  data:
pools:   2 pools, 1088 pgs
objects: 500k objects, 1962 GB
usage:   5914 GB used, 9757 GB / 15672 GB avail
pgs: 1088 active+clean

  io:
client:   18822 B/s rd, 799 kB/s wr, 6 op/s rd, 52 op/s wr


2017-09-10 15:44:37.876324 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:44:42.876437 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:44:45.223970 7f59f17fa700 10 client.1021613.objecter ms_dispatch 
0x7f59f4000fe0 log(2 entries from seq 215046 at 2017-09-10 15:44:45.164162) v1
2017-09-10 15:44:47.876548 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:44:52.876668 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:44:57.876770 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:02.876888 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:07.877001 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:12.877120 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:17.877229 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:22.877349 7f59f1ffb700 10 client.1021613.objecter tick
2017-09-10 15:45:27.877455 7f59f1ffb700 10 client.1021613.objecter tick

Jason Dillaman <jdill...@redhat.com> writes:

> Sorry -- meant VM. Yes, librbd uses ceph.conf for configuration settings.
>
> On Sun, Sep 10, 2017 at 9:22 AM, Nico Schottelius
> <nico.schottel...@ungleich.ch> wrote:
>>
>> Hello Jason,
>>
>> I think there is a slight misunderstanding:
>> There is only one *VM*, not one OSD left that we did not start.
>>
>> Or does librbd also read ceph.conf and will that cause qemu to output
>> debug messages?
>>
>> Best,
>>
>> Nico
>>
>> Jason Dillaman <jdill...@redhat.com> writes:
>>
>>> I presume QEMU is using librbd instead of a mapped krbd block device,
>>> correct? If that is the case, can you add "debug-rbd=20" and "debug
>>> objecter=20" to your ceph.conf and boot up your last remaining broken
>>> OSD?
>>>
>>> On Sun, Sep 10, 2017 at 8:23 AM, Nico Schottelius
>>> <nico.schottel...@ungleich.ch> wrote:
>>>>
>>>> Good morning,
>>>>
>>>> yesterday we had an unpleasant surprise that I would like to d

Re: [ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-10 Thread Nico Schottelius

Hello Jason,

I think there is a slight misunderstanding:
There is only one *VM* left that we did not start, not an OSD.

Or does librbd also read ceph.conf and will that cause qemu to output
debug messages?

Best,

Nico

Jason Dillaman <jdill...@redhat.com> writes:

> I presume QEMU is using librbd instead of a mapped krbd block device,
> correct? If that is the case, can you add "debug-rbd=20" and "debug
> objecter=20" to your ceph.conf and boot up your last remaining broken
> OSD?
>
> On Sun, Sep 10, 2017 at 8:23 AM, Nico Schottelius
> <nico.schottel...@ungleich.ch> wrote:
>>
>> Good morning,
>>
>> yesterday we had an unpleasant surprise that I would like to discuss:
>>
>> Many (not all!) of our VMs were suddenly
>> dying (qemu process exiting) and when trying to restart them, inside the
>> qemu process we saw i/o errors on the disks and the OS was not able to
>> start (i.e. stopped in initramfs).
>>
>> When we exported the image from rbd and loop mounted it, there were
>> however no I/O errors and the filesystem could be cleanly mounted [-1].
>>
>> We are running Devuan with kernel 3.16.0-4-amd64 and saw that there are
>> some problems reported with kernels < 3.16.39 and thus we upgraded one
>> host that serves as VM host + runs ceph osds to Devuan ascii using
>> 4.9.0-3-amd64.
>>
>> Trying to start the VM again on this host however resulted in the same
>> I/O problem.
>>
>> We then did the "stupid" approach of exporting an image and importing it
>> again as the same name [0]. Surprisingly, this solved our problem
>> reproducible for all affected VMs and allowed us to go back online.
>>
>> We intentionally left one broken VM in our system (a test VM) so that we
>> have the chance of debugging further what happened and how we can
>> prevent it from happening again.
>>
>> As you might have guessed, there have been some event prior this:
>>
>> - Some weeks before we upgraded our cluster from kraken to luminous (in
>>   the right order of mon's first, adding mgrs)
>>
>> - About a week ago we added the first hdd to our cluster and modified the
>>   crushmap so that it the "one" pool (from opennebula) still selects
>>   only ssds
>>
>> - Some hours before we took out one of the 5 hosts of the ceph cluster,
>>   as we intended to replace the filesystem based OSDs with bluestore
>>   (roughly 3 hours prior to the event)
>>
>> - Short time before the event we readded an osd, but did not "up" it
>>
>> To our understanding, none of these actions should have triggered this
>> behaviour, however we are aware that with the upgrade to luminous also
>> the client libraries were updated and not all qemu processes were
>> restarted. [1]
>>
>> After this long story, I was wondering about the following things:
>>
>> - Why did this happen at all?
>>   And what is different after we reimported the image?
>>   Can it be related to disconnected the image from the parent
>>   (i.e. opennebula creates clones prior to starting a VM)
>>
>> - We have one broken VM left - is there a way to get it back running
>>   without doing the export/import dance?
>>
>> - How / or is http://tracker.ceph.com/issues/18807 related to our issue?
>>   How is the kernel involved into running VMs that use librbd?
>>   rbd showmapped does not show any mapped VMs, as qemu connects directly
>>   to ceph.
>>
>>   We tried upgrading one host to Devuan ascii which uses 4.9.0-3-amd64,
>>   but did not fix our problem.
>>
>> We would appreciate any pointer!
>>
>> Best,
>>
>> Nico
>>
>>
>> [-1]
>> losetup -P /dev/loop0 /var/tmp/one-staging/monitoring1-disk.img
>> mkdir /tmp/monitoring1-mnt
>> mount /dev/loop0p1 /tmp/monitoring1-mnt/
>>
>>
>> [0]
>>
>> rbd export one/$img /var/tmp/one-staging/$img
>> rbd rm one/$img
>> rbd import /var/tmp/one-staging/$img one/$img
>> rm /var/tmp/one-staging/$img
>>
>> [1]
>> [14:05:34] server5:~# ceph features
>> {
>> "mon": {
>> "group": {
>> "features": "0x1ffddff8eea4fffb",
>> "release": "luminous",
>> "num": 3
>> }
>> },
>> "osd": {
>> "group": {
>> "features": "0x1ffddff8eea4fffb",
>> "release": "luminous",
>

[ceph-users] RBD I/O errors with QEMU [luminous upgrade/osd change]

2017-09-10 Thread Nico Schottelius

Good morning,

yesterday we had an unpleasant surprise that I would like to discuss:

Many (not all!) of our VMs were suddenly
dying (qemu process exiting) and when trying to restart them, inside the
qemu process we saw i/o errors on the disks and the OS was not able to
start (i.e. stopped in initramfs).

When we exported the image from rbd and loop mounted it, there were
however no I/O errors and the filesystem could be cleanly mounted [-1].

We are running Devuan with kernel 3.16.0-4-amd64 and saw that there are
some problems reported with kernels < 3.16.39 and thus we upgraded one
host that serves as VM host + runs ceph osds to Devuan ascii using
4.9.0-3-amd64.

Trying to start the VM again on this host however resulted in the same
I/O problem.

We then did the "stupid" approach of exporting an image and importing it
again as the same name [0]. Surprisingly, this solved our problem
reproducible for all affected VMs and allowed us to go back online.

We intentionally left one broken VM in our system (a test VM) so that we
have the chance of debugging further what happened and how we can
prevent it from happening again.

As you might have guessed, there have been some event prior this:

- Some weeks before we upgraded our cluster from kraken to luminous (in
  the right order of mon's first, adding mgrs)

- About a week ago we added the first hdd to our cluster and modified the
  crushmap so that the "one" pool (from opennebula) still selects
  only ssds

- Some hours before we took out one of the 5 hosts of the ceph cluster,
  as we intended to replace the filesystem based OSDs with bluestore
  (roughly 3 hours prior to the event)

- Short time before the event we readded an osd, but did not "up" it

To our understanding, none of these actions should have triggered this
behaviour, however we are aware that with the upgrade to luminous also
the client libraries were updated and not all qemu processes were
restarted. [1]

After this long story, I was wondering about the following things:

- Why did this happen at all?
  And what is different after we reimported the image?
  Can it be related to disconnecting the image from the parent
  (i.e. opennebula creates clones prior to starting a VM)

- We have one broken VM left - is there a way to get it back running
  without doing the export/import dance?

- How / or is http://tracker.ceph.com/issues/18807 related to our issue?
  How is the kernel involved into running VMs that use librbd?
  rbd showmapped does not show any mapped VMs, as qemu connects directly
  to ceph.

  We tried upgrading one host to Devuan ascii which uses 4.9.0-3-amd64,
  but did not fix our problem.

We would appreciate any pointer!

Best,

Nico


[-1]
losetup -P /dev/loop0 /var/tmp/one-staging/monitoring1-disk.img
mkdir /tmp/monitoring1-mnt
mount /dev/loop0p1 /tmp/monitoring1-mnt/


[0]

rbd export one/$img /var/tmp/one-staging/$img
rbd rm one/$img
rbd import /var/tmp/one-staging/$img one/$img
rm /var/tmp/one-staging/$img

[1]
[14:05:34] server5:~# ceph features
{
"mon": {
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 3
}
},
"osd": {
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 49
}
},
"client": {
"group": {
"features": "0xffddff8ee84fffb",
"release": "kraken",
"num": 1
},
"group": {
"features": "0xffddff8eea4fffb",
"release": "luminous",
"num": 4
},
"group": {
"features": "0x1ffddff8eea4fffb",
"release": "luminous",
"num": 61
}
}
}


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Nico Schottelius
Lionel, Christian,

we have exactly the same trouble as Christian,
namely

Christian Eichelmann [Fri, Jan 09, 2015 at 10:43:20AM +0100]:
 We still don't know what caused this specific error...

and

 ...there is currently no way to make ceph forget about the data of this pg 
 and create it as an empty one. So the only way
 to make this pool usable again is to loose all your data in there. 

I wonder what is the position of ceph developers regarding
dropping (emptying) specific pgs?
Is that a use case that was never thought of or tested?

For us it is essential to be able to keep the pool/cluster
running even in case we have lost pgs.

Even though I do not like the fact that we lost a pg for
an unknown reason, I would prefer ceph to handle that case and recover to
the best possible situation.

Namely I wonder if we can integrate a tool that shows 
which (parts of) rbd images would be affected by dropping
a pg. That would give us the chance to selectively restore
VMs in case this happens again.
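
As a rough sketch of what such a tool could already do with existing
commands, listing an image's data objects together with the pg each one
maps to (pool and image name below are placeholders):

pool=one; image=some-vm-disk           # placeholders, adjust to your setup
prefix=$(rbd info "$pool/$image" | awk '/block_name_prefix/ {print $2}')
# note: rados ls walks the whole pool, so this is slow on big pools
rados -p "$pool" ls | grep "^$prefix" | while read -r obj; do
    ceph osd map "$pool" "$obj"        # prints the pg and acting set for each object
done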

Cheers,

Nico

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Nico Schottelius
Good morning Jiri,

sure, let me catch up on this:

- Kernel 3.16
- ceph: 0.80.7
- fs: xfs
- os: debian (backports) (1x)/ubuntu (2x)

Cheers,

Nico

Jiri Kanicky [Fri, Jan 09, 2015 at 10:44:33AM +1100]:
 Hi Nico.
 
 If you are experiencing such issues it would be good if you provide more info 
 about your deployment: ceph version, kernel versions, OS, filesystem 
 btrfs/xfs.
 
 Thx Jiri
 
 - Reply message -
 From: Nico Schottelius nico-eph-us...@schottelius.org
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = 
 Cluster unusable]
 Date: Wed, Dec 31, 2014 02:36
 
 Good evening,
 
 we also tried to rescue data *from* our old / broken pool by map'ing the
 rbd devices, mounting them on a host and rsync'ing away as much as
 possible.
 
 However, after some time rsync got completly stuck and eventually the
 host which mounted the rbd mapped devices decided to kernel panic at
 which time we decided to drop the pool and go with a backup.
 
 This story and the one of Christian makes me wonder:
 
 Is anyone using ceph as a backend for qemu VM images in production?
 
 And:
 
 Has anyone on the list been able to recover from a pg incomplete /
 stuck situation like ours?
 
 Reading about the issues on the list here gives me the impression that
 ceph as a software is stuck/incomplete and has not yet become ready
 clean for production (sorry for the word joke).
 
 Cheers,
 
 Nico
 
 Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
  Hi Nico and all others who answered,
  
  After some more trying to somehow get the pgs in a working state (I've
  tried force_create_pg, which was putting then in creating state. But
  that was obviously not true, since after rebooting one of the containing
  osd's it went back to incomplete), I decided to save what can be saved.
  
  I've created a new pool, created a new image there, mapped the old image
  from the old pool and the new image from the new pool to a machine, to
  copy data on posix level.
  
  Unfortunately, formatting the image from the new pool hangs after some
  time. So it seems that the new pool is suffering from the same problem
  as the old pool. Which is totaly not understandable for me.
  
  Right now, it seems like Ceph is giving me no options to either save
  some of the still intact rbd volumes, or to create a new pool along the
  old one to at least enable our clients to send data to ceph again.
  
  To tell the truth, I guess that will result in the end of our ceph
  project (running for already 9 Monthes).
  
  Regards,
  Christian
  
  Am 29.12.2014 15:59, schrieb Nico Schottelius:
   Hey Christian,
   
   Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
   [incomplete PG / RBD hanging, osd lost also not helping]
   
   that is very interesting to hear, because we had a similar situation
   with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
   directories to allow OSDs to start after the disk filled up completly.
   
   So I am sorry not to being able to give you a good hint, but I am very
   interested in seeing your problem solved, as it is a show stopper for
   us, too. (*)
   
   Cheers,
   
   Nico
   
   (*) We migrated from sheepdog to gluster to ceph and so far sheepdog
   seems to run much smoother. The first one is however not supported
   by opennebula directly, the second one not flexible enough to host
   our heterogeneous infrastructure (mixed disk sizes/amounts) - so we 
   are using ceph at the moment.
   
  
  
  -- 
  Christian Eichelmann
  Systemadministrator
  
  1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
  Brauerstraße 48 · DE-76135 Karlsruhe
  Telefon: +49 721 91374-8026
  christian.eichelm...@1und1.de
  
  Amtsgericht Montabaur / HRB 6484
  Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
  Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
  Aufsichtsratsvorsitzender: Michael Scheeren
 
 -- 
 New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hanging VMs with Qemu + RBD

2015-01-07 Thread Nico Schottelius
Hello Achim,

good to hear someone else running this setup. We have changed the number
of backfills using

ceph tell osd.\* injectargs '--osd-max-backfills 1'

and it seems to mostly take care of the issues we saw when rebalancing.
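
In the same spirit, recovery itself can be throttled as well; a small
sketch, with option names as of firefly and values that are just a
conservative starting point:

ceph tell osd.\* injectargs '--osd-recovery-max-active 1'
ceph tell osd.\* injectargs '--osd-recovery-op-priority 1'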

One unsolved problem we have is machines kernel panicking when i/o is
slow. We usually see a kernel panic in the sym53c8xx driver, especially for
those VMs with high i/o rates. We tried upgrading the kernel in the VM
(Debian stable 3.2.0 -> Debian backports 3.16.0), but now just get a
different kernel panic in the same driver.

Have you had the same problem, and if so, how did you get it fixed?

Cheers,

Nico

Achim Ledermüller [Wed, Jan 07, 2015 at 05:42:38PM +0100]:
 Hi,
 
 We have the same setup including OpenNebula 4.10.1. We had some
 backfilling due to node failures and node expansion. If we throttle
 osd_max_backfills there is not a problem at all. If the value for
 backfilling jobs is too high, we can see delayed reactions within the
 shell, eg. `ls -lh` needs 2 seconds.
 
 Kind regards,
 Achim
 
 -- 
 Achim Ledermüller, M. Sc.
 Systems Engineer
 
 NETWAYS Managed Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
 Tel: +49 911 92885-0 | Fax: +49 911 92885-77
 GF: Julian Hein, Bernd Erk | AG Nuernberg HRB25207
 http://www.netways.de | achim.ledermuel...@netways.de
 
 ** OSDC 2015 - April - osdc.de **
 ** Puppet Camp Berlin 2015 - April - netways.de/puppetcamp **
 ** OSBConf 2015 - September – osbconf.org **
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-07 Thread Nico Schottelius
Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the
rbd devices, mounting them on a host and rsync'ing away as much as
possible.

However, after some time rsync got completely stuck, and eventually the
host which had mounted the rbd-mapped devices decided to kernel panic, at
which point we decided to drop the pool and go with a backup.

This story and the one of Christian makes me wonder:

Is anyone using ceph as a backend for qemu VM images in production?

And:

Has anyone on the list been able to recover from a pg incomplete /
stuck situation like ours?

Reading about the issues on the list here gives me the impression that
ceph as a software is stuck/incomplete and has not yet become ready
clean for production (sorry for the word joke).

Cheers,

Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
 Hi Nico and all others who answered,
 
 After some more trying to somehow get the pgs in a working state (I've
 tried force_create_pg, which was putting then in creating state. But
 that was obviously not true, since after rebooting one of the containing
 osd's it went back to incomplete), I decided to save what can be saved.
 
 I've created a new pool, created a new image there, mapped the old image
 from the old pool and the new image from the new pool to a machine, to
 copy data on posix level.
 
 Unfortunately, formatting the image from the new pool hangs after some
 time. So it seems that the new pool is suffering from the same problem
 as the old pool. Which is totaly not understandable for me.
 
 Right now, it seems like Ceph is giving me no options to either save
 some of the still intact rbd volumes, or to create a new pool along the
 old one to at least enable our clients to send data to ceph again.
 
 To tell the truth, I guess that will result in the end of our ceph
 project (running for already 9 Monthes).
 
 Regards,
 Christian
 
 Am 29.12.2014 15:59, schrieb Nico Schottelius:
  Hey Christian,
  
  Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
  [incomplete PG / RBD hanging, osd lost also not helping]
  
  that is very interesting to hear, because we had a similar situation
  with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
  directories to allow OSDs to start after the disk filled up completly.
  
  So I am sorry not to being able to give you a good hint, but I am very
  interested in seeing your problem solved, as it is a show stopper for
  us, too. (*)
  
  Cheers,
  
  Nico
  
  (*) We migrated from sheepdog to gluster to ceph and so far sheepdog
  seems to run much smoother. The first one is however not supported
  by opennebula directly, the second one not flexible enough to host
  our heterogeneous infrastructure (mixed disk sizes/amounts) - so we 
  are using ceph at the moment.
  
 
 
 -- 
 Christian Eichelmann
 Systemadministrator
 
 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
 Brauerstraße 48 · DE-76135 Karlsruhe
 Telefon: +49 721 91374-8026
 christian.eichelm...@1und1.de
 
 Amtsgericht Montabaur / HRB 6484
 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
 Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
 Aufsichtsratsvorsitzender: Michael Scheeren

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Weights: Hosts vs. OSDs

2014-12-30 Thread Nico Schottelius
Good evening,

for some time we have had the problem that ceph stores too much data on
a host with small disks. Originally we used weight 1 = 1 TB, but
we reduced the weight for this particular host further to keep it
alive somehow.

Our setup currently consists of 3 hosts:

wein: 6x 136G (fast disks)
kaffee: 1x 5.5T (slow disk)
tee: 1x 5.5T (slow disk)

We originally started with 6 osds on wein with a weight of 0.13, but
had to reduce it to 0.05, because the disks were running full.
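
For reference, lowering the weights was done with something along these
lines (a from-memory sketch, osd ids as in the tree below):

    for i in 0 3 4 5 6 7; do
        ceph osd crush reweight osd.$i 0.05   # crush weight ("TB"), not the 0-1 reweight
    done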

The current tree looks as follows:

root@wein:~# ceph osd tree
# id    weight  type name       up/down reweight
-1  2.3 root default
-2  0.2999  host wein
0   0.04999 osd.0   up  1   
3   0.04999 osd.3   up  1   
4   0.04999 osd.4   up  1   
5   0.04999 osd.5   up  1   
6   0.04999 osd.6   up  1   
7   0.04999 osd.7   up  1   
-3  1   host tee
1   5.5 osd.1   up  1   
-4  1   host kaffee
2   5.5 osd.2   up  1


The hosts have the following disk usage:

root@wein:~# df -h | grep ceph
/dev/sdc1   136G   58G   79G  43% /var/lib/ceph/osd/ceph-0
/dev/sdd1   136G   54G   83G  40% /var/lib/ceph/osd/ceph-3
/dev/sde1   136G   31G  105G  23% /var/lib/ceph/osd/ceph-4
/dev/sdf1   136G   62G   75G  46% /var/lib/ceph/osd/ceph-5
/dev/sdg1   136G   45G   92G  33% /var/lib/ceph/osd/ceph-6
/dev/sdh1   136G   28G  109G  21% /var/lib/ceph/osd/ceph-7

root@kaffee:~# df -h | grep ceph
/dev/sdc  5.5T  983G  4.5T  18% /var/lib/ceph/osd/ceph-2

root@tee:~# df -h | grep ceph
/dev/sdb5.5T  967G  4.6T  18% /var/lib/ceph/osd/ceph-1


On wein 48G are stored on average per osd, tee/kaffee store 952G on average.
(58+64+31+62+45+28)/6 = 48.0
(967+938)/2 = 952.5


The weight ratio of a kaffee/tee osd to a wein osd is
5.5/0.05 = 110.0

The usage ratio of a kaffee/tee osd to a wein osd is
(967+938)/2 = 952.5
952.5/48 = 19.84375

So ceph is allocating 5.5 times more storage to the wein osds than
what we want it to:
110/(952.5/48) = 5.543307086614173

We are also a bit puzzled that in the osd tree the host weight for wein is 0.3
while tee/kaffee show 1. So for wein the host weight is the sum of its OSDs,
but for kaffee and tee it is not.
However, looking at the crushmap, the host bucket weight is shown as 5.5!

Has anyone an idea what may be going wrong here? 

While writing this I noted that the relation / factor is exactly 5.5 times
wrong, so I *guess* that ceph treats all hosts with the same weight (even though
it looks different to me in the osd tree and the crushmap)?

You find our crushmap attached below.

Cheers,

Nico

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host wein {
id -2   # do not change unnecessarily
# weight 0.300
alg straw
hash 0  # rjenkins1
item osd.0 weight 0.050
item osd.3 weight 0.050
item osd.4 weight 0.050
item osd.5 weight 0.050
item osd.6 weight 0.050
item osd.7 weight 0.050
}
host tee {
id -3   # do not change unnecessarily
# weight 5.500
alg straw
hash 0  # rjenkins1
item osd.1 weight 5.500
}
host kaffee {
id -4   # do not change unnecessarily
# weight 5.500
alg straw
hash 0  # rjenkins1
item osd.2 weight 5.500
}
root default {
id -1   # do not change unnecessarily
# weight 2.300
alg straw
hash 0  # rjenkins1
item wein weight 0.300
item tee weight 1.000
item kaffee weight 1.000
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
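
One thing that stands out in the map: in the "root default" bucket, tee and
kaffee are listed with weight 1.000 even though their host buckets sum to
5.500. If that is indeed the cause, a possible fix (untested sketch) would be
to edit the map directly:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # in "root default", change "item tee weight 1.000" and
    # "item kaffee weight 1.000" to weight 5.500, matching the host buckets
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new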



-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weights: Hosts vs. OSDs

2014-12-30 Thread Nico Schottelius
Hey Lindsay,

Lindsay Mathieson [Wed, Dec 31, 2014 at 06:23:10AM +1000]:
 On Tue, 30 Dec 2014 05:07:31 PM Nico Schottelius wrote:
  While writing this I noted that the relation / factor is exactly 5.5 times
  wrong, so I *guess* that ceph treats all hosts with the same weight (even
  though it looks differently to me in the osd tree and the crushmap)?
 
 I believe if you have the default replication factor of 3, then with 3 hosts
 you will effectively have a weight of 1 per host no matter what you specify
 because ceph will be forced to place a copy of all data on each host to 
 satisfy replication requirements.

sorry, I forgot to mention we are using size = 2.

Cheers,

Nico

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24


pgpDlVhv_Bove.pgp
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph PG Incomplete = Cluster unusable

2014-12-29 Thread Nico Schottelius
Hey Christian,

Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
 [incomplete PG / RBD hanging, osd lost also not helping]

that is very interesting to hear, because we had a similar situation
with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
directories to allow OSDs to start after the disk filled up completely.

So I am sorry not to be able to give you a good hint, but I am very
interested in seeing your problem solved, as it is a show stopper for
us, too. (*)

Cheers,

Nico

(*) We migrated from sheepdog to gluster to ceph and so far sheepdog
seems to run much smoother. The first one, however, is not supported
by opennebula directly, and the second one is not flexible enough to host
our heterogeneous infrastructure (mixed disk sizes/amounts) - so we 
are using ceph at the moment.

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Nico Schottelius
Hey Jiri,

also raise the pgp_num (pg_num != pgp_num - it's easy to overlook).
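
Roughly, and assuming the pool is called rbd as in your output (a sketch):

    ceph osd pool set rbd pgp_num 133    # bring pgp_num up to pg_num
    # with only 2 hosts you may also want size 2 (we run size = 2 ourselves):
    # ceph osd pool set rbd size 2
    # ceph osd pool set rbd min_size 1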

Cheers,

Nico

Jiri Kanicky [Sun, Dec 28, 2014 at 01:52:39AM +1100]:
 Hi,
 
 I just build my CEPH cluster but having problems with the health of
 the cluster.
 
 Here are few details:
 - I followed the ceph documentation.
 - I used btrfs filesystem for all OSDs
 - I did not set osd pool default size = 2  as I thought that if I
 have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
 was right.
 - I noticed that default pools data,metadata were not created.
 Only rbd pool was created.
 - As it was complaining that the pg_num is too low, I increased the
 pg_num for pool rbd to 133 (400/3) and ended up with pool rbd pg_num
 133 > pgp_num 64.
 
 Would you give me hint where I have made the mistake? (I can remove
 the OSDs and start over if needed.)
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph health
 HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
 unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
 133 > pgp_num 64
 cephadmin@ceph1:/etc/ceph$ sudo ceph status
 cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
  health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
 pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
 rbd pg_num 133 > pgp_num 64
  monmap e1: 2 mons at
 {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
 epoch 8, quorum 0,1 ceph1,ceph2
  osdmap e42: 4 osds: 4 up, 4 in
   pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
 11704 kB used, 11154 GB / 11158 GB avail
   29 active+undersized+degraded
  104 active+remapped
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
 # id    weight  type name       up/down reweight
 -1  10.88   root default
 -2  5.44host ceph1
 0   2.72osd.0   up  1
 1   2.72osd.1   up  1
 -3  5.44host ceph2
 2   2.72osd.2   up  1
 3   2.72osd.3   up  1
 
 
 cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
 0 rbd,
 
 cephadmin@ceph1:/etc/ceph$ cat ceph.conf
 [global]
 fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 public_network = 192.168.30.0/24
 cluster_network = 10.1.1.0/24
 mon_initial_members = ceph1, ceph2
 mon_host = 192.168.30.21,192.168.30.22
 auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 filestore_xattr_use_omap = true
 
 Thank you
 Jiri

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running instances on ceph with openstack

2014-12-23 Thread Nico Schottelius
Hello Ali Shah,

we are running VMs using Opennebula with ceph as the backend, so far
with varying results: from time to time VMs freeze, probably
panicking when the load on the ceph storage is too high due to rebalance
work.

We are experimenting with --osd-max-backfills 1, but it hasn't solved
the problem completely.
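
Concretely, the knobs are injected at runtime roughly like this (a sketch;
the two recovery options are further things one can try, and the same values
can go into the [osd] section of ceph.conf to persist them):

    ceph tell osd.* injectargs '--osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1'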

Cheers,

Nico

Zeeshan Ali Shah [Tue, Dec 23, 2014 at 09:12:25AM +0100]:
 Has anyone tried running instances over ceph, i.e. using ceph as the backend for
 VM storage? How would you get instant migration in that case, since every
 compute host will have its own RBD? The other option is to have a big rbd
 pool on the head node and share it with NFS to have a shared file system.
 
 any idea ?
 
 -- 
 
 Regards
 
 Zeeshan Ali Shah
 System Administrator - PDC HPC
 PhD researcher (IT security)
 Kungliga Tekniska Hogskolan
 +46 8 790 9115
 http://www.pdc.kth.se/members/zashah

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Behaviour of a cluster with full OSD(s)

2014-12-23 Thread Nico Schottelius
Max, List,

Max Power [Tue, Dec 23, 2014 at 12:34:54PM +0100]:
 [...Recovering from full osd ...] 
 
 Normally
 the osd process quits then and I cannot restart it (even after setting the
 replicas back). The only possibility is to manually delete complete PG folders
 after exploring them with 'pg dump'. Is this the only way to get it back
 working again?

I was wondering whether ceph-osd crashing when the disk gets full shouldn't
be considered a bug?

Shouldn't ceph osd be able to recover itself? Like if an admin detects
that the disk is full, she can simply reduce the weight of the osd to
free up space. With a dead osd, this is not possible.
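
For completeness, the kind of recovery I have in mind - which only works as
long as the osd still runs - looks roughly like this (a sketch, osd.2 is just
an example id):

    ceph pg set_full_ratio 0.97        # temporarily raise the full threshold
    ceph osd crush reweight osd.2 0.8  # move data away from the full osd
    ceph -w                            # watch backfill until usage drops
    ceph pg set_full_ratio 0.95        # back to the default afterwards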

To those having deeper ceph knowledge: 

For what reason does ceph-osd exit when the disk is full?
Why can it not start when it is full to get itself out of this 
invidious situation?

Cheers,

Nico

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy state of documentation [was: OSD JOURNAL not associated - ceph-disk list ?]

2014-12-22 Thread Nico Schottelius
Yes - we are mostly using cdist [0] and also plan to support
ceph mid term.

The website is by the way running on a VM managed with Opennebula and
stored on the ceph cluster - so in case it is not reachable, you can
guess why ;-)

[0] http://www.nico.schottelius.org/software/cdist/

Craig Lewis [Mon, Dec 22, 2014 at 11:44:43AM -0800]:
 I get the impression that more people on the ML are using a config
 management system.  ceph-deploy questions seem to come from new users
 following the quick start guide.
 
 I know both Puppet and Chef are fairly well represented here.  I've seen a
 few posts about Salt and Ansible, but not much.  Calamari is built on top
 of Salt, so I suppose that means Salt is well represented.  I really
 haven't seen anything from the CFEngine or Bcfg2 camps.
 
 
 I'm personally using Chef with a private fork of the Ceph cookbook.  The
 Ceph cookbook doesn't use ceph-deploy, but it does use ceph-disk.  Whenever
 I have problems with the ceph-disk command, I first go look at the cookbook
 to see how it's doing things.
 
 
 
 On Sun, Dec 21, 2014 at 10:37 AM, Nico Schottelius 
 nico-ceph-us...@schottelius.org wrote:
 
  Hello list,
 
  I am a bit wondering about ceph-deploy and the development of ceph: I
  see that many people in the community are pushing towards the use of
  ceph-deploy, likely to ease use of ceph.
 
  However, I have run multiple times into issues using ceph-deploy, when
  it failed or incorrectly set up partitions or created a cluster of
  monitors that never reached a quorum.
 
  I have also found debugging and learning ceph to be much more
  difficult with ceph-deploy, compared to going the manual way, because as
  a user I miss a lot of information.
 
  Furthermore as the maintainer of a configuration management system [0],
  I am interested in knowing how things are working behind the scenes to
  be able to automate them.
 
  Thus I was wondering if it is an option for the ceph community to
  focus on both ways (the manual one and ceph-deploy) instead of just pushing
  ceph-deploy?
 
  Cheers,
 
  Nico
 
  p.s.: Loic, just taking your mail as an example, but it is not personal
  - just want to show my point.
 
  Loic Dachary [Sun, Dec 21, 2014 at 06:08:27PM +0100]:
   [...]
  
   Is there a reason why you need to do this instead of letting ceph-disk
  prepare do it for you ?
  
   [...]
 
  --
  New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy state of documentation [was: OSD JOURNAL not associated - ceph-disk list ?]

2014-12-21 Thread Nico Schottelius
Hello list,

I am a bit wondering about ceph-deploy and the development of ceph: I
see that many people in the community are pushing towards the use of
ceph-deploy, likely to ease use of ceph.

However, I have run multiple times into issues using ceph-deploy, when
it failed or incorrectly set up partitions or created a cluster of
monitors that never reached a quorum.

I have also found debugging and learning ceph to be much more
difficult with ceph-deploy, compared to going the manual way, because as
a user I miss a lot of information.

Furthermore as the maintainer of a configuration management system [0],
I am interested in knowing how things are working behind the scenes to
be able to automate them.
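
To give an idea of the level of detail I mean: the manual OSD bring-up boils
down to something like the following (a rough sketch along the lines of the
manual deployment docs; device, weight and hostname are placeholders):

    OSD_ID=$(ceph osd create)                  # allocate a new osd id
    mkdir -p /var/lib/ceph/osd/ceph-$OSD_ID
    mkfs.xfs /dev/sdX1
    mount /dev/sdX1 /var/lib/ceph/osd/ceph-$OSD_ID
    ceph-osd -i $OSD_ID --mkfs --mkkey         # create the osd data and keyring
    ceph auth add osd.$OSD_ID osd 'allow *' mon 'allow rwx' \
        -i /var/lib/ceph/osd/ceph-$OSD_ID/keyring
    ceph osd crush add osd.$OSD_ID 1.0 root=default host=$(hostname -s)
    service ceph start osd.$OSD_ID             # or: start ceph-osd id=$OSD_ID (upstart)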

Thus I was wondering if it is an option for the ceph community to
focus on both ways (the manual one and ceph-deploy) instead of just pushing
ceph-deploy?

Cheers,

Nico

p.s.: Loic, just taking your mail as an example, but it is not personal
- just want to show my point.

Loic Dachary [Sun, Dec 21, 2014 at 06:08:27PM +0100]:
 [...]
 
 Is there a reason why you need to do this instead of letting ceph-disk 
 prepare do it for you ?
 
 [...]

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Hanging VMs with Qemu + RBD

2014-12-19 Thread Nico Schottelius
Hello,

another issue we have experienced with qemu VMs 
(qemu 2.0.0) with ceph-0.80 on Ubuntu 14.04 
managed by opennebula 4.10.1:

The VMs are completely frozen when rebalancing takes place,
they do not even respond to ping anymore.

Looking at the qemu processes they are in state Sl.

Is this a known problem / have others seen this behaviour?

I have not yet tuned any backfill parameters and it is a
cluster of 3 hosts, with one host having 6 osds and the other two having
one each (so 8 osds in total).

Our qemu runs with these rbd related options:

qemu-system-x86_64 ... -drive 

file=rbd:one/one-38:id=libvirt:key=...:auth_supported=cephx\;none:mon_host=kaffee.private.ungleich.ch\;wein.private.ungleich.ch\;tee.private.ungleich.ch,if=none,id=drive-ide0-0-0,format=raw,cache=none
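
For what it's worth, slow or blocked requests during such a freeze can be
inspected roughly like this (a sketch, osd.0 is just an example id):

    ceph health detail    # look for blocked/slow request warnings
    ceph -w               # follow the cluster log while the VM is frozen
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight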

Cheers,

Nico

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com