Re: [ceph-users] bad crc/signature errors

2017-10-04 Thread Adrian Saul

We see the same messages and are similarly on a 4.4 KRBD version that is 
affected by this.

I have seen no impact from it so far, as far as I know.
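
If you want to gauge how widespread it is on a given client, a rough check
along these lines (a sketch; run it on the krbd client host, and note that
the osd id is usually the third field of those log lines) shows the running
kernel and which OSDs are logging the errors:

  # kernel version on the RBD client
  uname -r
  # count "bad crc/signature" messages per OSD
  dmesg | grep 'bad crc/signature' | awk '{print $3}' | sort | uniq -c | sort -rn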


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jason Dillaman
> Sent: Thursday, 5 October 2017 5:45 AM
> To: Gregory Farnum 
> Cc: ceph-users ; Josy
> 
> Subject: Re: [ceph-users] bad crc/signature errors
>
> Perhaps this is related to a known issue on some 4.4 and later kernels [1]
> where the stable write flag was not preserved by the kernel?
>
> [1] http://tracker.ceph.com/issues/19275
>
> On Wed, Oct 4, 2017 at 2:36 PM, Gregory Farnum 
> wrote:
> > That message indicates that the checksums of messages between your
> > kernel client and OSD are incorrect. It could be actual physical
> > transmission errors, but if you don't see other issues then this isn't
> > fatal; they can recover from it.
> >
> > On Wed, Oct 4, 2017 at 8:52 AM Josy 
> wrote:
> >>
> >> Hi,
> >>
> >> We have setup a cluster with 8 OSD servers (31 disks)
> >>
> >> Ceph health is Ok.
> >> --
> >> [root@las1-1-44 ~]# ceph -s
> >>cluster:
> >>  id: de296604-d85c-46ab-a3af-add3367f0e6d
> >>  health: HEALTH_OK
> >>
> >>services:
> >>  mon: 3 daemons, quorum
> >> ceph-las-mon-a1,ceph-las-mon-a2,ceph-las-mon-a3
> >>  mgr: ceph-las-mon-a1(active), standbys: ceph-las-mon-a2
> >>  osd: 31 osds: 31 up, 31 in
> >>
> >>data:
> >>  pools:   4 pools, 510 pgs
> >>  objects: 459k objects, 1800 GB
> >>  usage:   5288 GB used, 24461 GB / 29749 GB avail
> >>  pgs: 510 active+clean
> >> 
> >>
> >> We created a pool and mounted it as RBD in one of the client server.
> >> While adding data to it, we see this below error :
> >>
> >> 
> >> [939656.039750] libceph: osd20 10.255.0.9:6808 bad crc/signature
> >> [939656.041079] libceph: osd16 10.255.0.8:6816 bad crc/signature
> >> [939735.627456] libceph: osd11 10.255.0.7:6800 bad crc/signature
> >> [939735.628293] libceph: osd30 10.255.0.11:6804 bad crc/signature
> >>
> >> =
> >>
> >> Can anyone explain what is this and if I can fix it ?
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Joao Eduardo Luis

On 10/04/2017 09:19 PM, Gregory Farnum wrote:
Oh, hmm, you're right. I see synchronization starts but it seems to 
progress very slowly, and it certainly doesn't complete in that 2.5 
minute logging window. I don't see any clear reason why it's so slow; it 
might be clearer if you could provide the logs from the other monitors for 
the same time window (especially since you now say they are getting stuck in the 
electing state during that period). Perhaps Kefu or Joao will have some 
clearer idea what the problem is.

-Greg


I haven't gone through logs yet (maybe Friday, it's late today and it's 
a holiday tomorrow), but not so long ago I seem to recall someone having 
a similar issue with the monitors that was solely related to a switch's 
MTU being too small.


Maybe that could be the case? If not, I'll take a look at the logs as 
soon as possible.
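
If someone wants to rule that out quickly, a rough check of the path MTU
between the monitor hosts could look like the following (a sketch only;
interface name, peer hostname and packet sizes are examples, adjust them
to your setup):

  # MTU configured on the interface
  ip link show eth0 | grep -o 'mtu [0-9]*'
  # IPv6 ping with fragmentation forbidden; 1452 = 1500 - 48 bytes of
  # IPv6 + ICMPv6 headers, use a larger size if you run jumbo frames
  ping6 -M do -c 3 -s 1452 server2

If the large pings fail while small ones go through, the switch MTU is a
likely suspect.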


  -Joao



On Wed, Oct 4, 2017 at 1:04 PM Nico Schottelius
<nico.schottel...@ungleich.ch> wrote:



Some more detail:

when restarting the monitor on server1, it stays in synchronizing state
forever.

However, the other two monitors change into the electing state.

I have double-checked that there are no (host) firewalls active and
that the clocks on the hosts are within 1 second of each other (they all
have ntpd running).

We are running everything on IPv6, but this should not be a problem,
should it?

Best,

Nico


Nico Schottelius <nico.schottel...@ungleich.ch> writes:

 > Hello Gregory,
 >
 > the logfile I produced has already debug mon = 20 set:
 >
 > [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
 > debug mon = 20
 >
 > It is clear that server1 is out of quorum, however how do we make it
 > being part of the quorum again?
 >
 > I expected that the quorum finding process is triggered automatically
 > after restarting the monitor, or is that incorrect?
 >
 > Best,
 >
 > Nico
 >
 >
 > Gregory Farnum <gfar...@redhat.com> writes:
 >
 >> You'll need to change the config so that it's running "debug mon = 20" for
 >> the log to be very useful here. It does say that it's dropping client
 >> connections because it's been out of quorum for too long, which is the
 >> correct behavior in general. I'd imagine that you've got clients trying to
 >> connect to the new monitor instead of the ones already in the quorum and
 >> not passing around correctly; this is all configurable.
 >>
 >> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
 >> nico.schottel...@ungleich.ch> wrote:
 >>
 >>>
 >>> Good morning,
 >>>
 >>> we have recently upgraded our kraken cluster to luminous and since then
 >>> noticed an odd behaviour: we cannot add a monitor anymore.
 >>>
 >>> As soon as we start a new monitor (server2), ceph -s and ceph -w start to
 >>> hang.
 >>>
 >>> The situation became worse, since one of our staff stopped an existing
 >>> monitor (server1), as restarting that monitor results in the same
 >>> situation, ceph -s hangs until we stop the monitor again.
 >>>
 >>> We kept the monitor running for some minutes, but the situation never
 >>> cleares up.
 >>>
 >>> The network does not have any firewall in between the nodes and there
 >>> are no host firewalls.
 >>>
 >>> I have attached the output of the monitor on server1, running in
 >>> foreground using
 >>>
 >>> root@server1:~# ceph-mon -i server1 --pid-file
 >>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
 >>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
 >>>
 >>> Does anyone see any obvious problem in the attached log?
 >>>
 >>> Any input or hint would be appreciated!
 >>>
 >>> Best,
 >>>
 >>> Nico
 >>>
 >>>
 >>>
 >>> --
 >>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
 >>> ___
 >>> ceph-users mailing list
 >>> ceph-users@lists.ceph.com 
 >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 >>>


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic timeline

2017-10-04 Thread Sage Weil
On Wed, 4 Oct 2017, Sage Weil wrote:
> Hi everyone,
> 
> After further discussion we are targetting 9 months for Mimic 13.2.0:
> 
>  - Mar 16, 2018 feature freeze
>  - May 1, 2018 release
> 
> Upgrades for Mimic will be from Luminous only (we've already made that a 
> required stop), but we plan to allow Luminous -> Nautilus too (and Mimic 
> -> O).

Also, each release will be stable, with bug fixes for 2 cycles (18 
months).  Here are the proposed changes to the release cycle docs:

https://github.com/ceph/ceph/pull/18117/files

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mimic timeline

2017-10-04 Thread Sage Weil
Hi everyone,

After further discussion we are targeting 9 months for Mimic 13.2.0:

 - Mar 16, 2018 feature freeze
 - May 1, 2018 release

Upgrades for Mimic will be from Luminous only (we've already made that a 
required stop), but we plan to allow Luminous -> Nautilus too (and Mimic 
-> O).
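
(If you want to check where a given cluster stands before planning that
jump, something along these lines works on Luminous -- a sketch only:

  ceph versions                              # per-daemon version summary
  ceph osd dump | grep require_osd_release   # release the OSDs are pinned to
)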

Nautilus is planned for 9 months after that... Feb 1, 2019.

Thanks, everyone, for your input!  We hope this works well enough for 
everyone, and that additional predictability makes your planning easier.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Gregory Farnum
Oh, hmm, you're right. I see synchronization starts but it seems to
progress very slowly, and it certainly doesn't complete in that 2.5 minute
logging window. I don't see any clear reason why it's so slow; it might be
clearer if you could provide the logs from the other monitors for the same time window
(especially since you now say they are getting stuck in the electing state
during that period). Perhaps Kefu or Joao will have some clearer idea what
the problem is.
-Greg

On Wed, Oct 4, 2017 at 1:04 PM Nico Schottelius <
nico.schottel...@ungleich.ch> wrote:

>
> Some more detail:
>
> when restarting the monitor on server1, it stays in synchronizing state
> forever.
>
> However the other two monitors change into electing state.
>
> I have double checked that there are not (host) firewalls active and
> that the times are within 1 second different of the hosts (they all have
> ntpd running).
>
> We are running everything on IPv6, but this should not be a problem,
> should it?
>
> Best,
>
> Nico
>
>
> Nico Schottelius  writes:
>
> > Hello Gregory,
> >
> > the logfile I produced has already debug mon = 20 set:
> >
> > [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
> > debug mon = 20
> >
> > It is clear that server1 is out of quorum, however how do we make it
> > being part of the quorum again?
> >
> > I expected that the quorum finding process is triggered automatically
> > after restarting the monitor, or is that incorrect?
> >
> > Best,
> >
> > Nico
> >
> >
> > Gregory Farnum  writes:
> >
> >> You'll need to change the config so that it's running "debug mon = 20"
> for
> >> the log to be very useful here. It does say that it's dropping client
> >> connections because it's been out of quorum for too long, which is the
> >> correct behavior in general. I'd imagine that you've got clients trying
> to
> >> connect to the new monitor instead of the ones already in the quorum and
> >> not passing around correctly; this is all configurable.
> >>
> >> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
> >> nico.schottel...@ungleich.ch> wrote:
> >>
> >>>
> >>> Good morning,
> >>>
> >>> we have recently upgraded our kraken cluster to luminous and since then
> >>> noticed an odd behaviour: we cannot add a monitor anymore.
> >>>
> >>> As soon as we start a new monitor (server2), ceph -s and ceph -w start
> to
> >>> hang.
> >>>
> >>> The situation became worse, since one of our staff stopped an existing
> >>> monitor (server1), as restarting that monitor results in the same
> >>> situation, ceph -s hangs until we stop the monitor again.
> >>>
> >>> We kept the monitor running for some minutes, but the situation never
> >>> cleares up.
> >>>
> >>> The network does not have any firewall in between the nodes and there
> >>> are no host firewalls.
> >>>
> >>> I have attached the output of the monitor on server1, running in
> >>> foreground using
> >>>
> >>> root@server1:~# ceph-mon -i server1 --pid-file
> >>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
> >>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
> >>>
> >>> Does anyone see any obvious problem in the attached log?
> >>>
> >>> Any input or hint would be appreciated!
> >>>
> >>> Best,
> >>>
> >>> Nico
> >>>
> >>>
> >>>
> >>> --
> >>> Modern, affordable, Swiss Virtual Machines. Visit
> www.datacenterlight.ch
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
>
>
> --
> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius

Some more detail:

when restarting the monitor on server1, it stays in synchronizing state
forever.

However, the other two monitors change into the electing state.

I have double-checked that there are no (host) firewalls active and
that the clocks on the hosts are within 1 second of each other (they all
have ntpd running).

We are running everything on IPv6, but this should not be a problem,
should it?

Best,

Nico


Nico Schottelius  writes:

> Hello Gregory,
>
> the logfile I produced has already debug mon = 20 set:
>
> [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
> debug mon = 20
>
> It is clear that server1 is out of quorum, however how do we make it
> being part of the quorum again?
>
> I expected that the quorum finding process is triggered automatically
> after restarting the monitor, or is that incorrect?
>
> Best,
>
> Nico
>
>
> Gregory Farnum  writes:
>
>> You'll need to change the config so that it's running "debug mon = 20" for
>> the log to be very useful here. It does say that it's dropping client
>> connections because it's been out of quorum for too long, which is the
>> correct behavior in general. I'd imagine that you've got clients trying to
>> connect to the new monitor instead of the ones already in the quorum and
>> not passing around correctly; this is all configurable.
>>
>> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
>> nico.schottel...@ungleich.ch> wrote:
>>
>>>
>>> Good morning,
>>>
>>> we have recently upgraded our kraken cluster to luminous and since then
>>> noticed an odd behaviour: we cannot add a monitor anymore.
>>>
>>> As soon as we start a new monitor (server2), ceph -s and ceph -w start to
>>> hang.
>>>
>>> The situation became worse, since one of our staff stopped an existing
>>> monitor (server1), as restarting that monitor results in the same
>>> situation, ceph -s hangs until we stop the monitor again.
>>>
>>> We kept the monitor running for some minutes, but the situation never
>>> cleares up.
>>>
>>> The network does not have any firewall in between the nodes and there
>>> are no host firewalls.
>>>
>>> I have attached the output of the monitor on server1, running in
>>> foreground using
>>>
>>> root@server1:~# ceph-mon -i server1 --pid-file
>>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
>>>
>>> Does anyone see any obvious problem in the attached log?
>>>
>>> Any input or hint would be appreciated!
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>>
>>> --
>>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph multi active mds and failover with ceph version 12.2.1

2017-10-04 Thread Pavan, Krish
I am evaluating multi-active MDS (2 active and 1 standby) and the failover 
seems to take a long time.

  1.  Is there any way to reduce the failover time so that the mount point 
(ceph-fuse) does not hang?
  2.  Is the mds_standby_replay config option a valid way to warm up the 
metadata cache with multiple active MDS daemons? (See the sketch below.)
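
A rough sketch of the knobs involved (option names as of Jewel/Luminous;
mds.c and the values are placeholders, not recommendations):

  # ceph.conf on the standby MDS
  [mds.c]
  mds standby replay = true      # tail an active rank's journal for a warm cache
  mds standby for rank = 0       # optionally dedicate this standby to one rank

  # seconds before a silent MDS is considered laggy and replaced
  [global]
  mds beacon grace = 15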

Regards
Krish

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius

Hello Gregory,

the logfile I produced has already debug mon = 20 set:

[21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
debug mon = 20

It is clear that server1 is out of quorum; however, how do we make it
part of the quorum again?

I expected that the quorum finding process is triggered automatically
after restarting the monitor, or is that incorrect?
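
For what it's worth, a quick way to watch what the restarted mon thinks it
is doing (a sketch, assuming the default admin socket location):

  # on server1, while ceph-mon is running
  ceph daemon mon.server1 mon_status       # probing / synchronizing / electing / ...
  # raise mon logging at runtime, without a restart
  ceph daemon mon.server1 config set debug_mon 20/20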

Best,

Nico


Gregory Farnum  writes:

> You'll need to change the config so that it's running "debug mon = 20" for
> the log to be very useful here. It does say that it's dropping client
> connections because it's been out of quorum for too long, which is the
> correct behavior in general. I'd imagine that you've got clients trying to
> connect to the new monitor instead of the ones already in the quorum and
> not passing around correctly; this is all configurable.
>
> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Good morning,
>>
>> we have recently upgraded our kraken cluster to luminous and since then
>> noticed an odd behaviour: we cannot add a monitor anymore.
>>
>> As soon as we start a new monitor (server2), ceph -s and ceph -w start to
>> hang.
>>
>> The situation became worse, since one of our staff stopped an existing
>> monitor (server1), as restarting that monitor results in the same
>> situation, ceph -s hangs until we stop the monitor again.
>>
>> We kept the monitor running for some minutes, but the situation never
>> cleares up.
>>
>> The network does not have any firewall in between the nodes and there
>> are no host firewalls.
>>
>> I have attached the output of the monitor on server1, running in
>> foreground using
>>
>> root@server1:~# ceph-mon -i server1 --pid-file
>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
>>
>> Does anyone see any obvious problem in the attached log?
>>
>> Any input or hint would be appreciated!
>>
>> Best,
>>
>> Nico
>>
>>
>>
>> --
>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad crc/signature errors

2017-10-04 Thread Jason Dillaman
Perhaps this is related to a known issue on some 4.4 and later kernels
[1] where the stable write flag was not preserved by the kernel?

[1] http://tracker.ceph.com/issues/19275
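
(One way to see whether stable pages are in play on the client -- a sketch,
where rbd0 is just an example name for the mapped device:

  cat /sys/block/rbd0/bdi/stable_pages_required   # 1 = stable pages requested

On kernels hit by the tracker issue above, this flag may not be set the way
you would expect.)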

On Wed, Oct 4, 2017 at 2:36 PM, Gregory Farnum  wrote:
> That message indicates that the checksums of messages between your kernel
> client and OSD are incorrect. It could be actual physical transmission
> errors, but if you don't see other issues then this isn't fatal; they can
> recover from it.
>
> On Wed, Oct 4, 2017 at 8:52 AM Josy  wrote:
>>
>> Hi,
>>
>> We have setup a cluster with 8 OSD servers (31 disks)
>>
>> Ceph health is Ok.
>> --
>> [root@las1-1-44 ~]# ceph -s
>>cluster:
>>  id: de296604-d85c-46ab-a3af-add3367f0e6d
>>  health: HEALTH_OK
>>
>>services:
>>  mon: 3 daemons, quorum
>> ceph-las-mon-a1,ceph-las-mon-a2,ceph-las-mon-a3
>>  mgr: ceph-las-mon-a1(active), standbys: ceph-las-mon-a2
>>  osd: 31 osds: 31 up, 31 in
>>
>>data:
>>  pools:   4 pools, 510 pgs
>>  objects: 459k objects, 1800 GB
>>  usage:   5288 GB used, 24461 GB / 29749 GB avail
>>  pgs: 510 active+clean
>> 
>>
>> We created a pool and mounted it as RBD in one of the client server.
>> While adding data to it, we see this below error :
>>
>> 
>> [939656.039750] libceph: osd20 10.255.0.9:6808 bad crc/signature
>> [939656.041079] libceph: osd16 10.255.0.8:6816 bad crc/signature
>> [939735.627456] libceph: osd11 10.255.0.7:6800 bad crc/signature
>> [939735.628293] libceph: osd30 10.255.0.11:6804 bad crc/signature
>>
>> =
>>
>> Can anyone explain what is this and if I can fix it ?
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bad crc/signature errors

2017-10-04 Thread Gregory Farnum
That message indicates that the checksums of messages between your kernel
client and OSD are incorrect. It could be actual physical transmission
errors, but if you don't see other issues then this isn't fatal; they can
recover from it.

On Wed, Oct 4, 2017 at 8:52 AM Josy  wrote:

> Hi,
>
> We have setup a cluster with 8 OSD servers (31 disks)
>
> Ceph health is Ok.
> --
> [root@las1-1-44 ~]# ceph -s
>cluster:
>  id: de296604-d85c-46ab-a3af-add3367f0e6d
>  health: HEALTH_OK
>
>services:
>  mon: 3 daemons, quorum ceph-las-mon-a1,ceph-las-mon-a2,ceph-las-mon-a3
>  mgr: ceph-las-mon-a1(active), standbys: ceph-las-mon-a2
>  osd: 31 osds: 31 up, 31 in
>
>data:
>  pools:   4 pools, 510 pgs
>  objects: 459k objects, 1800 GB
>  usage:   5288 GB used, 24461 GB / 29749 GB avail
>  pgs: 510 active+clean
> 
>
> We created a pool and mounted it as RBD in one of the client server.
> While adding data to it, we see this below error :
>
> 
> [939656.039750] libceph: osd20 10.255.0.9:6808 bad crc/signature
> [939656.041079] libceph: osd16 10.255.0.8:6816 bad crc/signature
> [939735.627456] libceph: osd11 10.255.0.7:6800 bad crc/signature
> [939735.628293] libceph: osd30 10.255.0.11:6804 bad crc/signature
>
> =
>
> Can anyone explain what is this and if I can fix it ?
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pg on erasure coded pool

2017-10-04 Thread Gregory Farnum
This says it's actually missing one object, and a repair won't fix that (if
it could, the object wouldn't be missing!). There should be more details
somewhere in the logs about which object.
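
A sketch of how to dig that out -- the pg id comes from the health output
and the log path is the packaging default:

  # list what the last scrub recorded as inconsistent for this pg
  rados list-inconsistent-obj 5.144 --format=json-pretty
  # or grep the primary's logs for the object named in the error
  zgrep 'missing' /var/log/ceph/ceph-osd.81.log*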

On Wed, Oct 4, 2017 at 5:03 AM Kenneth Waegeman 
wrote:

> Hi,
>
> We have some inconsistency / scrub error on a Erasure coded pool, that I
> can't seem to solve.
>
> [root@osd008 ~]# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 5.144 is active+clean+inconsistent, acting
> [81,119,148,115,142,100,25,63,48,11,43]
> 1 scrub errors
>
> In the log files, it seems there is 1 missing shard:
>
> /var/log/ceph/ceph-osd.81.log.2.gz:2017-10-02 23:49:11.940624
> 7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144s0 shard 63(7)
> missing 5:2297a2e1:::10014e2d8d5.:head
> /var/log/ceph/ceph-osd.81.log.2.gz:2017-10-03 00:48:06.681941
> 7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144s0 deep-scrub 1
> missing, 0 inconsistent objects
> /var/log/ceph/ceph-osd.81.log.2.gz:2017-10-03 00:48:06.681947
> 7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144 deep-scrub 1 errors
>
> I tried running ceph pg repair on the pg, but nothing changed. I also
> tried starting a new deep-scrub on the  osd 81 (ceph osd deep-scrub 81)
> but I don't see any deep-scrub starting at the osd.
>
> How can we solve this ?
>
> Thank you!
>
>
> Kenneth
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Gregory Farnum
You'll need to change the config so that it's running "debug mon = 20" for
the log to be very useful here. It does say that it's dropping client
connections because it's been out of quorum for too long, which is the
correct behavior in general. I'd imagine that you've got clients trying to
connect to the new monitor instead of the ones already in the quorum and
not passing around correctly; this is all configurable.

On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
nico.schottel...@ungleich.ch> wrote:

>
> Good morning,
>
> we have recently upgraded our kraken cluster to luminous and since then
> noticed an odd behaviour: we cannot add a monitor anymore.
>
> As soon as we start a new monitor (server2), ceph -s and ceph -w start to
> hang.
>
> The situation became worse, since one of our staff stopped an existing
> monitor (server1), as restarting that monitor results in the same
> situation, ceph -s hangs until we stop the monitor again.
>
> We kept the monitor running for some minutes, but the situation never
> cleares up.
>
> The network does not have any firewall in between the nodes and there
> are no host firewalls.
>
> I have attached the output of the monitor on server1, running in
> foreground using
>
> root@server1:~# ceph-mon -i server1 --pid-file
> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
>
> Does anyone see any obvious problem in the attached log?
>
> Any input or hint would be appreciated!
>
> Best,
>
> Nico
>
>
>
> --
> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr summarize recovery counters

2017-10-04 Thread Gregory Farnum
On Wed, Oct 4, 2017 at 9:14 AM, Benjeman Meekhof  wrote:
> Wondering if anyone can tell me how to summarize recovery
> bytes/ops/objects from counters available in the ceph-mgr python
> interface?  To put another way, how does the ceph -s command put
> together that infomation and can I access that information from a
> counter queryable by the ceph-mgr python module api?
>
> I want info like the 'recovery' part of the status output.  I have a
> ceph-mgr module that feeds influxdb but I'm not sure what counters
> from ceph-mgr to summarize to create this information.  OSD have
> available a recovery_ops counter which is not quite the same.  Maybe
> the various 'subop_..' counters encompass recovery ops?  It's not
> clear to me but I'm hoping it is obvious to someone more familiar with
> the internals.
>
> io:
> client:   2034 B/s wr, 0 op/s rd, 0 op/s wr
> recovery: 1173 MB/s, 8 keys/s, 682 objects/s


You'll need to run queries against the PGMap. I'm not sure how that
works in the python interfaces but I'm led to believe it's possible.
Documentation is probably all in the PGMap.h header; you can look at
functions like the "recovery_rate_summary" to see what they're doing.
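
Until that is exposed more directly, one stop-gap is to scrape the status
JSON from the CLI -- a sketch, and note the exact field names under "pgmap"
can vary between releases:

  ceph status --format json-pretty | grep -E 'recovering_|degraded_'

which pulls out fields like recovering_bytes_per_sec and
recovering_objects_per_sec while recovery is running.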
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Developers Monthly - October

2017-10-04 Thread Leonardo Vaz
On Wed, Oct 04, 2017 at 03:02:09AM -0300, Leonardo Vaz wrote:
> On Thu, Sep 28, 2017 at 12:08:00AM -0300, Leonardo Vaz wrote:
> > Hey Cephers,
> > 
> > This is just a friendly reminder that the next Ceph Developer Montly
> > meeting is coming up:
> > 
> >  http://wiki.ceph.com/Planning
> > 
> > If you have work that you're doing that it a feature work, significant
> > backports, or anything you would like to discuss with the core team,
> > please add it to the following page:
> > 
> >  http://wiki.ceph.com/CDM_04-OCT-2017
> > 
> > If you have questions or comments, please let us know.
> 
> Hey cephers,
> 
> This is just a friendly reminder the Ceph Developer Montly is confirmed
> for today, October 4 at 12:30pm Eastern Time (EDT), on US/EMEA friendly
> hours.
> 
> If you have any topic to discuss on this meeting, please add it to the
> following pad:
> 
>   http://tracker.ceph.com/projects/ceph/wiki/CDM_04-OCT-2017
> 
> We will use the following Bluejeans URL for the video conference:
> 
>   https://bluejeans.com/707503600

Sorry for the short notice but the meeting URL has been changed:

  https://bluejeans.com/9290089010 

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-mgr summarize recovery counters

2017-10-04 Thread Benjeman Meekhof
Wondering if anyone can tell me how to summarize recovery
bytes/ops/objects from counters available in the ceph-mgr python
interface?  To put it another way, how does the ceph -s command put
together that information, and can I access it from a
counter queryable by the ceph-mgr python module api?

I want info like the 'recovery' part of the status output.  I have a
ceph-mgr module that feeds influxdb but I'm not sure what counters
from ceph-mgr to summarize to create this information.  OSD have
available a recovery_ops counter which is not quite the same.  Maybe
the various 'subop_..' counters encompass recovery ops?  It's not
clear to me but I'm hoping it is obvious to someone more familiar with
the internals.

io:
client:   2034 B/s wr, 0 op/s rd, 0 op/s wr
recovery: 1173 MB/s, 8 keys/s, 682 objects/s

thanks,
Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bad crc/signature errors

2017-10-04 Thread Josy

Hi,

We have set up a cluster with 8 OSD servers (31 disks).

Ceph health is Ok.
--
[root@las1-1-44 ~]# ceph -s
  cluster:
    id: de296604-d85c-46ab-a3af-add3367f0e6d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-las-mon-a1,ceph-las-mon-a2,ceph-las-mon-a3
    mgr: ceph-las-mon-a1(active), standbys: ceph-las-mon-a2
    osd: 31 osds: 31 up, 31 in

  data:
    pools:   4 pools, 510 pgs
    objects: 459k objects, 1800 GB
    usage:   5288 GB used, 24461 GB / 29749 GB avail
    pgs: 510 active+clean


We created a pool and mounted it as an RBD image in one of the client servers. 
While adding data to it, we see the errors below:



[939656.039750] libceph: osd20 10.255.0.9:6808 bad crc/signature
[939656.041079] libceph: osd16 10.255.0.8:6816 bad crc/signature
[939735.627456] libceph: osd11 10.255.0.7:6800 bad crc/signature
[939735.628293] libceph: osd30 10.255.0.11:6804 bad crc/signature

=

Can anyone explain what this is and whether I can fix it?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] inconsistent pg on erasure coded pool

2017-10-04 Thread Kenneth Waegeman

Hi,

We have an inconsistency / scrub error on an erasure-coded pool that I 
can't seem to solve.


[root@osd008 ~]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 5.144 is active+clean+inconsistent, acting 
[81,119,148,115,142,100,25,63,48,11,43]

1 scrub errors

In the log files, it seems there is 1 missing shard:

/var/log/ceph/ceph-osd.81.log.2.gz:2017-10-02 23:49:11.940624 
7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144s0 shard 63(7) 
missing 5:2297a2e1:::10014e2d8d5.:head
/var/log/ceph/ceph-osd.81.log.2.gz:2017-10-03 00:48:06.681941 
7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144s0 deep-scrub 1 
missing, 0 inconsistent objects
/var/log/ceph/ceph-osd.81.log.2.gz:2017-10-03 00:48:06.681947 
7f0a9d7e2700 -1 log_channel(cluster) log [ERR] : 5.144 deep-scrub 1 errors


I tried running ceph pg repair on the pg, but nothing changed. I also 
tried starting a new deep-scrub on osd 81 (ceph osd deep-scrub 81), 
but I don't see any deep-scrub starting on that osd.


How can we solve this ?

Thank you!


Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous cluster stuck when adding monitor

2017-10-04 Thread Nico Schottelius

Good morning,

we have recently upgraded our kraken cluster to luminous and since then
noticed an odd behaviour: we cannot add a monitor anymore.

As soon as we start a new monitor (server2), ceph -s and ceph -w start to hang.

The situation became worse when one of our staff stopped an existing
monitor (server1): restarting that monitor results in the same
situation, and ceph -s hangs until we stop the monitor again.

We kept the monitor running for some minutes, but the situation never
clears up.

The network does not have any firewall in between the nodes and there
are no host firewalls.

I have attached the output of the monitor on server1, running in
foreground using

root@server1:~# ceph-mon -i server1 --pid-file  
/var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph 
--setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog

Does anyone see any obvious problem in the attached log?

Any input or hint would be appreciated!

Best,

Nico



cephmonlog.bz2
Description: BZip2 compressed data


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-10-04 Thread Micha Krause

Hi,


Did you edit the code before trying Luminous?


Yes, I'm still on jewel.



I also noticed from your original mail that it appears you're using multiple 
active metadata servers? If so, that's not stable in Jewel. You may have tripped 
on one of many bugs fixed in Luminous for that configuration.

No, I'm using an active/backup configuration.
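
(For the record, the layout can be double-checked with something like the
following; the exact output wording differs between releases:

  ceph mds stat    # e.g. "1/1/1 up {0=mds1=up:active}, 1 up:standby"
)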


Micha Krause
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why sudden (and brief) HEALTH_ERR

2017-10-04 Thread lists

ok, thanks for the feedback Piotr and Dan!

MJ

On 4-10-2017 9:38, Dan van der Ster wrote:

Since Jewel (AFAIR), when (re)starting OSDs, pg status is reset to "never
contacted", resulting in "pgs are stuck inactive for more than 300 seconds"
being reported until osds regain connections between themselves.



Also, the last_active state isn't updated very regularly, as far as I can tell.
On our cluster I have increased this timeout

--mon_pg_stuck_threshold: 1800

(Which helps suppress these bogus HEALTH_ERR's)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why sudden (and brief) HEALTH_ERR

2017-10-04 Thread Dan van der Ster
On Wed, Oct 4, 2017 at 9:08 AM, Piotr Dałek  wrote:
> On 17-10-04 08:51 AM, lists wrote:
>>
>> Hi,
>>
>> Yesterday I chowned our /var/lib/ceph ceph, to completely finalize our
>> jewel migration, and noticed something interesting.
>>
>> After I brought back up the OSDs I just chowned, the system had some
>> recovery to do. During that recovery, the system went to HEALTH_ERR for a
>> short moment:
>>
>> See below, for consecutive ceph -s outputs:
>>
>>> [..]
>>> root@pm2:~# ceph -s
>>> cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
>>>  health HEALTH_ERR
>>> 2 pgs are stuck inactive for more than 300 seconds
>
>
> ^^ that.
>
>>> 761 pgs degraded
>>> 2 pgs recovering
>>> 181 pgs recovery_wait
>>> 2 pgs stuck inactive
>>> 273 pgs stuck unclean
>>> 543 pgs undersized
>>> recovery 1394085/8384166 objects degraded (16.628%)
>>> 4/24 in osds are down
>>> noout flag(s) set
>>>  monmap e3: 3 mons at
>>> {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
>>> election epoch 256, quorum 0,1,2 0,1,2
>>>  osdmap e10230: 24 osds: 20 up, 24 in; 543 remapped pgs
>>> flags noout,sortbitwise,require_jewel_osds
>>>   pgmap v36531146: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
>>> 32724 GB used, 56656 GB / 89380 GB avail
>>> 1394085/8384166 objects degraded (16.628%)
>>>  543 active+undersized+degraded
>>>  310 active+clean
>>>  181 active+recovery_wait+degraded
>>>   26 active+degraded
>>>   13 active
>>>9 activating+degraded
>>>4 activating
>>>2 active+recovering+degraded
>>> recovery io 133 MB/s, 37 objects/s
>>>   client io 64936 B/s rd, 9935 kB/s wr, 0 op/s rd, 942 op/s wr
>>> [..]
>>
>> It was only very briefly, but it did worry me a bit, fortunately, we went
>> back to the expected HEALTH_WARN very quickly, and everything finished fine,
>> so I guess nothing to worry.
>>
>> But I'm curious: can anyone explain WHY we got a brief HEALTH_ERR?
>>
>> No smart errors, apply and commit latency are all within the expected
>> ranges, the systems basically is healthy.
>>
>> Curious :-)
>
>
> Since Jewel (AFAIR), when (re)starting OSDs, pg status is reset to "never
> contacted", resulting in "pgs are stuck inactive for more than 300 seconds"
> being reported until osds regain connections between themselves.
>

Also, the last_active state isn't updated very regularly, as far as I can tell.
On our cluster I have increased this timeout

--mon_pg_stuck_threshold: 1800

(Which helps suppress these bogus HEALTH_ERR's)
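
(For completeness, a sketch of how that can be applied, either persistently
in ceph.conf or injected at runtime -- 1800 is just the value we settled on:

  # ceph.conf, [mon] section
  mon pg stuck threshold = 1800

  # or on a running cluster
  ceph tell mon.* injectargs '--mon_pg_stuck_threshold 1800'
)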

-- dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why sudden (and brief) HEALTH_ERR

2017-10-04 Thread Piotr Dałek

On 17-10-04 08:51 AM, lists wrote:

Hi,

Yesterday I chowned our /var/lib/ceph to the ceph user, to completely finalize 
our jewel migration, and noticed something interesting.


After I brought back up the OSDs I just chowned, the system had some 
recovery to do. During that recovery, the system went to HEALTH_ERR for a 
short moment:


See below, for consecutive ceph -s outputs:


[..]
root@pm2:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
 health HEALTH_ERR
    2 pgs are stuck inactive for more than 300 seconds


^^ that.


    761 pgs degraded
    2 pgs recovering
    181 pgs recovery_wait
    2 pgs stuck inactive
    273 pgs stuck unclean
    543 pgs undersized
    recovery 1394085/8384166 objects degraded (16.628%)
    4/24 in osds are down
    noout flag(s) set
 monmap e3: 3 mons at 
{0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}

    election epoch 256, quorum 0,1,2 0,1,2
 osdmap e10230: 24 osds: 20 up, 24 in; 543 remapped pgs
    flags noout,sortbitwise,require_jewel_osds
  pgmap v36531146: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
    32724 GB used, 56656 GB / 89380 GB avail
    1394085/8384166 objects degraded (16.628%)
 543 active+undersized+degraded
 310 active+clean
 181 active+recovery_wait+degraded
  26 active+degraded
  13 active
   9 activating+degraded
   4 activating
   2 active+recovering+degraded
recovery io 133 MB/s, 37 objects/s
  client io 64936 B/s rd, 9935 kB/s wr, 0 op/s rd, 942 op/s wr
[..]
It was only very brief, but it did worry me a bit. Fortunately, we went 
back to the expected HEALTH_WARN very quickly, and everything finished fine, 
so I guess it's nothing to worry about.


But I'm curious: can anyone explain WHY we got a brief HEALTH_ERR?

No smart errors, and apply and commit latency are all within the expected 
ranges; the system basically is healthy.


Curious :-)


Since Jewel (AFAIR), when (re)starting OSDs, pg status is reset to "never 
contacted", resulting in "pgs are stuck inactive for more than 300 seconds" 
being reported until osds regain connections between themselves.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com