Re: [ceph-users] Need help for PG problem

2016-03-22 Thread Zhang Qiang
Hi Reddy,
It's over a thousand lines, I pasted it on gist:
https://gist.github.com/dotSlashLu/22623b4cefa06a46e0d4

On Tue, 22 Mar 2016 at 18:15 M Ranga Swami Reddy <swamire...@gmail.com>
wrote:

> Hi,
> Can you please share the "ceph health detail" output?
>
> Thanks
> Swami
>
> On Tue, Mar 22, 2016 at 3:32 PM, Zhang Qiang <dotslash...@gmail.com>
> wrote:
> > Hi all,
> >
> > I have 20 OSDs and 1 pool, and, as recommended by the
> > doc(http://docs.ceph.com/docs/master/rados/operations/placement-groups/),
> I
> > configured pg_num and pgp_num to 4096, size 2, min size 1.
> >
> > But ceph -s shows:
> >
> > HEALTH_WARN
> > 534 pgs degraded
> > 551 pgs stuck unclean
> > 534 pgs undersized
> > too many PGs per OSD (382 > max 300)
> >
> > Why doesn't the recommended value, 4096, for 10 ~ 50 OSDs work? And what
> > does "too many PGs per OSD (382 > max 300)" mean? If each OSD had 382
> > PGs, I would have 7640 PGs in total.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Need help for PG problem

2016-03-22 Thread Zhang Qiang
Hi all,

I have 20 OSDs and 1 pool, and, as recommended by the doc(
http://docs.ceph.com/docs/master/rados/operations/placement-groups/), I
configured pg_num and pgp_num to 4096, size 2, min size 1.
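
For reference, setting up a pool like that boils down to roughly the following commands (the pool name is a placeholder):

ceph osd pool create <pool> 4096 4096
ceph osd pool set <pool> size 2
ceph osd pool set <pool> min_size 1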

But ceph -s shows:

HEALTH_WARN
534 pgs degraded
551 pgs stuck unclean
534 pgs undersized
too many PGs per OSD (382 > max 300)

Why doesn't the recommended value, 4096, for 10 ~ 50 OSDs work? And what
does "too many PGs per OSD (382 > max 300)" mean? If each OSD had 382
PGs, I would have 7640 PGs in total.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help for PG problem

2016-03-22 Thread Zhang Qiang
I got it: the suggested pg_num is the cluster-wide total, so I need to divide
it by the number of replicas.
Thanks Oliver, your answer is very thorough and helpful!
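
Spelling out the arithmetic for my cluster (my understanding of Oliver's formula):

Total PGs = (20 OSDs * 100) / 2 replicas = 1000, rounded to a nearby power of two, e.g. 1024
With 3 replicas: (20 * 100) / 3 = 666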


On 23 March 2016 at 02:19, Oliver Dzombic <i...@ip-interactive.de> wrote:

> Hi Zhang,
>
> yeah i saw your answer already.
>
> First of all, you should make sure that there is no clock skew.
> This can cause some side effects.
>
> 
>
> According to
>
> http://docs.ceph.com/docs/master/rados/operations/placement-groups/
>
> you have to:
>
> Total PGs = (OSDs * 100) / pool size
>
>
> That means:
>
> your 20 OSDs * 100 = 2000
>
> Poolsize is:
>
> Where pool size is either the number of replicas for replicated pools or
> the K+M sum for erasure coded pools (as returned by ceph osd
> erasure-code-profile get).
>
> --
>
> So let's say you have 2 replicas: you should have 2000 / 2 = 1000 PGs.
>
> If you have 3 replicas, you should have 2000 / 3 = 666 PGs.
>
> But you configured 4096 PGs. That's simply far too much.
>
> Reduce it. Or, if you cannot, get more OSDs into this cluster.
>
> I don't know any other way.
>
> Good luck !
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 22.03.2016 um 19:02 schrieb Zhang Qiang:
> > Hi Oliver,
> >
> > Thanks for your reply to my question on Ceph mailing list. I somehow
> > wasn't able to receive your reply in my mailbox, but I saw your reply in
> > the archive, so I have to mail you personally.
> >
> > I have pasted the whole ceph health output on gist:
> > https://gist.github.com/dotSlashLu/22623b4cefa06a46e0d4
> >
> > Hope this will help. Thank you!
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help for PG problem

2016-03-23 Thread Zhang Qiang
And here's the osd tree if it matters.

ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 22.39984 root default
-2 21.39984 host 10
 0  1.06999 osd.0up  1.0  1.0
 1  1.06999 osd.1up  1.0  1.0
 2  1.06999 osd.2up  1.0  1.0
 3  1.06999 osd.3up  1.0  1.0
 4  1.06999 osd.4up  1.0  1.0
 5  1.06999 osd.5up  1.0  1.0
 6  1.06999 osd.6up  1.0  1.0
 7  1.06999 osd.7up  1.0  1.0
 8  1.06999 osd.8up  1.0  1.0
 9  1.06999 osd.9up  1.0  1.0
10  1.06999 osd.10   up  1.0  1.0
11  1.06999 osd.11   up  1.0  1.0
12  1.06999 osd.12   up  1.0  1.0
13  1.06999 osd.13   up  1.0  1.0
14  1.06999 osd.14   up  1.0  1.0
15  1.06999 osd.15   up  1.0  1.0
16  1.06999 osd.16   up  1.0  1.0
17  1.06999 osd.17   up  1.0  1.0
18  1.06999 osd.18   up  1.0  1.0
19  1.06999 osd.19   up  1.0  1.0
-3  1.0 host 148_96
 0  1.0 osd.0up  1.0  1.0

On Wed, 23 Mar 2016 at 19:10 Zhang Qiang <dotslash...@gmail.com> wrote:

> Oliver, Goncalo,
>
> Sorry to disturb again, but recreating the pool with a smaller pg_num
> didn't seem to work, now all 666 pgs are degraded + undersized.
>
> New status:
> cluster d2a69513-ad8e-4b25-8f10-69c4041d624d
>  health HEALTH_WARN
> 666 pgs degraded
> 82 pgs stuck unclean
> 666 pgs undersized
>  monmap e5: 5 mons at {1=
> 10.3.138.37:6789/0,2=10.3.138.39:6789/0,3=10.3.138.40:6789/0,4=10.3.138.59:6789/0,GGZ-YG-S0311-PLATFORM-138=10.3.138.36:6789/0
> }
> election epoch 28, quorum 0,1,2,3,4
> GGZ-YG-S0311-PLATFORM-138,1,2,3,4
>  osdmap e705: 20 osds: 20 up, 20 in
>   pgmap v1961: 666 pgs, 1 pools, 0 bytes data, 0 objects
> 13223 MB used, 20861 GB / 21991 GB avail
>  666 active+undersized+degraded
>
> Only one pool and its size is 3. So I think according to the algorithm,
> (20 * 100) / 3 = 666 pgs is reasonable.
>
> I updated health detail and also attached a pg query result on gist(
> https://gist.github.com/dotSlashLu/22623b4cefa06a46e0d4).
>
> On Wed, 23 Mar 2016 at 09:01 Dotslash Lu <dotslash...@gmail.com> wrote:
>
>> Hello Gonçalo,
>>
>> Thanks for the reminder. I was just setting up the cluster for testing, so
>> don't worry, I can simply remove the pool. And I've learnt that since the
>> replica count and the number of pools both factor into pg_num, I'll consider
>> them carefully before deploying any data.
>>
>> On Mar 23, 2016, at 6:58 AM, Goncalo Borges <goncalo.bor...@sydney.edu.au>
>> wrote:
>>
>> Hi Zhang...
>>
>> If I can add some more info, changing the number of PGs is a heavy operation,
>> and as far as I know, you should NEVER decrease PGs. From the notes in pgcalc (
>> http://ceph.com/pgcalc/):
>>
>> "It's also important to know that the PG count can be increased, but
>> NEVER decreased without destroying / recreating the pool. However,
>> increasing the PG Count of a pool is one of the most impactful events in a
>> Ceph Cluster, and should be avoided for production clusters if possible."
>>
>> So, in your case, I would consider adding more OSDs.
>>
>> Cheers
>> Goncalo
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help for PG problem

2016-03-23 Thread Zhang Qiang
Oliver, Goncalo,

Sorry to disturb again, but recreating the pool with a smaller pg_num
didn't seem to help; now all 666 PGs are degraded + undersized.

New status:
cluster d2a69513-ad8e-4b25-8f10-69c4041d624d
 health HEALTH_WARN
666 pgs degraded
82 pgs stuck unclean
666 pgs undersized
 monmap e5: 5 mons at {1=
10.3.138.37:6789/0,2=10.3.138.39:6789/0,3=10.3.138.40:6789/0,4=10.3.138.59:6789/0,GGZ-YG-S0311-PLATFORM-138=10.3.138.36:6789/0
}
election epoch 28, quorum 0,1,2,3,4
GGZ-YG-S0311-PLATFORM-138,1,2,3,4
 osdmap e705: 20 osds: 20 up, 20 in
  pgmap v1961: 666 pgs, 1 pools, 0 bytes data, 0 objects
13223 MB used, 20861 GB / 21991 GB avail
 666 active+undersized+degraded

There is only one pool and its size is 3, so according to the formula,
(20 * 100) / 3 = 666 PGs should be reasonable.

I updated health detail and also attached a pg query result on gist(
https://gist.github.com/dotSlashLu/22623b4cefa06a46e0d4).
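
For reference, the kind of commands used to gather this information (the PG id below is just an example):

ceph health detail
ceph pg dump_stuck unclean
ceph pg 0.1 query
ceph osd crush rule dump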

On Wed, 23 Mar 2016 at 09:01 Dotslash Lu  wrote:

> Hello Gonçalo,
>
> Thanks for the reminder. I was just setting up the cluster for testing, so
> don't worry, I can simply remove the pool. And I've learnt that since the
> replica count and the number of pools both factor into pg_num, I'll consider
> them carefully before deploying any data.
>
> On Mar 23, 2016, at 6:58 AM, Goncalo Borges 
> wrote:
>
> Hi Zhang...
>
> If I can add some more info, changing the number of PGs is a heavy operation,
> and as far as I know, you should NEVER decrease PGs. From the notes in pgcalc (
> http://ceph.com/pgcalc/):
>
> "It's also important to know that the PG count can be increased, but NEVER
> decreased without destroying / recreating the pool. However, increasing the
> PG Count of a pool is one of the most impactful events in a Ceph Cluster,
> and should be avoided for production clusters if possible."
>
> So, in your case, I would consider adding more OSDs.
>
> Cheers
> Goncalo
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help for PG problem

2016-03-23 Thread Zhang Qiang
Yes, it was the CRUSH map. I updated it to distribute the 20 OSDs across the
2 hosts correctly, and now all PGs are healthy.
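
For reference, one way to make that change is the usual decompile/edit/recompile cycle, roughly:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt so each host bucket contains its own 10 OSDs
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new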

Thanks guys, I really appreciate your help!

On Thu, 24 Mar 2016 at 07:25 Goncalo Borges <goncalo.bor...@sydney.edu.au>
wrote:

> Hi Zhang...
>
> I think you are dealing with two different problems.
>
> The first problem refers to the number of PGs per OSD. That was already
> discussed, and there are no more messages concerning it.
>
> The second problem you are experiencing seems to be that all your OSDs are
> under the same host. Besides that, osd.0 appears twice, in two different
> hosts (I do not really know why that is happening). If you are using the
> default CRUSH rules, Ceph is not able to replicate objects (even with size
> 2) across two different hosts because all your OSDs are on just one host.
>
> Cheers
> Goncalo
>
> --
> *From:* Zhang Qiang [dotslash...@gmail.com]
> *Sent:* 23 March 2016 23:17
> *To:* Goncalo Borges
> *Cc:* Oliver Dzombic; ceph-users
> *Subject:* Re: [ceph-users] Need help for PG problem
>
> And here's the osd tree if it matters.
>
> ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 22.39984 root default
> -2 21.39984 host 10
>  0  1.06999 osd.0up  1.0  1.0
>  1  1.06999 osd.1up  1.0  1.0
>  2  1.06999 osd.2up  1.0  1.0
>  3  1.06999 osd.3up  1.0  1.0
>  4  1.06999 osd.4up  1.0  1.0
>  5  1.06999 osd.5up  1.0  1.0
>  6  1.06999 osd.6up  1.0  1.0
>  7  1.06999 osd.7up  1.0  1.0
>  8  1.06999 osd.8up  1.0  1.0
>  9  1.06999 osd.9up  1.0  1.0
> 10  1.06999 osd.10   up  1.0  1.0
> 11  1.06999 osd.11   up  1.0  1.0
> 12  1.06999 osd.12   up  1.0  1.0
> 13  1.06999 osd.13   up  1.0  1.0
> 14  1.06999 osd.14   up  1.0  1.0
> 15  1.06999 osd.15   up  1.0  1.0
> 16  1.06999 osd.16   up  1.0  1.0
> 17  1.06999 osd.17   up  1.0  1.0
> 18  1.06999 osd.18   up  1.0  1.0
> 19  1.06999 osd.19   up  1.0  1.0
> -3  1.0 host 148_96
>  0  1.0 osd.0up  1.0  1.0
>
> On Wed, 23 Mar 2016 at 19:10 Zhang Qiang <dotslash...@gmail.com>
> wrote:
>
>> Oliver, Goncalo,
>>
>> Sorry to disturb again, but recreating the pool with a smaller pg_num
>> didn't seem to work, now all 666 pgs are degraded + undersized.
>>
>> New status:
>> cluster d2a69513-ad8e-4b25-8f10-69c4041d624d
>>  health HEALTH_WARN
>> 666 pgs degraded
>> 82 pgs stuck unclean
>> 666 pgs undersized
>>  monmap e5: 5 mons at {1=
>> 10.3.138.37:6789/0,2=10.3.138.39:6789/0,3=10.3.138.40:6789/0,4=10.3.138.59:6789/0,GGZ-YG-S0311-PLATFORM-138=10.3.138.36:6789/0
>> }
>> election epoch 28, quorum 0,1,2,3,4
>> GGZ-YG-S0311-PLATFORM-138,1,2,3,4
>>  osdmap e705: 20 osds: 20 up, 20 in
>>   pgmap v1961: 666 pgs, 1 pools, 0 bytes data, 0 objects
>> 13223 MB used, 20861 GB / 21991 GB avail
>>  666 active+undersized+degraded
>>
>> Only one pool and its size is 3. So I think according to the algorithm,
>> (20 * 100) / 3 = 666 pgs is reasonable.
>>
>> I updated health detail and also attached a pg query result on gist(
>> https://gist.github.com/dotSlashLu/22623b4cefa06a46e0d4
>> ).
>>
>> On Wed, 23 Mar 2016 at 09:01 Dotslash Lu <dotslash...@gmail.com>
>> wrote:
>>

Re: [ceph-users] Ceph-fuse huge performance gap between different block sizes

2016-03-25 Thread Zhang Qiang
Hi Christian, thanks for your reply. Here are the test specs:
>>>
[global]
ioengine=libaio
runtime=90
direct=1
group_reporting
iodepth=16
ramp_time=5
size=1G

[seq_w_4k_20]
bs=4k
filename=seq_w_4k_20
rw=write
numjobs=20

[seq_w_1m_20]
bs=1m
filename=seq_w_1m_20
rw=write
numjobs=20
<<<<
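
The job file above was run with plain fio, something like the line below (the file name is just what it was saved as, run from a directory inside the ceph-fuse mount since the filename= paths are relative):

fio cephfs_write.fio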

Test results: 4k -  aggrb=13245KB/s, 1m - aggrb=1102.6MB/s

Mount options:  ceph-fuse /ceph -m 10.3.138.36:6789

Ceph configurations:
>>>>
filestore_xattr_use_omap = true
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 128
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 512
osd pool default pgp num = 512
osd crush chooseleaf type = 1
<<<<

Other configurations are all default.

Status:
 health HEALTH_OK
 monmap e5: 5 mons at {1=
10.3.138.37:6789/0,2=10.3.138.39:6789/0,3=10.3.138.40:6789/0,4=10.3.138.59:6789/0,GGZ-YG-S0311-PLATFORM-138=10.3.138.36:6789/0
}
election epoch 28, quorum 0,1,2,3,4
GGZ-YG-S0311-PLATFORM-138,1,2,3,4
 mdsmap e55: 1/1/1 up {0=1=up:active}
 osdmap e1290: 20 osds: 20 up, 20 in
  pgmap v7180: 1000 pgs, 2 pools, 14925 MB data, 3851 objects
37827 MB used, 20837 GB / 21991 GB avail
1000 active+clean

On Fri, 25 Mar 2016 at 16:44 Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Fri, 25 Mar 2016 08:11:27 + Zhang Qiang wrote:
>
> > Hi all,
> >
> > According to fio,
> Exact fio command please.
>
> >with 4k block size, the sequence write performance of
> > my ceph-fuse mount
>
> Exact mount options, ceph config (RBD cache) please.
>
> >is just about 20+ M/s, only 200 Mb of 1 Gb full
> > duplex NIC outgoing bandwidth was used for maximum. But for 1M block
> > size the performance could achieve as high as 1000 M/s, approaching the
> > limit of the NIC bandwidth. Why do the performance stats differ so much
> > for different block sizes?
> That's exactly why.
> You can see that with locally attached storage as well: many small requests
> are slower than large (essentially sequential) writes.
> Network attached storage in general (latency) and thus Ceph as well (plus
> code overhead) amplify that.
>
> >Can I configure ceph-fuse mount's block size
> > for maximum performance?
> >
> Very little to do with that if you're using sync writes (hence the request
> for the fio command line); if not, RBD cache could/should help.
>
> Christian
>
> > Basic information about the cluster: 20 OSDs on separate PCIe hard disks
> > distributed across 2 servers, each with write performance about 300 M/s;
> > 5 MONs; 1 MDS. Ceph version 0.94.6
> > (e832001feaf8c176593e0325c8298e3f16dfb403).
> >
> > Thanks :)
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-fuse huge performance gap between different block sizes

2016-03-25 Thread Zhang Qiang
Hi all,

According to fio, with a 4k block size the sequential write performance of my
ceph-fuse mount is just about 20+ M/s; at most only 200 Mb of the 1 Gb
full-duplex NIC's outgoing bandwidth was used. But with a 1M block size the
performance can reach as high as 1000 M/s, approaching the limit of the NIC
bandwidth. Why do the performance stats differ so much for different block
sizes? Can I configure the ceph-fuse mount's block size for maximum
performance?

Basic information about the cluster: 20 OSDs on separate PCIe hard disks
distributed across 2 servers, each with write performance about 300 M/s; 5
MONs; 1 MDS. Ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403).

Thanks :)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help for PG problem

2016-03-23 Thread Zhang Qiang
I adjusted the crush map, everything's OK now. Thanks for your help!

On Wed, 23 Mar 2016 at 23:13 Matt Conner <matt.con...@keepertech.com> wrote:

> Hi Zhang,
>
> In a 2 copy pool, each placement group is spread across 2 OSDs - that is
> why you see such a high number of placement groups per OSD. There is a PG
> calculator at http://ceph.com/pgcalc/. Based on your setup, it may be
> worth using 2048 instead of 4096.
>
> As for stuck/degraded PGs, most are reporting as being on osd.0. Looking
> at your OSD Tree, you somehow have 21 OSDs being reported with 2 being
> labeled as osd.0; both up and in. I'd recommend trying to get rid of the
> one listed on host 148_96 and see if it clears the issues.
>
>
>
> On Tue, Mar 22, 2016 at 6:28 AM, Zhang Qiang <dotslash...@gmail.com>
> wrote:
>
>> Hi Reddy,
>> It's over a thousand lines, I pasted it on gist:
>> https://gist.github.com/dotSlashLu/22623b4cefa06a46e0d4
>>
>> On Tue, 22 Mar 2016 at 18:15 M Ranga Swami Reddy <swamire...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> Can you please share the "ceph health detail" output?
>>>
>>> Thanks
>>> Swami
>>>
>>> On Tue, Mar 22, 2016 at 3:32 PM, Zhang Qiang <dotslash...@gmail.com>
>>> wrote:
>>> > Hi all,
>>> >
>>> > I have 20 OSDs and 1 pool, and, as recommended by the
>>> > doc(
>>> http://docs.ceph.com/docs/master/rados/operations/placement-groups/), I
>>> > configured pg_num and pgp_num to 4096, size 2, min size 1.
>>> >
>>> > But ceph -s shows:
>>> >
>>> > HEALTH_WARN
>>> > 534 pgs degraded
>>> > 551 pgs stuck unclean
>>> > 534 pgs undersized
>>> > too many PGs per OSD (382 > max 300)
>>> >
>>> > Why doesn't the recommended value, 4096, for 10 ~ 50 OSDs work? And what
>>> > does "too many PGs per OSD (382 > max 300)" mean? If each OSD had 382
>>> > PGs, I would have 7640 PGs in total.
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] yum installed jewel doesn't provide systemd scripts

2016-05-02 Thread Zhang Qiang
I installed jewel el7 via yum on CentOS 7.1, but it seems no systemd unit
files are available. Yet I do see a directory named 'systemd' in
the source, so maybe it wasn't built into the package?
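
For reference, this is roughly how to check which package ships the unit files (package names may differ depending on how jewel was split):

rpm -ql ceph ceph-base ceph-osd ceph-mon 2>/dev/null | grep systemd
systemctl list-unit-files | grep ceph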
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] All OSDs are down with addr :/0

2016-05-08 Thread Zhang Qiang
Hi, I need help deploying jewel OSDs on CentOS 7.

Following the guide, I have successfully started the OSD daemons, but all of
them are down according to `ceph -s`: 15/15 in osds are down

There are no errors in /var/log/ceph/ceph-osd.1.log; it just stopped at these
lines and never made progress:
2016-05-09 01:32:03.187802 7f35acb4a700  0 osd.0 100 crush map has features
2200130813952, adjusting msgr requires for clients
2016-05-09 01:32:03.187841 7f35acb4a700  0 osd.0 100 crush map has features
2200130813952 was 2199057080833, adjusting msgr requires for mons
2016-05-09 01:32:03.187859 7f35acb4a700  0 osd.0 100 crush map has features
2200130813952, adjusting msgr requires for osds

`ceph health detail` shows:
osd.0 is down since epoch 0, last address :/0

Why is the address :/0? Am I configuring something wrong? I've followed the
OSD troubleshooting guide but with no luck.
The network seems fine, since the ports are reachable via telnet and I can run
ceph -s on the OSD machine.
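
For reference, roughly how to see what address each OSD registered (osd.0 as an example; the second command is run on the OSD host itself):

ceph osd dump | grep "^osd"
ceph daemon osd.0 config show | grep -E "public_addr|cluster_addr"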

ceph.conf:
[global]
fsid = fad5f8d4-f5f6-425d-b035-a018614c0664

mon osd full ratio = .75
mon osd nearfull ratio = .65

auth cluster required = cephx
auth service requried = cephx
auth client required = cephx
mon initial members = mon_vm_1,mon_vm_2,mon_vm_3
mon host = 10.3.1.94,10.3.1.95,10.3.1.96

[mon.a]
host = mon_vm_1
mon addr = 10.3.1.94

[mon.b]
host = mon_vm_2
mon addr = 10.3.1.95

[mon.c]
host = mon_vm_3
mon addr = 10.3.1.96

[osd]
osd journal size = 10240
osd pool default size = 3
osd pool default min size = 2
osd pool default pg num = 512
osd pool default pgp num = 512
osd crush chooseleaf type = 1
osd journal = /ceph_journal/$id
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94 OSD crashes

2016-10-24 Thread Zhang Qiang
Thanks Wang, it looks like that's the case, so Ceph isn't to blame :)
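
For reference, roughly the checks for a failing disk (the device name is a placeholder for the disk backing that OSD):

dmesg | grep -iE "i/o error|medium error"
smartctl -a /dev/sdX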

On 25 October 2016 at 09:59, Haomai Wang <hao...@xsky.com> wrote:

> could you check dmesg? I think there exists disk EIO error
>
> On Tue, Oct 25, 2016 at 9:58 AM, Zhang Qiang <dotslash...@gmail.com>
> wrote:
>
>> Hi,
>>
>> One of several OSDs on the same machine crashed several times within
>> days. It's always that one, other OSDs are all fine. Below is the dumped
>> message, since it's too long here, I only pasted the head and tail of the
>> recent events. If it's necessary to inspect the full log, please see
>> https://gist.github.com/dotSlashLu/3e8ca9491fbf07636a4583244ac23f80.
>>
>> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function
>> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t,
>> ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24
>> 18:52:06.213123
>> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio
>> || got != -5)
>>
>>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0xbc9195]
>>  2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
>> long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
>>  3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int,
>> ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
>>  4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
>> std::allocator > const&, bool, unsigned int,
>> ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
>>  5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
>> unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
>>  6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2)
>> [0x7df722]
>>  7: (OSD::RepScrubWQ::_process(MOSDRepScrub*,
>> ThreadPool::TPHandle&)+0xbe) [0x6dcade]
>>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
>>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
>>  10: (()+0x7dc5) [0x7f309cd26dc5]
>>  11: (clone()+0x6d) [0x7f309b80821d]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
>> to interpret this.
>>
>> --- begin dump of recent events ---
>> -1> 2016-10-24 18:51:34.341035 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.56:6821/4808 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x175a2c00 con 0x1526a940
>>  -> 2016-10-24 18:51:34.341046 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.61:6817/4808 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x175a3600 con 0x15269fa0
>>  -9998> 2016-10-24 18:51:34.341058 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.56:6823/5402 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x12aaa400 con 0x27bc9080
>>  -9997> 2016-10-24 18:51:34.341069 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.61:6821/5402 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x1f89ec00 con 0x27bc91e0
>>  -9996> 2016-10-24 18:51:34.341080 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.56:6824/6216 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0xaa16000 con 0x175b0c00
>>  -9995> 2016-10-24 18:51:34.341090 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.61:6818/6216 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x23b87800 con 0x175ae160
>>  -9994> 2016-10-24 18:51:34.341101 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.57:6802/23367 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x258ed400 con 0x17500d60
>>  -9993> 2016-10-24 18:51:34.341113 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.62:6806/23367 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x242bb000 con 0x175019c0
>>  -9992> 2016-10-24 18:51:34.341128 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.57:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x28e41c00 con 0x1744aec0
>>  -9991> 2016-10-24 18:51:34.341139 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.62:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x10be5200 con 0x175bf8c0
>>  -9990> 2016-10-24 18:51:34.341130 7f3088a48700  1 -- 10.3.149.62:0/25857
>> <== osd.1 10.3.149.55:6835/2010188 187557  osd_ping(ping_reply e301

[ceph-users] v0.94 OSD crashes

2016-10-24 Thread Zhang Qiang
Hi,

One of several OSDs on the same machine has crashed several times within days.
It's always that same one; the other OSDs are all fine. Below is the dumped
message. Since it's too long, I only pasted the head and tail of the recent
events here. If it's necessary to inspect the full log, please see
https://gist.github.com/dotSlashLu/3e8ca9491fbf07636a4583244ac23f80.

2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t,
ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24
18:52:06.213123
os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio ||
got != -5)

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbc9195]
 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int,
ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
std::allocator > const&, bool, unsigned int,
ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2)
[0x7df722]
 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe)
[0x6dcade]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
 10: (()+0x7dc5) [0x7f309cd26dc5]
 11: (clone()+0x6d) [0x7f309b80821d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- begin dump of recent events ---
-1> 2016-10-24 18:51:34.341035 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.56:6821/4808 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x175a2c00 con 0x1526a940
 -> 2016-10-24 18:51:34.341046 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.61:6817/4808 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x175a3600 con 0x15269fa0
 -9998> 2016-10-24 18:51:34.341058 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.56:6823/5402 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x12aaa400 con 0x27bc9080
 -9997> 2016-10-24 18:51:34.341069 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.61:6821/5402 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x1f89ec00 con 0x27bc91e0
 -9996> 2016-10-24 18:51:34.341080 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.56:6824/6216 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0xaa16000 con 0x175b0c00
 -9995> 2016-10-24 18:51:34.341090 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.61:6818/6216 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x23b87800 con 0x175ae160
 -9994> 2016-10-24 18:51:34.341101 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.57:6802/23367 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x258ed400 con 0x17500d60
 -9993> 2016-10-24 18:51:34.341113 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.62:6806/23367 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x242bb000 con 0x175019c0
 -9992> 2016-10-24 18:51:34.341128 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.57:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x28e41c00 con 0x1744aec0
 -9991> 2016-10-24 18:51:34.341139 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.62:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x10be5200 con 0x175bf8c0
 -9990> 2016-10-24 18:51:34.341130 7f3088a48700  1 -- 10.3.149.62:0/25857
<== osd.1 10.3.149.55:6835/2010188 187557  osd_ping(ping_reply e3014
stamp 2016-10-24 18:51:34.340550) v2  47+0+0 (1550182756 0 0)
0x1a83bc00 con 0x7874580
 -9989> 2016-10-24 18:51:34.341151 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.57:6814/26469 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x1f48aa00 con 0x175bfa20
 -9988> 2016-10-24 18:51:34.341162 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.62:6811/26469 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x24456e00 con 0x175bfb80
 -9987> 2016-10-24 18:51:34.341174 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.58:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x25c59e00 con 0x7874f20
 -9986> 2016-10-24 18:51:34.341186 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.63:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x19703c00 con 0x7875760
 -9985> 2016-10-24 18:51:34.341208 7f307b22d700  1 -- 10.3.149.62:0/25857
--> 10.3.149.58:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24
18:51:34.340550) v2 -- ?+0 0x19702600 con 0x26444940
 

[ceph-users] Behavior of ceph-fuse when network is down

2017-11-24 Thread Zhang Qiang
Hi all,

To observe what happens to a ceph-fuse mount when the network is down, we
blocked network connections to all three monitors with iptables. If we restore
the network immediately (within minutes), the blocked I/O requests complete and
everything goes back to normal.

But if we keep blocking long enough, say twenty minutes, ceph-fuse is not able
to recover. The ceph-fuse process is still there, but it can no longer handle
I/O operations; df or ls will hang indefinitely.

What is the retry policy of ceph-fuse? Is it normal for ceph-fuse to hang after
the network has been blocked for that long? If so, how can I make it return to
normal once the network is recovered? If it is not normal, what might be the
cause? How can I help to debug this?

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Behavior of ceph-fuse when network is down

2017-11-24 Thread Zhang Qiang
Thanks! I'll check it out.
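
For reference, my reading of the suggestion as a rough sketch (the asok path is a placeholder for the client's actual admin socket):

# force the stale client session to reconnect
ceph --admin-daemon /var/run/ceph/ceph-client.admin.<pid>.asok kick_stale_sessions

# or make reconnection automatic via ceph.conf
[client]
    client_reconnect_stale = true
[mds]
    mds_session_blacklist_on_timeout = false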

On 24 November 2017 at 17:58, "Yan, Zheng" <uker...@gmail.com> wrote:

> On Fri, Nov 24, 2017 at 4:59 PM, Zhang Qiang <dotslash...@gmail.com>
> wrote:
> > Hi all,
> >
> > To observe what will happen to ceph-fuse mount if the network is down, we
> > blocked
> > network connections to all three monitors by iptables. If we restore the
> > network
> > immediately(within minutes), the blocked I/O request will be restored,
> every
> > thing will
> > be back to normal.
> >
> > But if we continue to block it long enough, say twenty minutes, ceph-fuse
> > will not be
> > able to restore. The ceph-fuse process is still there, but will not be
> able
> > to handle I/O
> > operations, df or ls will hang indefinitely.
> >
> > What is the retry policy of ceph-fuse? Is it normal for ceph-fuse to hang
> > after the
> > network blocking? If so, how can I make it restore to normal after the
> > network is
> > recovered? If it is not normal, what might be the cause? How can I help
> to
> > debug this?
>
> You can use the 'kick_stale_sessions' ASOK command to make ceph-fuse
> reconnect, or set the 'client_reconnect_stale' config option to true.
> In addition, you need to set the MDS config option
> 'mds_session_blacklist_on_timeout' to false.
>
> >
> > Thanks.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-volume created filestore journal bad header magic

2018-05-29 Thread Zhang Qiang
Hi all,

I'm new to Luminous. When I use ceph-volume create to add a new
filestore OSD, it tells me that the journal's header magic is not
good, even though the journal device is a brand-new LV. How can I make it
write the new OSD's header to the journal?

The error doesn't seem to affect the creation or startup of the OSD,
but it complains about the bad header magic in the log every
time the OSD boots.

journal _open /var/lib/ceph/osd/ceph-1/journal fd 30: 21474836480
bytes, block size 4096 bytes, directio = 1, aio = 1
journal do_read_entry(3922624512): bad header magic
journal do_read_entry(3922624512): bad header magic
journal _open /var/lib/ceph/osd/ceph-1/journal fd 30: 21474836480
bytes, block size 4096 bytes, directio = 1, aio = 1

Should I care about this? Is the OSD using the journal with the bad header
magic normally?
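
For reference, if the header really needs to be rewritten, the usual sequence would be roughly the one below (a sketch, assuming osd.1 and that the OSD can be taken down briefly):

systemctl stop ceph-osd@1
ceph-osd -i 1 --flush-journal
ceph-osd -i 1 --mkjournal
systemctl start ceph-osd@1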
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FS Reclaims storage too slow

2018-06-25 Thread Zhang Qiang
Hi,

Is it normal that a day after I deleted files from CephFS, Ceph still hadn't
deleted the backing objects? Only after I restarted the MDS daemon did it
start to release the storage space.

I noticed the doc (http://docs.ceph.com/docs/mimic/dev/delayed-delete/)
says the file is marked as deleted on the MDS and deleted lazily.
What condition triggers the deletion of the backing objects? If a delay this
long is normal, is there any way to make it faster? The cluster is nearly
full.

I'm using jewel 10.2.3 both for ceph-fuse and mds.
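
For reference, the pending deletions can be watched via the MDS stray counters, roughly like this (the MDS name is a placeholder):

ceph daemon mds.<name> perf dump | grep -i stray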

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can't get MDS running after a power outage

2018-03-29 Thread Zhang Qiang
Hi,

Ceph version 10.2.3. After a power outage, I tried to start the MDS
daemons, but they were stuck replaying journals forever. I had no idea why
they were taking that long, because this is just a small cluster for
testing purposes with only a few hundred MB of data. I restarted them, and
encountered the error below.

Any chance I can restore them?

Mar 28 14:20:30 node01 systemd: Started Ceph metadata server daemon.
Mar 28 14:20:30 node01 systemd: Starting Ceph metadata server daemon...
Mar 28 14:20:30 node01 ceph-mds: 2018-03-28 14:20:30.796255
7f0150c8c180 -1 deprecation warning: MDS id 'mds.0' is invalid and
will be forbidden in a future version.  MDS names may not start with a
numeric digit.
Mar 28 14:20:30 node01 ceph-mds: starting mds.0 at :/0
Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: In function 'const
entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f014ac6c700 time
2018-03-28 14:20:30.942480
Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: 582: FAILED assert(up.count(m))
Mar 28 14:20:30 node01 ceph-mds: ceph version 10.2.3
(ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
Mar 28 14:20:30 node01 ceph-mds: 1: (ceph::__ceph_assert_fail(char
const*, char const*, int, char const*)+0x85) [0x7f01512aba45]
Mar 28 14:20:30 node01 ceph-mds: 2: (MDSMap::get_inst(int)+0x20f)
[0x7f0150ee5e3f]
Mar 28 14:20:30 node01 ceph-mds: 3:
(MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x7b9)
[0x7f0150ed6e49]
Mar 28 14:20:30 node01 ceph-mds: 4:
(MDSDaemon::handle_mds_map(MMDSMap*)+0xe3d) [0x7f0150eb396d]
Mar 28 14:20:30 node01 ceph-mds: 5:
(MDSDaemon::handle_core_message(Message*)+0x7b3) [0x7f0150eb4eb3]
Mar 28 14:20:30 node01 ceph-mds: 6:
(MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f0150eb514b]
Mar 28 14:20:30 node01 ceph-mds: 7: (DispatchQueue::entry()+0x78a)
[0x7f01513ad4aa]
Mar 28 14:20:30 node01 ceph-mds: 8:
(DispatchQueue::DispatchThread::entry()+0xd) [0x7f015129098d]
Mar 28 14:20:30 node01 ceph-mds: 9: (()+0x7dc5) [0x7f0150095dc5]
Mar 28 14:20:30 node01 ceph-mds: 10: (clone()+0x6d) [0x7f014eb61ced]
Mar 28 14:20:30 node01 ceph-mds: NOTE: a copy of the executable, or
`objdump -rdS ` is needed to interpret this.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-fuse segfaults

2018-04-02 Thread Zhang Qiang
 Hi,

I'm using ceph-fuse 10.2.3 on CentOS 7.3.1611. ceph-fuse always
segfaults after running for some time.

*** Caught signal (Segmentation fault) **
 in thread 7f455d832700 thread_name:ceph-fuse
 ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
 1: (()+0x2a442a) [0x7f457208e42a]
 2: (()+0xf5e0) [0x7f4570b895e0]
 3: (Client::get_root_ino()+0x10) [0x7f4571f86a20]
 4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x18d)
[0x7f4571f844bd]
 5: (()+0x19ae21) [0x7f4571f84e21]
 6: (()+0x164b5) [0x7f457199e4b5]
 7: (()+0x16bdb) [0x7f457199ebdb]
 8: (()+0x13471) [0x7f457199b471]
 9: (()+0x7e25) [0x7f4570b81e25]
 10: (clone()+0x6d) [0x7f456fa6934d]

Detailed events dump:
https://drive.google.com/file/d/0B_4ESJRu7BZFcHZmdkYtVG5CTGQ3UVFod0NxQloxS0ZCZmQ0/view?usp=sharing
Let me know if more info is needed.

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse segfaults

2018-04-02 Thread Zhang Qiang
Thanks Patrick,
I should have checked the tracker first.
I'll try the kernel client and an upgrade to see if that resolves it.
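
For reference, the kernel client mount I plan to try looks roughly like this (monitor address, mount point and secret file path are placeholders):

mount -t ceph <mon-ip>:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret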

On 2 April 2018 at 22:29, Patrick Donnelly <pdonn...@redhat.com> wrote:
> Probably fixed by this: http://tracker.ceph.com/issues/17206
>
> You need to upgrade your version of ceph-fuse.
>
> On Mon, Apr 2, 2018 at 12:56 AM, Zhang Qiang <dotslash...@gmail.com> wrote:
>> Hi,
>>
>> I'm using ceph-fuse 10.2.3 on CentOS 7.3.1611. ceph-fuse always
>> segfaults after running for some time.
>>
>> *** Caught signal (Segmentation fault) **
>>  in thread 7f455d832700 thread_name:ceph-fuse
>>  ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>>  1: (()+0x2a442a) [0x7f457208e42a]
>>  2: (()+0xf5e0) [0x7f4570b895e0]
>>  3: (Client::get_root_ino()+0x10) [0x7f4571f86a20]
>>  4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x18d)
>> [0x7f4571f844bd]
>>  5: (()+0x19ae21) [0x7f4571f84e21]
>>  6: (()+0x164b5) [0x7f457199e4b5]
>>  7: (()+0x16bdb) [0x7f457199ebdb]
>>  8: (()+0x13471) [0x7f457199b471]
>>  9: (()+0x7e25) [0x7f4570b81e25]
>>  10: (clone()+0x6d) [0x7f456fa6934d]
>>
>> Detailed events dump:
>> https://drive.google.com/file/d/0B_4ESJRu7BZFcHZmdkYtVG5CTGQ3UVFod0NxQloxS0ZCZmQ0/view?usp=sharing
>> Let me know if more info is needed.
>>
>> Thanks.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com