[ceph-users] Re: Can't start ceph-mon through systemctl start ceph-mon@.service after upgrading from Hammer to Jewel

2017-06-21 Thread 许雪寒
I set mon_data to “/home/ceph/software/ceph/var/lib/ceph/mon”, and its owner 
has always been “ceph” since we were running Hammer.
I also tried setting its permissions to “777”, but that didn’t work either.
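
For reference, a few commands that can help compare the systemd environment with 
the manual invocation (a sketch; the unit instance ceph1 and the mon_data path 
are the ones mentioned in this thread):

systemctl cat ceph-mon@ceph1        # the unit file systemd actually runs, including ExecStart and setuser
journalctl -u ceph-mon@ceph1 -n 50  # the daemon's own error output from the failed starts
sudo -u ceph ls -ld /home/ceph/software/ceph/var/lib/ceph/mon   # can the ceph user traverse the path at all?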

From: Linh Vu [mailto:v...@unimelb.edu.au]
Sent: 22 June 2017 14:26
To: 许雪寒; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Can't start ceph-mon through systemctl start 
ceph-mon@.service after upgrading from Hammer to Jewel


Permissions of your mon data directory under /var/lib/ceph/mon/ might have 
changed as part of Hammer -> Jewel upgrade. Have you had a look there?


From: ceph-users <ceph-users-boun...@lists.ceph.com> 
on behalf of 许雪寒 <xuxue...@360.cn>
Sent: Thursday, 22 June 2017 3:32:45 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Can't start ceph-mon through systemctl start 
ceph-mon@.service after upgrading from Hammer to Jewel

Hi, everyone.

I upgraded one of our ceph clusters from Hammer to Jewel. After upgrading, I 
can’t start ceph-mon through “systemctl start ceph-mon@ceph1”, while, on the 
other hand, I can start ceph-mon, either as user ceph or root, if I directly 
call “/usr/bin/ceph-mon --cluster ceph --id ceph1 --setuser ceph --setgroup 
ceph”. I looked at “/var/log/messages” and found that the reason systemctl can’t 
start ceph-mon is that ceph-mon can’t access its configured data directory. Why 
can’t ceph-mon access its data directory when it is called by systemctl?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't start ceph-mon through systemctl start ceph-mon@.service after upgrading from Hammer to Jewel

2017-06-21 Thread Linh Vu
Permissions of your mon data directory under /var/lib/ceph/mon/ might have 
changed as part of Hammer -> Jewel upgrade. Have you had a look there?
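
For what it's worth, a minimal sketch of checking (and, if needed, fixing) the 
ownership under the default location; the actual mon_data path may differ:

ls -ln /var/lib/ceph/mon/                         # numeric uid/gid of the mon data dirs
chown -R ceph:ceph /var/lib/ceph/mon/ceph-ceph1   # only if the owner is not already ceph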


From: ceph-users  on behalf of 许雪寒 

Sent: Thursday, 22 June 2017 3:32:45 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Can't start ceph-mon through systemctl start 
ceph-mon@.service after upgrading from Hammer to Jewel

Hi, everyone.

I upgraded one of our ceph clusters from Hammer to Jewel. After upgrading, I 
can’t start ceph-mon through “systemctl start ceph-mon@ceph1”, while, on the 
other hand, I can start ceph-mon, either as user ceph or root, if I directly 
call “/usr/bin/ceph-mon --cluster ceph --id ceph1 --setuser ceph --setgroup ceph”. 
I looked at “/var/log/messages” and found that the reason systemctl can’t start 
ceph-mon is that ceph-mon can’t access its configured data directory. Why 
can’t ceph-mon access its data directory when it is called by systemctl?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSD OSD's Dual Use

2017-06-21 Thread Ashley Merrick
Hello,


Currently I have a pool of SSDs running as a cache in front of an EC pool.

The cache is very underused and the SSDs spend most of their time idle. I would 
like to create a small SSD pool for a selection of very small RBD disks used as 
scratch disks within the OS. Should I expect any issues running the two pools 
(cache + RBD data) on the same set of SSDs?
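
(For reference, reusing the existing SSD CRUSH rule for a second small pool is 
straightforward; a sketch assuming the rule is named "ssd", with arbitrary pg 
counts and image names:)

ceph osd pool create ssd-scratch 64 64 replicated ssd   # second pool on the same SSD rule
rbd create ssd-scratch/scratch01 --size 10G             # small scratch image for a guest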


,Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can't start ceph-mon through systemctl start ceph-mon@.service after upgrading from Hammer to Jewel

2017-06-21 Thread 许雪寒
Hi, everyone.

I upgraded one of our ceph clusters from Hammer to Jewel. After upgrading, I 
can’t start ceph-mon through “systemctl start ceph-mon@ceph1”, while, on the 
other hand, I can start ceph-mon, either as user ceph or root, if I directly 
call “/usr/bin/ceph-mon --cluster ceph --id ceph1 --setuser ceph --setgroup 
ceph”. I looked at “/var/log/messages” and found that the reason systemctl can’t 
start ceph-mon is that ceph-mon can’t access its configured data directory. Why 
can’t ceph-mon access its data directory when it is called by systemctl?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel RBD client talking to multiple storage clusters

2017-06-21 Thread Alex Gorbachev
On Mon, Jun 19, 2017 at 3:12 AM Wido den Hollander  wrote:

>
> > Op 19 juni 2017 om 5:15 schreef Alex Gorbachev :
> >
> >
> > Has anyone run into such config where a single client consumes storage
> from
> > several ceph clusters, unrelated to each other (different MONs and OSDs,
> > and keys)?
> >
>
> Should be possible, you can simply supply a different ceph.conf using the
> "-c" flag for the 'rbd' command and thus point to a different cluster.


Oh, and use --keyring to specify the right one. Thanks.  Will test shortly.
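
(For reference, a minimal sketch of the per-cluster invocation; the clusterB file
names are assumptions:)

rbd -c /etc/ceph/clusterB.conf --keyring /etc/ceph/clusterB.client.admin.keyring ls rbd
# or, with both files named after the cluster, the shorthand:
rbd --cluster clusterB ls rbd
rbd --cluster clusterB map rbd/someimage   # kernel client mapping against the second cluster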

Alex


>
> Wido
>
> > We have a Hammer and a Jewel cluster now, and this may be a way to have
> > very clean migrations.
> >
> > Best regards,
> > Alex
> > Storcium
> > --
> > --
> > Alex Gorbachev
> > Storcium
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals

2017-06-21 Thread Christian Balzer

Hello,

Hmm, gmail client not grokking quoting these days?

On Wed, 21 Jun 2017 20:40:48 -0500 Brady Deetz wrote:

> On Jun 21, 2017 8:15 PM, "Christian Balzer"  wrote:
> 
> On Wed, 21 Jun 2017 19:44:08 -0500 Brady Deetz wrote:
> 
> > Hello,
> > I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I have  
> 12
> > osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe
> > drives providing 10GB journals for groups of 12 6TB spinning rust drives
> > and 2x lacp 40gbps ethernet.
> >
> > Our hardware provider is recommending that we start deploying P4600 drives
> > in place of our P3700s due to availability.
> >  
> Welcome to the club and make sure to express your displeasure about
> Intel's "strategy" to your vendor.
> 
> The P4600s are a poor replacement for P3700s and also still just
> "announced" according to ARK.
> 
> Are you happy with your current NVMes?
> Firstly as in, what is their wearout, are you expecting them to easily
> survive 5 years at the current rate?
> Secondly, how about speed? with 12 HDDs and 1GB/s write capacity of the
> NVMe I'd expect them to not be a bottleneck in nearly all real life
> situations.
> 
> Keep in mind that 1.6TB P4600 is going to last about as long as your 400GB
> P3700, so if wear-out is a concern, don't put more stress on them.
> 
> 
> Oddly enough, the Intel tools are telling me that we've only used about 10%
> of each drive's endurance over the past year. This honestly surprises me
> due to our workload, but maybe I'm thinking my researchers are doing more
> science than they actually are.
>
That's pretty impressive still, but also lets you do numbers as to what
kind of additional load you _may_ be able to consider, obviously not more
than twice the current amount to stay within 5 years before wearing
them out.
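
(For reference, a way to read the wear counter without the vendor tool; a sketch
assuming smartmontools with NVMe support:)

smartctl -a /dev/nvme0 | grep -i 'percentage used'   # NVMe health-log wear-out indicator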

 
> 
> Also the P4600 is only slightly faster in writes than the P3700, so that's
> where putting more workload onto them is going to be a notable issue.
> 
> > I've seen some talk on here regarding this, but wanted to throw an idea
> > around. I was okay throwing away 280GB of fast capacity for the purpose of
> > providing reliable journals. But with as much free capacity as we'd have
> > with a 4600, maybe I could use that extra capacity as a cache tier for
> > writes on an rbd ec pool. If I wanted to go that route, I'd probably
> > replace several existing 3700s with 4600s to get additional cache  
> capacity.
> > But, that sounds risky...
> >  
> Risky as in high failure domain concentration and as mentioned above a
> cache-tier with obvious inline journals and thus twice the bandwidth needs
> will likely eat into the write speed capacity of the journals.
> 
> 
> Agreed. On the topic of journals and double bandwidth, am I correct in
> thinking that btrfs (as insane as it may be) does not require double
> bandwidth like xfs? Furthermore with bluestore being close to stable, will
> my architecture need to change?
> 
BTRFS at this point is indeed a bit insane, given the current levels of
support, issues (search the ML archives) and future developments. 
And you'll still wind up with double writes most likely, IIRC.

These aspects of Bluestore have been discussed here recently, too.
Your SSD/NVMe space requirements will go down, but if you want to have the
same speeds and more importantly low latencies you'll wind up with all
writes going through them again, so endurance wise you're still in that
"Lets make SSDs great again" hellhole. 

> 
> If (and seems to be a big IF) you can find them, the Samsung PM1725a 1.6TB
> seems to be a) cheaper and b) at 2GB/s write speed more likely to be
> suitable for double duty.
> Similar (slightly better on paper) endurance than the P4600, so keep that
> in mind, too.
> 
> 
> My vendor is an HPC vendor so /maybe/ they have access to these elusive
> creatures. In which case, how many do you want? Haha
> 
I was just looking at availability with a few google searches, our current
needs are amply satisfied with S37xx SSDs, no need for NVMes really.
But as things are going, maybe I'll be forced to Optane and friends simply
by lack of alternatives. 

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals

2017-06-21 Thread Brady Deetz
On Jun 21, 2017 8:15 PM, "Christian Balzer"  wrote:

On Wed, 21 Jun 2017 19:44:08 -0500 Brady Deetz wrote:

> Hello,
> I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I have
12
> osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe
> drives providing 10GB journals for groups of 12 6TB spinning rust drives
> and 2x lacp 40gbps ethernet.
>
> Our hardware provider is recommending that we start deploying P4600 drives
> in place of our P3700s due to availability.
>
Welcome to the club and make sure to express your displeasure about
Intel's "strategy" to your vendor.

The P4600s are a poor replacement for P3700s and also still just
"announced" according to ARK.

Are you happy with your current NVMes?
Firstly as in, what is their wearout, are you expecting them to easily
survive 5 years at the current rate?
Secondly, how about speed? with 12 HDDs and 1GB/s write capacity of the
NVMe I'd expect them to not be a bottleneck in nearly all real life
situations.

Keep in mind that 1.6TB P4600 is going to last about as long as your 400GB
P3700, so if wear-out is a concern, don't put more stress on them.


Oddly enough, the Intel tools are telling me that we've only used about 10%
of each drive's endurance over the past year. This honestly surprises me
due to our workload, but maybe I'm thinking my researchers are doing more
science than they actually are.


Also the P4600 is only slightly faster in writes than the P3700, so that's
where putting more workload onto them is going to be a notable issue.

> I've seen some talk on here regarding this, but wanted to throw an idea
> around. I was okay throwing away 280GB of fast capacity for the purpose of
> providing reliable journals. But with as much free capacity as we'd have
> with a 4600, maybe I could use that extra capacity as a cache tier for
> writes on an rbd ec pool. If I wanted to go that route, I'd probably
> replace several existing 3700s with 4600s to get additional cache
capacity.
> But, that sounds risky...
>
Risky as in high failure domain concentration and as mentioned above a
cache-tier with obvious inline journals and thus twice the bandwidth needs
will likely eat into the write speed capacity of the journals.


Agreed. On the topic of journals and double bandwidth, am I correct in
thinking that btrfs (as insane as it may be) does not require double
bandwidth like xfs? Furthermore with bluestore being close to stable, will
my architecture need to change?


If (and seems to be a big IF) you can find them, the Samsung PM1725a 1.6TB
seems to be a) cheaper and b) at 2GB/s write speed more likely to be
suitable for double duty.
Similar (slightly better on paper) endurance than the P4600, so keep that
in mind, too.


My vendor is an HPC vendor so /maybe/ they have access to these elusive
creatures. In which case, how many do you want? Haha


Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals

2017-06-21 Thread Christian Balzer
On Wed, 21 Jun 2017 19:44:08 -0500 Brady Deetz wrote:

> Hello,
> I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I have 12
> osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe
> drives providing 10GB journals for groups of 12 6TB spinning rust drives
> and 2x lacp 40gbps ethernet.
> 
> Our hardware provider is recommending that we start deploying P4600 drives
> in place of our P3700s due to availability.
> 
Welcome to the club and make sure to express your displeasure about
Intel's "strategy" to your vendor.

The P4600s are a poor replacement for P3700s and also still just
"announced" according to ARK. 

Are you happy with your current NVMes?
Firstly as in, what is their wearout, are you expecting them to easily
survive 5 years at the current rate?
Secondly, how about speed? with 12 HDDs and 1GB/s write capacity of the
NVMe I'd expect them to not be a bottleneck in nearly all real life
situations.

Keep in mind that 1.6TB P4600 is going to last about as long as your 400GB
P3700, so if wear-out is a concern, don't put more stress on them.

Also the P4600 is only slightly faster in writes than the P3700, so that's
where putting more workload onto them is going to be a notable issue.

> I've seen some talk on here regarding this, but wanted to throw an idea
> around. I was okay throwing away 280GB of fast capacity for the purpose of
> providing reliable journals. But with as much free capacity as we'd have
> with a 4600, maybe I could use that extra capacity as a cache tier for
> writes on an rbd ec pool. If I wanted to go that route, I'd probably
> replace several existing 3700s with 4600s to get additional cache capacity.
> But, that sounds risky...
> 
Risky as in high failure domain concentration and as mentioned above a
cache-tier with obvious inline journals and thus twice the bandwidth needs
will likely eat into the write speed capacity of the journals.

If (and seems to be a big IF) you can find them, the Samsung PM1725a 1.6TB
seems to be a) cheaper and b) at 2GB/s write speed more likely to be
suitable for double duty.
Similar (slightly better on paper) endurance than the P4600, so keep that
in mind, too.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Transitioning to Intel P4600 from P3700 Journals

2017-06-21 Thread Brady Deetz
Hello,
I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I have 12
osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe
drives providing 10GB journals for groups of 12 6TB spinning rust drives
and 2x lacp 40gbps ethernet.

Our hardware provider is recommending that we start deploying P4600 drives
in place of our P3700s due to availability.

I've seen some talk on here regarding this, but wanted to throw an idea
around. I was okay throwing away 280GB of fast capacity for the purpose of
providing reliable journals. But with as much free capacity as we'd have
with a 4600, maybe I could use that extra capacity as a cache tier for
writes on an rbd ec pool. If I wanted to go that route, I'd probably
replace several existing 3700s with 4600s to get additional cache capacity.
But, that sounds risky...

What do you guys think?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph packages for Debian Stretch?

2017-06-21 Thread Christian Balzer

Hello,

On Wed, 21 Jun 2017 11:15:26 +0200 Fabian Grünbichler wrote:

> On Wed, Jun 21, 2017 at 05:30:02PM +0900, Christian Balzer wrote:
> > 
> > Hello,
> > 
> > On Wed, 21 Jun 2017 09:47:08 +0200 (CEST) Alexandre DERUMIER wrote:
> >   
> > > Hi,
> > > 
> > > Proxmox is maintening a ceph-luminous repo for stretch
> > > 
> > > http://download.proxmox.com/debian/ceph-luminous/
> > > 
> > > 
> > > git is here, with patches and modifications to get it work
> > > https://git.proxmox.com/?p=ceph.git;a=summary
> > >  
> > While this is probably helpful for the changes needed, my quest is for
> > Jewel (really all supported builds) for Stretch.
> > And not whenever Luminous gets released, but within the next 10 days.  
> 
> I think you should be able to just backport the needed commits from
> http://tracker.ceph.com/issues/19884 on top of v10.2.7, bump the version
> in debian/changelog and use dpkg-buildpackage (or wrapper of your
> choice) to rebuild the packages. Building takes a while though ;)
> 
> Alternatively use the slightly outdated stock Debian packages (based on
> 10.2.5 with slightly deviating packaging and the patches in [1]) and
> switch over to the official ones when they are available.
>
That's what I'm doing now of course, but was hoping to get to 10.2.7
before this cluster goes into production.

At least it seems that Jewel is on the backport list, so I'll
just twiddle my thumbs for the time being as I simply don't have the time
to hand-hold more FLOSSy bits.
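
(For anyone who does want to go the rebuild route Fabian describes, a rough
sketch; the exact commits come from the tracker issue and are left as
placeholders:)

git clone https://github.com/ceph/ceph.git && cd ceph
git checkout v10.2.7
git submodule update --init --recursive
git cherry-pick <commits from tracker 19884>        # the Stretch build fixes
dch --local +stretch "Rebuild for Debian Stretch"   # bump debian/changelog
dpkg-buildpackage -us -uc -b                        # takes a while, as noted above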

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon Create currently at the state of probing

2017-06-21 Thread David Turner
You can specify an option in ceph-deploy to tell it which release of ceph
to install, jewel, kraken, hammer, etc.  `ceph-deploy --release jewel`
would pin the command to using jewel instead of kraken.
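
For example (a sketch; the host name comes from the thread below):

ceph-deploy install --release jewel r710e   # pin the rebuilt node to jewel packages
ceph-deploy mon add r710e                   # then add the monitor back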

While running a mixed environment is supported, it should always be tested
before assuming it will work for you in production.  The Mons are quick
enough to upgrade; I always do them together.  After that, I upgrade half of
my OSDs in a test environment and leave it there for a couple of weeks (or
until adequate testing is done) before upgrading the remaining OSDs and
again waiting until the testing is done.  I would probably do the MDS before
the OSDs, but I don't usually think about that since I don't have them in a
production environment.  Lastly I would test upgrading the clients (vm
hosts, RGW, kernel clients, etc) and test this state the most thoroughly.
In production I haven't had to worry about an upgrade taking longer than a
few hours with over 60 OSD nodes, 5 mons, and a dozen clients.  I just
don't see a need to run in a mixed environment in production, even if it is
supported.

Back to your problem with adding in the mon.  Do your existing mons know
about the third mon, or have you removed it from their running config?  It
might be worth double checking their config file and restarting the daemons
after you know they will pick up the correct settings.  It's hard for me to
help with this part as I've been lucky enough not to have any problems with
the docs online for this when it's come up.  I've replaced 5 mons without
any issues.  I didn't use ceph-deploy, except to install the packages,
though, and did the manual steps for it.

Hopefully adding the mon back on Jewel fixes the issue.  That would be the
easiest outcome.  I don't know that the Ceph team has tested adding
upgraded mons to an old quorum.


On Wed, Jun 21, 2017 at 4:52 PM Jim Forde  wrote:

> David,
>
> Thanks for the reply.
>
>
>
> The scenario:
>
> Monitor node fails for whatever reason, Bad blocks in HD, or Motherboard
> fail, whatever.
>
>
>
> Procedure:
>
> Remove the monitor from the cluster, replace hardware, reinstall OS and
> add monitor to cluster.
>
>
>
> That is exactly what I did. However, my ceph-deploy node had already been
> upgraded to Kraken.
>
> The goal is to not use this as an upgrade path per se, but to recover from
> a failed monitor node in a cluster where there is an upgrade in progress.
>
>
>
> The upgrade notes for Jewel to Kraken say you may upgrade OSDs, Monitors,
> and MDSs in any order. Perhaps I am reading too much into this, but I took
> it as I could proceed with the upgrade at my leisure. Making sure each node
> is successfully upgraded before proceeding to the next node. The
> implication is that I can run the cluster with different version daemons
> (at least during the upgrade process).
>
>
>
> So that brings me to the problem at hand.
>
> What is the correct procedure for replacing a failed Monitor Node,
> especially if the failed Monitor is a mon_initial_member?
>
> Does it have to be the same version as the other Monitors in the cluster?
>
> I do have a public network statement in the ceph.conf file.
>
> The monitor r710e is listed as one of the mon_initial_members in ceph.conf
> with the correct IP address, but the error message is:
> “[r710e][WARNIN] r710e is not defined in `mon initial members`”
>
> Also “[r710e][WARNIN] monitor r710e does not exist in monmap”
> Should I manually inject r710e in the monmap?
>
>
>
>
>
>
>
> *INFO*
>
>
>
> *ceph.conf*
>
> cat /etc/ceph/ceph.conf
> [global]
> fsid = 0be01315-7928-4037-ae7c-1b0cd36e52b8
> mon_initial_members = r710g,r710f,r710e
> mon_host = 10.0.40.27,10.0.40.26,10.0.40.25
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public network = 10.0.40.0/24
> cluster network = 10.0.50.0/24
>
> [mon]
> mon host = r710g,r710f,r710e
> mon addr = 10.0.40.27:6789,10.0.40.26:6789,10.0.40.25:6789
>
>
>
>
>
> *monmap*
>
> monmaptool: monmap file /tmp/monmap
> epoch 12
> fsid 0be01315-7928-4037-ae7c-1b0cd36e52b8
> last_changed 2017-06-15 08:15:10.542055
> created 2016-11-17 11:42:18.481472
> 0: 10.0.40.26:6789/0 mon.r710f
> 1: 10.0.40.27:6789/0 mon.r710g
>
>
>
>
>
> *Status*
>
> ceph -s
> cluster 0be01315-7928-4037-ae7c-1b0cd36e52b8
>  health HEALTH_OK
>  monmap e12: 2 mons at {r710f=
> 10.0.40.26:6789/0,r710g=10.0.40.27:6789/0}
> election epoch 252, quorum 0,1 r710f,r710g
>  osdmap e7017: 16 osds: 16 up, 16 in
> flags sortbitwise,require_jewel_osds
>   pgmap v14484684: 256 pgs, 1 pools, 218 GB data, 56119 objects
> 661 GB used, 8188 GB / 8849 GB avail
>  256 active+clean
>   client io 8135 B/s rd, 44745 B/s wr, 0 op/s rd, 5 op/s wr
>
>
>
>
>
> PS.
>
>
>
> Tried this too
>
> ceph mon remove r710e
> mon.r710e does not exist or has already been removed
>
>
>
>
>
>
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Monday, June 19, 2017 12:5

Re: [ceph-users] Mon Create currently at the state of probing

2017-06-21 Thread Jim Forde
David,
Thanks for the reply.

The scenario:
Monitor node fails for whatever reason, Bad blocks in HD, or Motherboard fail, 
whatever.

Procedure:
Remove the monitor from the cluster, replace hardware, reinstall OS and add 
monitor to cluster.

That is exactly what I did. However, my ceph-deploy node had already been 
upgraded to Kraken.
The goal is to not use this as an upgrade path per se, but to recover from a 
failed monitor node in a cluster where there is an upgrade in progress.

The upgrade notes for Jewel to Kraken say you may upgrade OSDs, Monitors, and 
MDSs in any order. Perhaps I am reading too much into this, but I took it as I 
could proceed with the upgrade at my leisure. Making sure each node is 
successfully upgraded before proceeding to the next node. The implication is 
that I can run the cluster with different version daemons (at least during the 
upgrade process).

So that brings me to the problem at hand.
What is the correct procedure for replacing a failed Monitor Node, especially 
if the failed Monitor is a mon_initial_member?
Does it have to be the same version as the other Monitors in the cluster?
I do have a public network statement in the ceph.conf file.
The monitor r710e is listed as one of the mon_initial_members in ceph.conf with 
the correct IP address, but the error message is:
“[r710e][WARNIN] r710e is not defined in `mon initial members`”
Also “[r710e][WARNIN] monitor r710e does not exist in monmap”
Should I manually inject r710e in the monmap?
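
(For reference, the documented manual procedure would look roughly like the
sketch below, using the names from the config that follows and assuming the
default mon data directory:)

mkdir /var/lib/ceph/mon/ceph-r710e
ceph auth get mon. -o /tmp/mon.keyring   # mon keyring from the existing quorum
ceph mon getmap -o /tmp/monmap           # current monmap (r710e being absent is expected)
ceph-mon -i r710e --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-r710e
systemctl start ceph-mon@r710e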



INFO

ceph.conf
cat /etc/ceph/ceph.conf
[global]
fsid = 0be01315-7928-4037-ae7c-1b0cd36e52b8
mon_initial_members = r710g,r710f,r710e
mon_host = 10.0.40.27,10.0.40.26,10.0.40.25
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.0.40.0/24
cluster network = 10.0.50.0/24

[mon]
mon host = r710g,r710f,r710e
mon addr = 10.0.40.27:6789,10.0.40.26:6789,10.0.40.25:6789


monmap
monmaptool: monmap file /tmp/monmap
epoch 12
fsid 0be01315-7928-4037-ae7c-1b0cd36e52b8
last_changed 2017-06-15 08:15:10.542055
created 2016-11-17 11:42:18.481472
0: 10.0.40.26:6789/0 mon.r710f
1: 10.0.40.27:6789/0 mon.r710g


Status
ceph -s
cluster 0be01315-7928-4037-ae7c-1b0cd36e52b8
 health HEALTH_OK
 monmap e12: 2 mons at {r710f=10.0.40.26:6789/0,r710g=10.0.40.27:6789/0}
election epoch 252, quorum 0,1 r710f,r710g
 osdmap e7017: 16 osds: 16 up, 16 in
flags sortbitwise,require_jewel_osds
  pgmap v14484684: 256 pgs, 1 pools, 218 GB data, 56119 objects
661 GB used, 8188 GB / 8849 GB avail
 256 active+clean
  client io 8135 B/s rd, 44745 B/s wr, 0 op/s rd, 5 op/s wr


PS.

Tried this too
ceph mon remove r710e
mon.r710e does not exist or has already been removed



From: David Turner [mailto:drakonst...@gmail.com]
Sent: Monday, June 19, 2017 12:58 PM
To: Jim Forde ; Sasha Litvak 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mon Create currently at the state of probing

Question... Why are you reinstalling the node, removing the mon from the 
cluster, and adding it back into the cluster to upgrade to Kraken?  The upgrade 
path from 10.2.5 to 11.2.0 is an acceptable upgrade path.  If you just needed 
to reinstall the OS for some reason, then you can keep the 
/var/lib/ceph/mon/r710e/ folder intact and not need to remove/re-add the mon 
to reinstall the OS.  Even if you upgraded from 14.04 to 16.04, this would 
work.  You would want to change the upstart file in the daemon's folder to 
systemd and make sure it works with systemctl just fine, but the daemon itself 
would be fine.

If you are hell-bent on doing this the hardest way I've ever heard of, then you 
might want to check out this Note from the docs for adding/removing a mon.  
Since you are far enough removed from the initial ceph-deploy, you have removed 
r710e from your configuration and if you don't have a public network statement 
in your ceph.conf file... that could be your problem for the probing.

http://docs.ceph.com/docs/kraken/rados/deployment/ceph-deploy-mon/
"

Note


When adding a monitor on a host that was not in hosts initially defined with 
the ceph-deploy new command, a public network statement needs to be added to 
the ceph.conf file."


On Mon, Jun 19, 2017 at 1:09 PM Jim Forde <j...@mninc.net> wrote:
No, I don’t think Ubuntu 14.04 has it enabled by default.
Double checked.
Sudo ufw status
Status: inactive.
No other symptoms of a firewall.

From: Sasha Litvak 
[mailto:alexander.v.lit...@gmail.com]
Sent: Sunday, June 18, 2017 11:10 PM
To: Jim Forde <j...@mninc.net>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mon Create currently at the state of probing

Do you have firewall on on new server by any chance?

On Sun, Jun 18, 2017 at 8:18 PM, Jim Forde <j...@mninc.net> wrote:
I have an eight node ceph cluster running Jewel 10.2.5.
One Ceph-Deploy node. Four

Re: [ceph-users] red IO hang (was disk timeouts in libvirt/qemu VMs...)

2017-06-21 Thread Jason Dillaman
Do your VMs or OSDs show blocked requests? If you disable scrub or
restart the blocked OSD, does the issue go away? If yes, it most
likely is this issue [1].

[1] http://tracker.ceph.com/issues/20041
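
A sketch of the quick checks that implies (the scrub flags are cluster-wide and
should be unset again afterwards):

ceph health detail | grep -i blocked   # any "requests are blocked" warnings?
ceph osd set noscrub                   # pause scrubbing while testing
ceph osd set nodeep-scrub
ceph osd unset noscrub
ceph osd unset nodeep-scrub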

On Wed, Jun 21, 2017 at 3:33 PM, Hall, Eric  wrote:
> The VMs are using stock Ubuntu14/16 images so yes, there is the default 
> “/sbin/fstrim --all” in /etc/cron.weekly/fstrim.
>
> --
> Eric
>
> On 6/21/17, 1:58 PM, "Jason Dillaman"  wrote:
>
> Are some or many of your VMs issuing periodic fstrims to discard
> unused extents?
>
> On Wed, Jun 21, 2017 at 2:36 PM, Hall, Eric  
> wrote:
> > After following/changing all suggested items (turning off exclusive-lock
> > (and associated object-map and fast-diff), changing host cache behavior,
> > etc.) this is still a blocking issue for many uses of our OpenStack/Ceph
> > installation.
> >
> >
> >
> > We have upgraded Ceph to 10.2.7, are running 4.4.0-62 or later kernels 
> on
> > all storage, compute hosts, and VMs, with libvirt 1.3.1 on compute 
> hosts.
> > Have also learned quite a bit about producing debug logs. ;)
> >
> >
> >
> > I’ve followed the related threads since March with bated breath, but 
> still
> > find no resolution.
> >
> >
> >
> > Previously, I got pulled away before I could produce/report discussed 
> debug
> > info, but am back on the case now. Please let me know how I can help
> > diagnose and resolve this problem.
> >
> >
> >
> > Any assistance appreciated,
> >
> > --
> >
> > Eric
> >
> >
> >
> > On 3/28/17, 3:05 AM, "Marius Vaitiekunas" 
> > wrote:
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Mar 27, 2017 at 11:17 PM, Peter Maloney
> >  wrote:
> >
> > I can't guarantee it's the same as my issue, but from that it sounds the
> > same.
> >
> > Jewel 10.2.4, 10.2.5 tested
> > hypervisors are proxmox qemu-kvm, using librbd
> > 3 ceph nodes with mon+osd on each
> >
> > -faster journals, more disks, bcache, rbd_cache, fewer VMs on ceph, iops
> > and bw limits on client side, jumbo frames, etc. all improve/smooth out
> > performance and mitigate the hangs, but don't prevent it.
> > -hangs are usually associated with blocked requests (I set the complaint
> > time to 5s to see them)
> > -hangs are very easily caused by rbd snapshot + rbd export-diff to do
> > incremental backup (one snap persistent, plus one more during backup)
> > -when qemu VM io hangs, I have to kill -9 the qemu process for it to
> > stop. Some broken VMs don't appear to be hung until I try to live
> > migrate them (live migrating all VMs helped test solutions)
> >
> > Finally I have a workaround... disable exclusive-lock, object-map, and
> > fast-diff rbd features (and restart clients via live migrate).
> > (object-map and fast-diff appear to have no effect on dif or export-diff
> > ... so I don't miss them). I'll file a bug at some point (after I move
> > all VMs back and see if it is still stable). And one other user on IRC
> > said this solved the same problem (also using rbd snapshots).
> >
> > And strangely, they don't seem to hang if I put back those features,
> > until a few days later (making testing much less easy...but now I'm very
> > sure removing them prevents the issue)
> >
> > I hope this works for you (and maybe gets some attention from devs too),
> > so you don't waste months like me.
> >
> >
> > On 03/27/17 19:31, Hall, Eric wrote:
> >> In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 
> jewel),
> >> using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute 
> and
> >> ceph hosts, we occasionally see hung processes (usually during boot, 
> but
> >> otherwise as well), with errors reported in the instance logs as shown
> >> below.  Configuration is vanilla, based on openstack/ceph docs.
> >>
> >> Neither the compute hosts nor the ceph hosts appear to be overloaded in
> >> terms of memory or network bandwidth, none of the 67 osds are over 80% 
> full,
> >> nor do any of them appear to be overwhelmed in terms of IO.  Compute 
> hosts
> >> and ceph cluster are connected via a relatively quiet 1Gb network, 
> with an
> >> IBoE net between the ceph nodes.  Neither network appears overloaded.
> >>
> >> I don’t see any related (to my eye) errors in client or server logs, 
> even
> >> with 20/20 logging from various components (rbd, rados, client,
> >> objectcacher, etc.)  I’ve increased the qemu file descriptor limit
> >> (currently 64k... overkill for sure.)
> >>
> >> It “feels” like a performance problem, but I can’t find any capacity 
> issues
> >> or constraining bottlenecks.
> >>
> >> Any suggestions or insights into this situation are appreciat

Re: [ceph-users] red IO hang (was disk timeouts in libvirt/qemu VMs...)

2017-06-21 Thread Hall, Eric
The VMs are using stock Ubuntu14/16 images so yes, there is the default 
“/sbin/fstrim --all” in /etc/cron.weekly/fstrim.

-- 
Eric

On 6/21/17, 1:58 PM, "Jason Dillaman"  wrote:

Are some or many of your VMs issuing periodic fstrims to discard
unused extents?

On Wed, Jun 21, 2017 at 2:36 PM, Hall, Eric  
wrote:
> After following/changing all suggested items (turning off exclusive-lock
> (and associated object-map and fast-diff), changing host cache behavior,
> etc.) this is still a blocking issue for many uses of our OpenStack/Ceph
> installation.
>
>
>
> We have upgraded Ceph to 10.2.7, are running 4.4.0-62 or later kernels on
> all storage, compute hosts, and VMs, with libvirt 1.3.1 on compute hosts.
> Have also learned quite a bit about producing debug logs. ;)
>
>
>
> I’ve followed the related threads since March with bated breath, but still
> find no resolution.
>
>
>
> Previously, I got pulled away before I could produce/report discussed 
debug
> info, but am back on the case now. Please let me know how I can help
> diagnose and resolve this problem.
>
>
>
> Any assistance appreciated,
>
> --
>
> Eric
>
>
>
> On 3/28/17, 3:05 AM, "Marius Vaitiekunas" 
> wrote:
>
>
>
>
>
>
>
> On Mon, Mar 27, 2017 at 11:17 PM, Peter Maloney
>  wrote:
>
> I can't guarantee it's the same as my issue, but from that it sounds the
> same.
>
> Jewel 10.2.4, 10.2.5 tested
> hypervisors are proxmox qemu-kvm, using librbd
> 3 ceph nodes with mon+osd on each
>
> -faster journals, more disks, bcache, rbd_cache, fewer VMs on ceph, iops
> and bw limits on client side, jumbo frames, etc. all improve/smooth out
> performance and mitigate the hangs, but don't prevent it.
> -hangs are usually associated with blocked requests (I set the complaint
> time to 5s to see them)
> -hangs are very easily caused by rbd snapshot + rbd export-diff to do
> incremental backup (one snap persistent, plus one more during backup)
> -when qemu VM io hangs, I have to kill -9 the qemu process for it to
> stop. Some broken VMs don't appear to be hung until I try to live
> migrate them (live migrating all VMs helped test solutions)
>
> Finally I have a workaround... disable exclusive-lock, object-map, and
> fast-diff rbd features (and restart clients via live migrate).
> (object-map and fast-diff appear to have no effect on dif or export-diff
> ... so I don't miss them). I'll file a bug at some point (after I move
> all VMs back and see if it is still stable). And one other user on IRC
> said this solved the same problem (also using rbd snapshots).
>
> And strangely, they don't seem to hang if I put back those features,
> until a few days later (making testing much less easy...but now I'm very
> sure removing them prevents the issue)
>
> I hope this works for you (and maybe gets some attention from devs too),
> so you don't waste months like me.
>
>
> On 03/27/17 19:31, Hall, Eric wrote:
>> In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel),
>> using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and
>> ceph hosts, we occasionally see hung processes (usually during boot, but
>> otherwise as well), with errors reported in the instance logs as shown
>> below.  Configuration is vanilla, based on openstack/ceph docs.
>>
>> Neither the compute hosts nor the ceph hosts appear to be overloaded in
>> terms of memory or network bandwidth, none of the 67 osds are over 80% 
full,
>> nor do any of them appear to be overwhelmed in terms of IO.  Compute 
hosts
>> and ceph cluster are connected via a relatively quiet 1Gb network, with 
an
>> IBoE net between the ceph nodes.  Neither network appears overloaded.
>>
>> I don’t see any related (to my eye) errors in client or server logs, even
>> with 20/20 logging from various components (rbd, rados, client,
>> objectcacher, etc.)  I’ve increased the qemu file descriptor limit
>> (currently 64k... overkill for sure.)
>>
>> It “feels” like a performance problem, but I can’t find any capacity 
issues
>> or constraining bottlenecks.
>>
>> Any suggestions or insights into this situation are appreciated.  Thank
>> you for your time,
>> --
>> Eric
>>
>>
>> [Fri Mar 24 20:30:40 2017] INFO: task jbd2/vda1-8:226 blocked for more
>> than 120 seconds.
>> [Fri Mar 24 20:30:40 2017]   Not tainted 3.13.0-52-generic #85-Ubuntu
>> [Fri Mar 24 20:30:40 2017] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Fri Mar 24 20:30:40 2017] jbd2/vda1-8 D 88043fd13180 0   226
>> 2 0x
>> [Fri Mar 24 20:30:4

Re: [ceph-users] red IO hang (was disk timeouts in libvirt/qemu VMs...)

2017-06-21 Thread Jason Dillaman
Are some or many of your VMs issuing periodic fstrims to discard
unused extents?

On Wed, Jun 21, 2017 at 2:36 PM, Hall, Eric  wrote:
> After following/changing all suggested items (turning off exclusive-lock
> (and associated object-map and fast-diff), changing host cache behavior,
> etc.) this is still a blocking issue for many uses of our OpenStack/Ceph
> installation.
>
>
>
> We have upgraded Ceph to 10.2.7, are running 4.4.0-62 or later kernels on
> all storage, compute hosts, and VMs, with libvirt 1.3.1 on compute hosts.
> Have also learned quite a bit about producing debug logs. ;)
>
>
>
> I’ve followed the related threads since March with bated breath, but still
> find no resolution.
>
>
>
> Previously, I got pulled away before I could produce/report discussed debug
> info, but am back on the case now. Please let me know how I can help
> diagnose and resolve this problem.
>
>
>
> Any assistance appreciated,
>
> --
>
> Eric
>
>
>
> On 3/28/17, 3:05 AM, "Marius Vaitiekunas" 
> wrote:
>
>
>
>
>
>
>
> On Mon, Mar 27, 2017 at 11:17 PM, Peter Maloney
>  wrote:
>
> I can't guarantee it's the same as my issue, but from that it sounds the
> same.
>
> Jewel 10.2.4, 10.2.5 tested
> hypervisors are proxmox qemu-kvm, using librbd
> 3 ceph nodes with mon+osd on each
>
> -faster journals, more disks, bcache, rbd_cache, fewer VMs on ceph, iops
> and bw limits on client side, jumbo frames, etc. all improve/smooth out
> performance and mitigate the hangs, but don't prevent it.
> -hangs are usually associated with blocked requests (I set the complaint
> time to 5s to see them)
> -hangs are very easily caused by rbd snapshot + rbd export-diff to do
> incremental backup (one snap persistent, plus one more during backup)
> -when qemu VM io hangs, I have to kill -9 the qemu process for it to
> stop. Some broken VMs don't appear to be hung until I try to live
> migrate them (live migrating all VMs helped test solutions)
>
> Finally I have a workaround... disable exclusive-lock, object-map, and
> fast-diff rbd features (and restart clients via live migrate).
> (object-map and fast-diff appear to have no effect on dif or export-diff
> ... so I don't miss them). I'll file a bug at some point (after I move
> all VMs back and see if it is still stable). And one other user on IRC
> said this solved the same problem (also using rbd snapshots).
>
> And strangely, they don't seem to hang if I put back those features,
> until a few days later (making testing much less easy...but now I'm very
> sure removing them prevents the issue)
>
> I hope this works for you (and maybe gets some attention from devs too),
> so you don't waste months like me.
>
>
> On 03/27/17 19:31, Hall, Eric wrote:
>> In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel),
>> using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and
>> ceph hosts, we occasionally see hung processes (usually during boot, but
>> otherwise as well), with errors reported in the instance logs as shown
>> below.  Configuration is vanilla, based on openstack/ceph docs.
>>
>> Neither the compute hosts nor the ceph hosts appear to be overloaded in
>> terms of memory or network bandwidth, none of the 67 osds are over 80% full,
>> nor do any of them appear to be overwhelmed in terms of IO.  Compute hosts
>> and ceph cluster are connected via a relatively quiet 1Gb network, with an
>> IBoE net between the ceph nodes.  Neither network appears overloaded.
>>
>> I don’t see any related (to my eye) errors in client or server logs, even
>> with 20/20 logging from various components (rbd, rados, client,
>> objectcacher, etc.)  I’ve increased the qemu file descriptor limit
>> (currently 64k... overkill for sure.)
>>
>> It “feels” like a performance problem, but I can’t find any capacity issues
>> or constraining bottlenecks.
>>
>> Any suggestions or insights into this situation are appreciated.  Thank
>> you for your time,
>> --
>> Eric
>>
>>
>> [Fri Mar 24 20:30:40 2017] INFO: task jbd2/vda1-8:226 blocked for more
>> than 120 seconds.
>> [Fri Mar 24 20:30:40 2017]   Not tainted 3.13.0-52-generic #85-Ubuntu
>> [Fri Mar 24 20:30:40 2017] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Fri Mar 24 20:30:40 2017] jbd2/vda1-8 D 88043fd13180 0   226
>> 2 0x
>> [Fri Mar 24 20:30:40 2017]  88003728bbd8 0046
>> 88042690 88003728bfd8
>> [Fri Mar 24 20:30:40 2017]  00013180 00013180
>> 88042690 88043fd13a18
>> [Fri Mar 24 20:30:40 2017]  88043ffb9478 0002
>> 811ef7c0 88003728bc50
>> [Fri Mar 24 20:30:40 2017] Call Trace:
>> [Fri Mar 24 20:30:40 2017]  [] ?
>> generic_block_bmap+0x50/0x50
>> [Fri Mar 24 20:30:40 2017]  [] io_schedule+0x9d/0x140
>> [Fri Mar 24 20:30:40 2017]  [] sleep_on_buffer+0xe/0x20
>> [Fri Mar 24 20:30:40 2017]  [] __wait_on_bit+0x62/0x90
>> [Fri Mar 24 20:30:40 2017]  [] ?
>> generic_block_bmap+0x50/0x50
>> [Fri Mar 2

Re: [ceph-users] red IO hang (was disk timeouts in libvirt/qemu VMs...)

2017-06-21 Thread Hall, Eric
After following/changing all suggested items (turning off exclusive-lock (and 
associated object-map and fast-diff), changing host cache behavior, etc.) this 
is still a blocking issue for many uses of our OpenStack/Ceph installation.

We have upgraded Ceph to 10.2.7, are running 4.4.0-62 or later kernels on all 
storage, compute hosts, and VMs, with libvirt 1.3.1 on compute hosts.  Have 
also learned quite a bit about producing debug logs. ;)

I’ve followed the related threads since March with bated breath, but still find 
no resolution.

Previously, I got pulled away before I could produce/report discussed debug 
info, but am back on the case now. Please let me know how I can help diagnose 
and resolve this problem.

Any assistance appreciated,
--
Eric

On 3/28/17, 3:05 AM, "Marius Vaitiekunas" <mariusvaitieku...@gmail.com> wrote:



On Mon, Mar 27, 2017 at 11:17 PM, Peter Maloney <peter.malo...@brockmann-consult.de> wrote:
I can't guarantee it's the same as my issue, but from that it sounds the
same.

Jewel 10.2.4, 10.2.5 tested
hypervisors are proxmox qemu-kvm, using librbd
3 ceph nodes with mon+osd on each

-faster journals, more disks, bcache, rbd_cache, fewer VMs on ceph, iops
and bw limits on client side, jumbo frames, etc. all improve/smooth out
performance and mitigate the hangs, but don't prevent it.
-hangs are usually associated with blocked requests (I set the complaint
time to 5s to see them)
-hangs are very easily caused by rbd snapshot + rbd export-diff to do
incremental backup (one snap persistent, plus one more during backup)
-when qemu VM io hangs, I have to kill -9 the qemu process for it to
stop. Some broken VMs don't appear to be hung until I try to live
migrate them (live migrating all VMs helped test solutions)

Finally I have a workaround... disable exclusive-lock, object-map, and
fast-diff rbd features (and restart clients via live migrate).
(object-map and fast-diff appear to have no effect on dif or export-diff
... so I don't miss them). I'll file a bug at some point (after I move
all VMs back and see if it is still stable). And one other user on IRC
said this solved the same problem (also using rbd snapshots).

And strangely, they don't seem to hang if I put back those features,
until a few days later (making testing much less easy...but now I'm very
sure removing them prevents the issue)

I hope this works for you (and maybe gets some attention from devs too),
so you don't waste months like me.
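
(For reference, a sketch of the feature change described above; pool/image are
placeholders:)

rbd feature disable <pool>/<image> fast-diff object-map exclusive-lock
rbd info <pool>/<image>   # verify the remaining feature set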

On 03/27/17 19:31, Hall, Eric wrote:
> In an OpenStack (mitaka) cloud, backed by a ceph cluster (10.2.6 jewel), 
> using libvirt/qemu (1.3.1/2.5) hypervisors on Ubuntu 14.04.5 compute and ceph 
> hosts, we occasionally see hung processes (usually during boot, but otherwise 
> as well), with errors reported in the instance logs as shown below.  
> Configuration is vanilla, based on openstack/ceph docs.
>
> Neither the compute hosts nor the ceph hosts appear to be overloaded in terms 
> of memory or network bandwidth, none of the 67 osds are over 80% full, nor do 
> any of them appear to be overwhelmed in terms of IO.  Compute hosts and ceph 
> cluster are connected via a relatively quiet 1Gb network, with an IBoE net 
> between the ceph nodes.  Neither network appears overloaded.
>
> I don’t see any related (to my eye) errors in client or server logs, even 
> with 20/20 logging from various components (rbd, rados, client, objectcacher, 
> etc.)  I’ve increased the qemu file descriptor limit (currently 64k... 
> overkill for sure.)
>
> It “feels” like a performance problem, but I can’t find any capacity issues or 
> constraining bottlenecks.
>
> Any suggestions or insights into this situation are appreciated.  Thank you 
> for your time,
> --
> Eric
>
>
> [Fri Mar 24 20:30:40 2017] INFO: task jbd2/vda1-8:226 blocked for more than 
> 120 seconds.
> [Fri Mar 24 20:30:40 2017]   Not tainted 3.13.0-52-generic #85-Ubuntu
> [Fri Mar 24 20:30:40 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [Fri Mar 24 20:30:40 2017] jbd2/vda1-8 D 88043fd13180 0   226 
>  2 0x
> [Fri Mar 24 20:30:40 2017]  88003728bbd8 0046 
> 88042690 88003728bfd8
> [Fri Mar 24 20:30:40 2017]  00013180 00013180 
> 88042690 88043fd13a18
> [Fri Mar 24 20:30:40 2017]  88043ffb9478 0002 
> 811ef7c0 88003728bc50
> [Fri Mar 24 20:30:40 2017] Call Trace:
> [Fri Mar 24 20:30:40 2017]  [] ? 
> generic_block_bmap+0x50/0x50
> [Fri Mar 24 20:30:40 2017]  [] io_schedule+0x9d/0x140
> [Fri Mar 24 20:30:40 2017]  [] sleep_on_buffer+0xe/0x20
> [Fri Mar 24 20:30:40 2017]  [] __wait_on_bit+0x62/0x90
> [Fri Mar 24 20:30:40 2017]  [] ? 
> generic_block_bmap+0x50/0x50
> [Fri Mar 24 20:30:40 2017]  [] 
> out_of_line_wait_on_bit+0x77/0x90
> [Fri Mar 24 20:30:40 2017]  [] ? 
> autoremove_wake_function+0x40/0x40
> [Fri Mar 24 20:30:40 2017]  [] __wait_on_buffer+0x2a/0x30
> [Fri Mar 24 20:

[ceph-users] OSD returns back and recovery process

2017-06-21 Thread Дмитрий Глушенок
Hello!

It is clear what happens after an OSD goes OUT - PGs are backfilled to other OSDs 
and PGs whose primary copies were on the lost OSD get new primary OSDs. But when 
the OSD comes back, it looks like all the data for which that OSD was holding 
primary copies is read from it and re-written to the other OSDs (to the 
secondary copies). Am I right? If so, what is the reason to re-read copies from 
the returned OSD? Wouldn't it be cheaper to just track the modified ones?

Thank you.

--
Dmitry Glushenok
Jet Infosystems

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] risk mitigation in 2 replica clusters

2017-06-21 Thread ceph
You have a point; it depends on your needs.
Based on recovery time and usage, I may find it acceptable to lock writes
during recovery.

Thank you for that insight.

On 21/06/2017 18:47, David Turner wrote:
> I disagree that Replica 2 will ever truly be sane if you care about your
> data.  The biggest issue with replica 2 has nothing to do with drive
> failures, restarting osds/nodes, power outages, etc.  The biggest issue
> with replica 2 is the min_size.  If you set min_size to 2, then your data
> is locked if you have any copy of the data unavailable.  That's fine since
> you were probably going to set min_size to 1... which you should never do
> ever unless you don't care about your data.
> 
> Too many pronouns, so we're going to say disk 1 and disk 2 are in charge of
> a pg and the only 2 disks with a copy of the data.
> The problem with a min_size of 1 is that if for any reason disk 1 is
> inaccessible and a write is made to disk 2, then before disk 1 is fully
> backfilled and caught up on all of the writes, disk 2 goes down... well now
> your data is inaccessible, but that's not the issue.  The issue is when
> disk 1 comes back up first and the client tries to access the data that it
> wrote earlier to disk 2... except the data isn't there.  The client is
> probably just showing an error somewhere and continuing.  Now it makes some
> writes to disk 1 before disk 2 finishes coming back up.  What can these 2
> disks possibly do to ensure that your data is consistent when both of them
> are back up?
> 
> Now of course we reach THE QUESTION... How likely is this to ever happen
> and what sort of things could cause it if not disk failures or performing
> maintenance on your cluster?  The answer to that is more common than you'd
> like to think.  Does your environment have enough RAM in your OSD nodes to
> adequately handle recovery and not cycle into an OOM killer scenario?  Will
> you ever hit a bug in the code that causes an operation to a PG to segfault
> an OSD?  Those are both things that have happened to multiple clusters I've
> managed and read about on the ML in the last year.  A min_size of 1 would
> very likely lead to data loss in either situation regardless of power
> failures and disk failures.
> 
> Now let's touch back on disk failures.  While backfilling due to adding
> storage, removing storage, or just balancing your cluster you are much more
> likely to lose drives.  During normal operation in a cluster, I would lose
> about 6 drives in a year (2000+ OSDs).  During backfilling (especially
> adding multiple storage nodes), I would lose closer to 1-3 drives per major
> backfilling operation.
> 
> People keep asking about 2 replicas.  People keep saying it's going to be
> viable with bluestore.  I care about my data too much to ever consider it.
> If I was running a cluster where data loss was acceptable, then I would
> absolutely consider it.  If you're thinking about 5 nines of uptime, then 2
> replica will achieve that.  If you're talking about 100% data integrity,
> then 2 replica is not AND WILL NEVER BE for you (no matter what the release
> docs say about bluestore).  If space is your concern, start looking into
> Erasure Coding.   You can save more space and increase redundancy for the
> cost of some performance.
> 
> On Wed, Jun 21, 2017 at 10:56 AM  wrote:
> 
>> 2r on filestore == "I do not care about my data"
>>
>> This is not because of OSD's failure chance
>>
>> When you have a write error (ie data is badly written on the disk,
>> without error reported), your data is just corrupted without hope of
>> redemption
>>
>> Just as you expect your drives to die, expect your drives to "fail
>> silently"
>>
>> With replica 3 and beyond, data CAN be repaired using quorum
>>
>> Replica 2 will become sane in the next release, with bluestore, which
>> uses data checksums
>>
>> On 21/06/2017 16:51, Blair Bethwaite wrote:
>>> Hi all,
>>>
>>> I'm doing some work to evaluate the risks involved in running 2r storage
>>> pools. On the face of it my naive disk failure calculations give me 4-5
>>> nines for a 2r pool of 100 OSDs (no copyset awareness, i.e., secondary
>> disk
>>> failure based purely on chance of any 1 of the remaining 99 OSDs failing
>>> within recovery time). 5 nines is just fine for our purposes, but of
>> course
>>> multiple disk failures are only part of the story.
>>>
>>> The more problematic issue with 2r clusters is that any time you do
>> planned
>>> maintenance (our clusters spend much more time degraded because of
>> regular
>>> upkeep than because of real failures) you're suddenly drastically
>>> increasing the risk of data-loss. So I find myself wondering if there is
>> a
>>> way to tell Ceph I want an extra replica created for a particular PG or
>> set
>>> thereof, e.g., something that would enable the functional equivalent of:
>>> "this OSD/node is going to go offline so please create a 3rd replica in
>>> every PG it is participating in before we shutdown that/those OSD/s"...?

Re: [ceph-users] risk mitigation in 2 replica clusters

2017-06-21 Thread David Turner
I disagree that Replica 2 will ever truly be sane if you care about your
data.  The biggest issue with replica 2 has nothing to do with drive
failures, restarting osds/nodes, power outages, etc.  The biggest issue
with replica 2 is the min_size.  If you set min_size to 2, then your data
is locked if you have any copy of the data unavailable.  That's fine since
you were probably going to set min_size to 1... which you should never do
ever unless you don't care about your data.

Too many pronouns, so we're going to say disk 1 and disk 2 are in charge of
a pg and the only 2 disks with a copy of the data.
The problem with a min_size of 1 is that if for any reason disk 1 is
inaccessible and a write is made to disk 2, then before disk 1 is fully
backfilled and caught up on all of the writes, disk 2 goes down... well now
your data is inaccessible, but that's not the issue.  The issue is when
disk 1 comes back up first and the client tries to access the data that it
wrote earlier to disk 2... except the data isn't there.  The client is
probably just showing an error somewhere and continuing.  Now it makes some
writes to disk 1 before disk 2 finishes coming back up.  What can these 2
disks possibly do to ensure that your data is consistent when both of them
are back up?

Now of course we reach THE QUESTION... How likely is this to ever happen
and what sort of things could cause it if not disk failures or performing
maintenance on your cluster?  The answer to that is more common than you'd
like to think.  Does your environment have enough RAM in your OSD nodes to
adequately handle recovery and not cycle into an OOM killer scenario?  Will
you ever hit a bug in the code that causes an operation to a PG to segfault
an OSD?  Those are both things that have happened to multiple clusters I've
managed and read about on the ML in the last year.  A min_size of 1 would
very likely lead to data loss in either situation regardless of power
failures and disk failures.

Now let's touch back on disk failures.  While backfilling due to adding
storage, removing storage, or just balancing your cluster you are much more
likely to lose drives.  During normal operation in a cluster, I would lose
about 6 drives in a year (2000+ OSDs).  During backfilling (especially
adding multiple storage nodes), I would lose closer to 1-3 drives per major
backfilling operation.

People keep asking about 2 replicas.  People keep saying it's going to be
viable with bluestore.  I care about my data too much to ever consider it.
If I was running a cluster where data loss was acceptable, then I would
absolutely consider it.  If you're thinking about 5 nines of uptime, then 2
replica will achieve that.  If you're talking about 100% data integrity,
then 2 replica is not AND WILL NEVER BE for you (no matter what the release
docs say about bluestore).  If space is your concern, start looking into
Erasure Coding.   You can save more space and increase redundancy for the
cost of some performance.
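
(For reference, a sketch of checking and pinning min_size; the pool name "rbd"
is only an example:)

ceph osd pool ls detail            # shows size and min_size for every pool
ceph osd pool set rbd min_size 2   # refuse IO once only a single copy is left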

On Wed, Jun 21, 2017 at 10:56 AM  wrote:

> 2r on filestore == "I do not care about my data"
>
> This is not because of OSD's failure chance
>
> When you have a write error (ie data is badly written on the disk,
> without error reported), your data is just corrupted without hope of
> redemption
>
> Just as you expect your drives to die, expect your drives to "fail
> silently"
>
> With replica 3 and beyond, data CAN be repaired using quorum
>
> Replica 2 will become sane in the next release, with bluestore, which
> uses data checksums
>
> On 21/06/2017 16:51, Blair Bethwaite wrote:
> > Hi all,
> >
> > I'm doing some work to evaluate the risks involved in running 2r storage
> > pools. On the face of it my naive disk failure calculations give me 4-5
> > nines for a 2r pool of 100 OSDs (no copyset awareness, i.e., secondary
> disk
> > failure based purely on chance of any 1 of the remaining 99 OSDs failing
> > within recovery time). 5 nines is just fine for our purposes, but of
> course
> > multiple disk failures are only part of the story.
> >
> > The more problematic issue with 2r clusters is that any time you do
> planned
> > maintenance (our clusters spend much more time degraded because of
> regular
> > upkeep than because of real failures) you're suddenly drastically
> > increasing the risk of data-loss. So I find myself wondering if there is
> a
> > way to tell Ceph I want an extra replica created for a particular PG or
> set
> > thereof, e.g., something that would enable the functional equivalent of:
> > "this OSD/node is going to go offline so please create a 3rd replica in
> > every PG it is participating in before we shutdown that/those OSD/s"...?
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] risk mitigation in 2 replica clusters

2017-06-21 Thread ceph
2r on filestore == "I do not care about my data"

This is not because of OSD's failure chance

When you have a write error (ie data is badly written on the disk,
without error reported), your data is just corrupted without hope of
redemption

Just as you expect your drives to die, expect your drives to "fail silently"

With replica 3 and beyond, data CAN be repaired using quorum

Replica 2 will become sane in the next release, with bluestore, which
uses data checksums
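
For the record, with 3 replicas an inconsistency flagged by deep scrub is
normally dealt with like this (rough sketch, the pg id is just a placeholder):

$ ceph health detail | grep inconsistent                    # which PGs are affected
$ rados list-inconsistent-obj <pgid> --format=json-pretty   # what disagrees (jewel+)
$ ceph pg repair <pgid>

With only 2 copies and no checksums there is nothing to vote with, which is
the whole point above.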

On 21/06/2017 16:51, Blair Bethwaite wrote:
> Hi all,
> 
> I'm doing some work to evaluate the risks involved in running 2r storage
> pools. On the face of it my naive disk failure calculations give me 4-5
> nines for a 2r pool of 100 OSDs (no copyset awareness, i.e., secondary disk
> failure based purely on chance of any 1 of the remaining 99 OSDs failing
> within recovery time). 5 nines is just fine for our purposes, but of course
> multiple disk failures are only part of the story.
> 
> The more problematic issue with 2r clusters is that any time you do planned
> maintenance (our clusters spend much more time degraded because of regular
> upkeep than because of real failures) you're suddenly drastically
> increasing the risk of data-loss. So I find myself wondering if there is a
> way to tell Ceph I want an extra replica created for a particular PG or set
> thereof, e.g., something that would enable the functional equivalent of:
> "this OSD/node is going to go offline so please create a 3rd replica in
> every PG it is participating in before we shutdown that/those OSD/s"...?
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] risk mitigation in 2 replica clusters

2017-06-21 Thread Blair Bethwaite
Hi all,

I'm doing some work to evaluate the risks involved in running 2r storage
pools. On the face of it my naive disk failure calculations give me 4-5
nines for a 2r pool of 100 OSDs (no copyset awareness, i.e., secondary disk
failure based purely on chance of any 1 of the remaining 99 OSDs failing
within recovery time). 5 nines is just fine for our purposes, but of course
multiple disk failures are only part of the story.

The more problematic issue with 2r clusters is that any time you do planned
maintenance (our clusters spend much more time degraded because of regular
upkeep than because of real failures) you're suddenly drastically
increasing the risk of data-loss. So I find myself wondering if there is a
way to tell Ceph I want an extra replica created for a particular PG or set
thereof, e.g., something that would enable the functional equivalent of:
"this OSD/node is going to go offline so please create a 3rd replica in
every PG it is participating in before we shutdown that/those OSD/s"...?
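
The closest thing I can come up with myself is a blunt, pool-wide
approximation rather than anything PG-targeted - roughly (untested sketch,
pool name illustrative):

$ ceph osd pool set mypool size 3   # grow a third replica ahead of the window
  # ...wait for backfill to finish...
$ ceph osd set noout                # then do the maintenance
$ ceph osd unset noout
$ ceph osd pool set mypool size 2   # shrink back afterwards

...but that obviously moves far more data around than just the PGs living on
the OSDs in question.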

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-21 Thread Peter Maloney
On 06/14/17 11:59, Dan van der Ster wrote:
> Dear ceph users,
>
> Today we had O(100) slow requests which were caused by deep-scrubbing
> of the metadata log:
>
> 2017-06-14 11:07:55.373184 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
> deep-scrub starts
> ...
> 2017-06-14 11:22:04.143903 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
> request 480.140904 seconds old, received at 2017-06-14
> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
> scrub
> ...
> 2017-06-14 11:22:06.729306 osd.155
> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
> deep-scrub ok

This looks just like my problem in my thread on ceph-devel "another
scrub bug? blocked for > 10240.948831 secs​" except that your scrub
eventually finished (mine ran hours before I stopped it manually), and
I'm not using rgw.

Sage commented that it is likely related to snaps being removed at some
point and interacting with scrub.

Restarting the osd that is mentioned there (osd.155 in your case) will
fix it for now. Tuning the scrub settings also changes how it behaves (the
defaults make it happen more rarely than the settings I had before).
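
Concretely that was something like (osd id and values are only what I would
try first, not gospel):

$ systemctl restart ceph-osd@155
$ ceph tell osd.155 injectargs '--osd_scrub_sleep 0.1 --osd_scrub_chunk_max 5'

(injectargs may warn that some of these need a restart to apply; putting them
in ceph.conf and restarting the osd works too.)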


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Degraded objects while OSD is being added/filled

2017-06-21 Thread Andras Pataki

Hi cephers,

I noticed something I don't understand about ceph's behavior when adding 
an OSD.  When I start with a clean cluster (all PG's active+clean) and 
add an OSD (via ceph-deploy for example), the crush map gets updated and 
PGs get reassigned to different OSDs, and the new OSD starts getting 
filled with data.  As the new OSD gets filled, I start seeing PGs in 
degraded states.  Here is an example:


  pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
3164 TB used, 781 TB / 3946 TB avail
   *8017/994261437 objects degraded (0.001%)*
2220581/994261437 objects misplaced (0.223%)
   42393 active+clean
  91 active+remapped+wait_backfill
   9 active+clean+scrubbing+deep
   *   1 active+recovery_wait+degraded*
   1 active+clean+scrubbing
   1 active+remapped+backfilling


Any ideas why there would be any persistent degradation in the cluster 
while the newly added drive is being filled?  It takes perhaps a day or 
two to fill the drive - and during all this time the cluster seems to be 
running degraded.  As data is written to the cluster, the number of 
degraded objects increases over time.  Once the newly added OSD is 
filled, the cluster comes back to clean again.


Here is the PG that is degraded in this picture:

7.87c    1    0    2    0    0    4194304    7    7
active+recovery_wait+degraded    2017-06-20 14:12:44.119921    344610'7
583572:2797    [402,521]    402    [402,521]    402    344610'7
2017-06-16 06:04:55.822503    344610'7    2017-06-16 06:04:55.822503


The newly added osd here is 521.  Before it got added, this PG had two 
replicas clean, but one got forgotten somehow?


Other remapped PGs have 521 in their "up" set but still have the two 
existing copies in their "acting" set - and no degradation is shown.  
Examples:


2.f24    14282016285640510148508013102 3102
active+remapped+wait_backfill    2017-06-20 14:12:42.650308
583553'2033479    583573:2033266    [467,521]    467    [467,499]    467
582430'207    2017-06-16 09:08:51.055131    582036'2030837
2017-05-31 20:37:54.831178
6.2b7d    104990140209980372428746873673 3673
active+remapped+wait_backfill    2017-06-20 14:12:42.070019
583569'165163    583572:342128    [541,37,521]    541    [541,37,532]    541
582430'161890    2017-06-18 09:42:49.148402    582430'161890
2017-06-18 09:42:49.148402
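
In case it helps, this is how I have been inspecting the individual PGs (pg
id taken from the example above):

$ ceph pg 7.87c query > 7.87c.json        # up/acting sets and recovery_state
$ ceph pg dump pgs | grep degraded        # which PGs are currently degraded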


We are running the latest Jewel patch level everywhere (10.2.7). Any 
insights would be appreciated.


Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-21 Thread Sage Weil
On Wed, 21 Jun 2017, Piotr Dałek wrote:
> > > > > I tested on few of our production images and it seems that about 30%
> > > > > is
> > > > > sparse. This will be lost on any cluster wide event (add/remove nodes,
> > > > > PG grow, recovery).
> > > > > 
> > > > > How this is/will be handled in BlueStore?
> > > > 
> > > > BlueStore exposes the same sparseness metadata that enabling the
> > > > filestore seek hole or fiemap options does, so it won't be a problem
> > > > there.
> > > > 
> > > > I think the only thing that we could potentially add is zero detection
> > > > on writes (so that explicitly writing zeros consumes no space).  We'd
> > > > have to be a bit careful measuring the performance impact of that check
> > > > on
> > > > non-zero writes.
> > > 
> > > I saw that RBD (librbd) does that - replacing writes with discards when
> > > buffer
> > > contains only zeros. Some code that does the same in librados could be
> > > added
> > > and it shouldn't impact performance much, current implementation of
> > > mem_is_zero is fast and shouldn't be a big problem.
> > 
> > I'd rather not have librados silently translating requests; I think it
> > makes more sense to do any zero checking in bluestore.  _do_write_small
> > and _do_write_big already break writes into (aligned) chunks; that would
> > be an easy place to add the check.
> 
> That leaves out filestore.
> 
> And while I get your point, doing it on librados level would reduce network
> usage for zeroed out regions as well, and check could be done just once, not
> replica_size times...

In the librbd case I think a client-side check makes sense.

For librados, it's a low level interface with complicated semantics.  
Silently translating a write op to a zero op feels dangerous to me.  
Would a zero range extend the object size, for example?  Or implicitly 
create an object that doesn't exist?  I can't remember.  (It would need to 
match write perfectly for this to be safe.)  The user might also have a 
compound op of multiple operations, which would make swapping one out in 
the middle stranger.  And probably half the librados unit tests would 
stop testing what we thought they were testing.  Etc.

It seems more natural to do this a layer up in librbd or rgw...

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-21 Thread Casey Bodley
That patch looks reasonable. You could also try raising the values of 
osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout on 
that osd in order to trim more at a time.
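
For example (values are only a suggestion, and injectargs may tell you these
need a restart to take effect - setting them in ceph.conf and restarting the
osd also works):

$ ceph tell osd.155 injectargs '--osd-op-thread-suicide-timeout 600 --filestore-op-thread-suicide-timeout 600'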


On 06/21/2017 09:27 AM, Dan van der Ster wrote:

Hi Casey,

I managed to trim up all shards except for that big #54. The others
all trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any
way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
per op locks things up for quite awhile.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx,
bufferlist *in, bufferlist *o
  to_index = op.to_marker;
}

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
size_t max_entries = MAX_TRIM_ENTRIES;

int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix,
max_entries, &keys);


What do you think?

-- Dan




On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley  wrote:

Hi Dan,

That's good news that it can remove 1000 keys at a time without hitting
timeouts. The output of 'du' will depend on when the leveldb compaction
runs. If you do find that compaction leads to suicide timeouts on this osd
(you would see a lot of 'leveldb:' output in the log), consider running
offline compaction by adding 'leveldb compact on mount = true' to the osd
config and restarting.

Casey


On 06/19/2017 11:01 AM, Dan van der Ster wrote:

On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley  wrote:

On 06/14/2017 05:59 AM, Dan van der Ster wrote:

Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155
[2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
request 480.140904 seconds old, received at 2017-06-14
11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
0=[] ondisk+write+known_if_redirected e7752) currently waiting for
scrub
...
2017-06-14 11:22:06.729306 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we setup to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a
few questions:

1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?


It's a good idea to turn that off, yes.

First, make sure that you have configured a default realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm   (you can
determine
realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default


Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.


Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take care
with future radosgw-admin commands on the zone/zonegroup, as they may
revert
log_meta back to true [1].


Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.


2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json
and I don't know how much more it would have written. Is something
supposed to be trimming the mdlog automatically?


There is automated mdlog trimming logic in master, but not jewel/kraken.
And
this logic won't be triggered if there is only one zone [2].


3. ceph df doesn't show the space occupied by omap objects -- is
there an indirect way to see how much space these are using?


You can inspect the osd's omap directory: du -sh
/var/lib/ceph/osd/osd0/current/omap


Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.


4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/

5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?


There is a 'radosgw-admin mdlog trim' command that can be used to trim
them
one --shard-id (from 0 to 63) at a time.
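
Something along these lines should do it (untested sketch - check
'radosgw-admin help' for whether your version wants marker or date bounds on
the trim):

for shard in $(seq 0 63); do
    # the end bound here is only an example; --end-marker is the alternative
    radosgw-admin mdlog trim --shard-id "$shard" --end-date "2017-06-01 00:00:00"
done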

Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-21 Thread Piotr Dałek

On 17-06-21 03:35 PM, Jason Dillaman wrote:

On Wed, Jun 21, 2017 at 3:05 AM, Piotr Dałek  wrote:

I saw that RBD (librbd) does that - replacing writes with discards when
buffer contains only zeros. Some code that does the same in librados could
be added and it shouldn't impact performance much, current implementation of
mem_is_zero is fast and shouldn't be a big problem.


I'm pretty sure the only place where librbd converts a write to a
discard is actually the specialized "writesame" operation used by
tcmu-runner, as an optimization for ESX's initialization of a new
image.


Still, I saw it! ;-)

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-21 Thread Piotr Dałek

On 17-06-21 03:24 PM, Sage Weil wrote:

On Wed, 21 Jun 2017, Piotr Dałek wrote:

On 17-06-14 03:44 PM, Sage Weil wrote:

On Wed, 14 Jun 2017, Paweł Sadowski wrote:

On 04/13/2017 04:23 PM, Piotr Dałek wrote:

On 04/06/2017 03:25 PM, Sage Weil wrote:

On Thu, 6 Apr 2017, Piotr Dałek wrote:

[snip]


I think the solution here is to use sparse_read during recovery.  The
PushOp data representation already supports it; it's just a matter of
skipping the zeros.  The recovery code could also have an option to
check
for fully-zero regions of the data and turn those into holes as
well.  For
ReplicatedBackend, see build_push_op().


So far it turns out that there's even easier solution, we just enabled
"filestore seek hole" on some test cluster and that seems to fix the
problem for us. We'll see if fiemap works too.



Is it safe to enable "filestore seek hole", are there any tests that
verifies that everything related to RBD works fine with this enabled?
Can we make this enabled by default?


We would need to enable it in the qa environment first.  The risk here is
that users run a broad range of kernels and we are exposing ourselves to
any bugs in any kernel version they may run.  I'd prefer to leave it off
by default.


That's a common regression? If not, we could blacklist particular kernels and
call it a day.

>>

We can enable it in the qa suite, though, which covers
centos7 (latest kernel) and ubuntu xenial and trusty.


+1. Do you need some particular PR for that?


Sure.  How about a patch that adds the config option to several of the
files in qa/suites/rados/thrash/thrashers?


OK.


I tested on few of our production images and it seems that about 30% is
sparse. This will be lost on any cluster wide event (add/remove nodes,
PG grow, recovery).

How this is/will be handled in BlueStore?


BlueStore exposes the same sparseness metadata that enabling the
filestore seek hole or fiemap options does, so it won't be a problem
there.

I think the only thing that we could potentially add is zero detection
on writes (so that explicitly writing zeros consumes no space).  We'd
have to be a bit careful measuring the performance impact of that check on
non-zero writes.


I saw that RBD (librbd) does that - replacing writes with discards when buffer
contains only zeros. Some code that does the same in librados could be added
and it shouldn't impact performance much, current implementation of
mem_is_zero is fast and shouldn't be a big problem.


I'd rather not have librados silently translating requests; I think it
makes more sense to do any zero checking in bluestore.  _do_write_small
and _do_write_big already break writes into (aligned) chunks; that would
be an easy place to add the check.


That leaves out filestore.

And while I get your point, doing it at the librados level would reduce network
usage for zeroed-out regions as well, and the check could be done just once, not
replica_size times...


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-21 Thread Jason Dillaman
On Wed, Jun 21, 2017 at 3:05 AM, Piotr Dałek  wrote:
> I saw that RBD (librbd) does that - replacing writes with discards when
> buffer contains only zeros. Some code that does the same in librados could
> be added and it shouldn't impact performance much, current implementation of
> mem_is_zero is fast and shouldn't be a big problem.

I'm pretty sure the only place where librbd converts a write to a
discard is actually the specialized "writesame" operation used by
tcmu-runner, as an optimization for ESX's initialization of a new
image.

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw: scrub causing slow requests in the md log

2017-06-21 Thread Dan van der Ster
Hi Casey,

I managed to trim up all shards except for that big #54. The others
all trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any
way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
per op locks things up for quite awhile.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx,
bufferlist *in, bufferlist *o
 to_index = op.to_marker;
   }

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
   size_t max_entries = MAX_TRIM_ENTRIES;

   int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix,
max_entries, &keys);


What do you think?

-- Dan




On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley  wrote:
> Hi Dan,
>
> That's good news that it can remove 1000 keys at a time without hitting
> timeouts. The output of 'du' will depend on when the leveldb compaction
> runs. If you do find that compaction leads to suicide timeouts on this osd
> (you would see a lot of 'leveldb:' output in the log), consider running
> offline compaction by adding 'leveldb compact on mount = true' to the osd
> config and restarting.
>
> Casey
>
>
> On 06/19/2017 11:01 AM, Dan van der Ster wrote:
>>
>> On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley  wrote:
>>>
>>> On 06/14/2017 05:59 AM, Dan van der Ster wrote:

 Dear ceph users,

 Today we had O(100) slow requests which were caused by deep-scrubbing
 of the metadata log:

 2017-06-14 11:07:55.373184 osd.155
 [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
 deep-scrub starts
 ...
 2017-06-14 11:22:04.143903 osd.155
 [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
 request 480.140904 seconds old, received at 2017-06-14
 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
 meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
 scrub
 ...
 2017-06-14 11:22:06.729306 osd.155
 [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
 deep-scrub ok

 We have log_meta: true, log_data: false on this (our only) region [1],
 which IIRC we setup to enable indexless buckets.

 I'm obviously unfamiliar with rgw meta and data logging, and have a
 few questions:

1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
 it off when not using multisite?
>>>
>>>
>>> It's a good idea to turn that off, yes.
>>>
>>> First, make sure that you have configured a default realm/zonegroup/zone:
>>>
>>> $ radosgw-admin realm default --rgw-realm   (you can
>>> determine
>>> realm name from 'radosgw-admin realm list')
>>> $ radosgw-admin zonegroup default --rgw-zonegroup default
>>> $ radosgw-admin zone default --rgw-zone default
>>>
>> Thanks. This had already been done, as confirmed with radosgw-admin
>> realm get-default.
>>
>>> Then you can modify the zonegroup (aka region):
>>>
>>> $ radosgw-admin zonegroup get > zonegroup.json
>>> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
>>> $ radosgw-admin zonegroup set < zonegroup.json
>>>
>>> Then commit the updated period configuration:
>>>
>>> $ radosgw-admin period update --commit
>>>
>>> Verify that the resulting period contains "log_meta": "false". Take care
>>> with future radosgw-admin commands on the zone/zonegroup, as they may
>>> revert
>>> log_meta back to true [1].
>>>
>> Great, this worked. FYI (and for others trying this in future), the
>> period update --commit blocks all rgws for ~30s while they reload the
>> realm.
>>
2. I started dumping the output of radosgw-admin mdlog list, and
 cancelled it after a few minutes. It had already dumped 3GB of json
 and I don't know how much more it would have written. Is something
 supposed to be trimming the mdlog automatically?
>>>
>>>
>>> There is automated mdlog trimming logic in master, but not jewel/kraken.
>>> And
>>> this logic won't be triggered if there is only one zone [2].
>>>
3. ceph df doesn't show the space occupied by omap objects -- is
 there an indirect way to see how much space these are using?
>>>
>>>
>>> You can inspect the osd's omap directory: du -sh
>>> /var/lib/ceph/osd/osd0/current/omap
>>>
>> Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
>> ~100-300MB on other OSDs.
>>
4. mdlog status has markers going back to 2016-10, see [2]. I suppose
 we're not using this feature correctly? :-/

5. Suppose I were to set log_meta: false -- how would I delete these
 log entries now that they are not needed?

Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-21 Thread Sage Weil
On Wed, 21 Jun 2017, Piotr Dałek wrote:
> On 17-06-14 03:44 PM, Sage Weil wrote:
> > On Wed, 14 Jun 2017, Paweł Sadowski wrote:
> > > On 04/13/2017 04:23 PM, Piotr Dałek wrote:
> > > > On 04/06/2017 03:25 PM, Sage Weil wrote:
> > > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > > > > [snip]
> > > > > 
> > > > > I think the solution here is to use sparse_read during recovery.  The
> > > > > PushOp data representation already supports it; it's just a matter of
> > > > > skipping the zeros.  The recovery code could also have an option to
> > > > > check
> > > > > for fully-zero regions of the data and turn those into holes as
> > > > > well.  For
> > > > > ReplicatedBackend, see build_push_op().
> > > > 
> > > > So far it turns out that there's even easier solution, we just enabled
> > > > "filestore seek hole" on some test cluster and that seems to fix the
> > > > problem for us. We'll see if fiemap works too.
> > > > 
> > > 
> > > Is it safe to enable "filestore seek hole", are there any tests that
> > > verifies that everything related to RBD works fine with this enabled?
> > > Can we make this enabled by default?
> > 
> > We would need to enable it in the qa environment first.  The risk here is
> > that users run a broad range of kernels and we are exposing ourselves to
> > any bugs in any kernel version they may run.  I'd prefer to leave it off
> > by default.
> 
> That's a common regression? If not, we could blacklist particular kernels and
> call it a day.
>  > We can enable it in the qa suite, though, which covers
> > centos7 (latest kernel) and ubuntu xenial and trusty.
> 
> +1. Do you need some particular PR for that?

Sure.  How about a patch that adds the config option to several of the 
files in qa/suites/rados/thrash/thrashers?

> > > I tested on few of our production images and it seems that about 30% is
> > > sparse. This will be lost on any cluster wide event (add/remove nodes,
> > > PG grow, recovery).
> > > 
> > > How this is/will be handled in BlueStore?
> > 
> > BlueStore exposes the same sparseness metadata that enabling the
> > filestore seek hole or fiemap options does, so it won't be a problem
> > there.
> > 
> > I think the only thing that we could potentially add is zero detection
> > on writes (so that explicitly writing zeros consumes no space).  We'd
> > have to be a bit careful measuring the performance impact of that check on
> > non-zero writes.
> 
> I saw that RBD (librbd) does that - replacing writes with discards when buffer
> contains only zeros. Some code that does the same in librados could be added
> and it shouldn't impact performance much, current implementation of
> mem_is_zero is fast and shouldn't be a big problem.

I'd rather not have librados silently translating requests; I think it 
makes more sense to do any zero checking in bluestore.  _do_write_small 
and _do_write_big already break writes into (aligned) chunks; that would 
be an easy place to add the check.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flash for mon nodes ?

2017-06-21 Thread Wido den Hollander

> Op 21 juni 2017 om 12:38 schreef Osama Hasebou :
> 
> 
> Hi Guys, 
> 
> Has anyone used flash SSD drives for nodes hosting Monitor nodes only? 
> 
> If yes, any major benefits against just using SAS drives ? 
> 

Yes:
- Less latency
- Faster store compacting
- More reliable

Buy an Intel S3710 or Samsung SM863 SSD for your MONs. Something like 120GB ~ 
240GB.

Wido

> Thanks. 
> 
> Regards, 
> Ossi 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flash for mon nodes ?

2017-06-21 Thread Paweł Sadowski
On 06/21/2017 12:38 PM, Osama Hasebou wrote:
> Hi Guys,
>
> Has anyone used flash SSD drives for nodes hosting Monitor nodes only?
>
> If yes, any major benefits against just using SAS drives ?

We are using such a setup for big (>500 OSDs) clusters. It makes it less
painful when such a cluster rebalances for a long time. During such a
rebalance the monitor store can grow to an insane size (~70 GB in one
case). A store that big is hard to sync when a monitor leaves/joins the
quorum for some reason (the monitor hits timeouts while syncing). Flash
can also make the peering process faster, as all such changes (PG map)
must go through the monitor quorum.
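
To keep an eye on the store and reclaim space after a long rebalance,
something like this works (default paths, mon id is an example):

$ du -sh /var/lib/ceph/mon/*/store.db   # how big the store currently is
$ ceph tell mon.0 compact               # trigger a leveldb compaction

(mon_compact_on_start = true in ceph.conf achieves the same at restart.)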

-- 
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flash for mon nodes ?

2017-06-21 Thread Ashley Merrick
If you just mean normal DC-rated SSDs, then that's what I am running across a
~120 OSD cluster.

When I check them they are mostly idle and see minimal use; however, I can
imagine the lower random latency will always help.

So if you can I would.

,Ashley
Sent from my iPhone

On 21 Jun 2017, at 6:39 PM, Osama Hasebou 
mailto:osama.hase...@csc.fi>> wrote:

Hi Guys,

Has anyone used flash SSD drives for nodes hosting Monitor nodes only?

If yes, any major benefits against just using SAS drives ?

Thanks.

Regards,
Ossi


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritise recovery on specific PGs/OSDs?

2017-06-21 Thread Piotr Dałek

On 17-06-20 02:44 PM, Richard Hesketh wrote:

Is there a way, either by individual PG or by OSD, I can prioritise 
backfill/recovery on a set of PGs which are currently particularly important to 
me?

For context, I am replacing disks in a 5-node Jewel cluster, on a node-by-node 
basis - mark out the OSDs on a node, wait for them to clear, replace OSDs, 
bring up and in, mark out the OSDs on the next set, etc. I've done my first 
node, but the significant CRUSH map changes means most of my data is moving. I 
only currently care about the PGs on my next set of OSDs to replace - the other 
remapped PGs I don't care about settling because they're only going to end up 
moving around again after I do the next set of disks. I do want the PGs 
specifically on the OSDs I am about to replace to backfill because I don't want 
to compromise data integrity by downing them while they host active PGs. If I 
could specifically prioritise the backfill on those PGs/OSDs, I could get on 
with replacing disks without worrying about causing degraded PGs.

I'm in a situation right now where there is merely a couple of dozen PGs on the 
disks I want to replace, which are all remapped and waiting to backfill - but 
there are 2200 other PGs also waiting to backfill because they've moved around 
too, and it's extremely frustrating to be sat waiting to see when the ones I 
care about will finally be handled so I can get on with replacing those disks.


You could prioritize recovery per pool if that would work for you (as others
wrote), or +1 this PR: https://github.com/ceph/ceph/pull/13723 (it's a bit
outdated as I'm constantly low on time, but I promise to push it forward!).
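
The per-pool route looks roughly like this (whether these pool options are
available depends on your exact release):

$ ceph osd pool set <pool> recovery_priority 5
$ ceph osd pool set <pool> recovery_op_priority 5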


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Flash for mon nodes ?

2017-06-21 Thread Osama Hasebou
Hi Guys, 

Has anyone used flash SSD drives for nodes hosting Monitor nodes only? 

If yes, any major benefits against just using SAS drives ? 

Thanks. 

Regards, 
Ossi 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph packages for Debian Stretch?

2017-06-21 Thread Fabian Grünbichler
On Wed, Jun 21, 2017 at 05:30:02PM +0900, Christian Balzer wrote:
> 
> Hello,
> 
> On Wed, 21 Jun 2017 09:47:08 +0200 (CEST) Alexandre DERUMIER wrote:
> 
> > Hi,
> > 
> > Proxmox is maintening a ceph-luminous repo for stretch
> > 
> > http://download.proxmox.com/debian/ceph-luminous/
> > 
> > 
> > git is here, with patches and modifications to get it work
> > https://git.proxmox.com/?p=ceph.git;a=summary
> >
> While this is probably helpful for the changes needed, my quest is for
> Jewel (really all supported builds) for Stretch.
> And not whenever Luminous gets released, but within the next 10 days.

I think you should be able to just backport the needed commits from
http://tracker.ceph.com/issues/19884 on top of v10.2.7, bump the version
in debian/changelog and use dpkg-buildpackage (or wrapper of your
choice) to rebuild the packages. Building takes a while though ;)
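
Roughly (the commits to cherry-pick are the ones referenced in the tracker
issue above, and the version string is only an example):

$ git clone --recursive https://github.com/ceph/ceph.git && cd ceph
$ git checkout v10.2.7
$ git cherry-pick <commit-from-19884>        # repeat for each backported fix
$ ./install-deps.sh
$ dch -v 10.2.7-1~stretch1 "rebuild for stretch with backported build fixes"
$ dpkg-buildpackage -us -uc -j$(nproc)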

Alternatively use the slightly outdated stock Debian packages (based on
10.2.5 with slightly deviating packaging and the patches in [1]) and
switch over to the official ones when they are available.

1: 
https://anonscm.debian.org/cgit/pkg-ceph/ceph.git/tree/debian/patches?id=7e85745cc7aece92e8f2e505285d451ec2210afa

> 
> Though clearly that's not going to happen, oh well.

Mismatched schedules between yourself and upstream can be cumbersome -
but at least in case of FLOSS you can always take matters into your own
hands and roll your own if the need is big enough ;)

> 
> Christian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph packages for Debian Stretch?

2017-06-21 Thread Christian Balzer

Hello,

On Wed, 21 Jun 2017 09:47:08 +0200 (CEST) Alexandre DERUMIER wrote:

> Hi,
> 
> Proxmox is maintening a ceph-luminous repo for stretch
> 
> http://download.proxmox.com/debian/ceph-luminous/
> 
> 
> git is here, with patches and modifications to get it work
> https://git.proxmox.com/?p=ceph.git;a=summary
>
While this is probably helpful for the changes needed, my quest is for
Jewel (really all supported builds) for Stretch.
And not whenever Luminous gets released, but within the next 10 days.

Though clearly that's not going to happen, oh well.

Christian
 
> 
> 
> - Mail original -
> De: "Alfredo Deza" 
> À: "Christian Balzer" 
> Cc: "ceph-users" 
> Envoyé: Mardi 20 Juin 2017 18:54:05
> Objet: Re: [ceph-users] Ceph packages for Debian Stretch?
> 
> On Mon, Jun 19, 2017 at 8:25 PM, Christian Balzer  wrote: 
> > 
> > Hello, 
> > 
> > can we have the status, projected release date of the Ceph packages for 
> > Debian Stretch?   
> 
> We don't have anything yet as a projected release date. 
> 
> The current status is that this has not been prioritized. I anticipate 
> that this will not be hard to accommodate in our repositories but 
> it will require quite the effort to add in all of our tooling. 
> 
> In case anyone would like to help us out before the next stable 
> release, these are places that would need to be updated for "stretch" 
> 
> https://github.com/ceph/ceph-build/tree/master/ceph-build 
> https://github.com/ceph/chacra 
> 
> "grepping" for "jessie" should indicate every spot that might need to 
> be updated. 
> 
> I am happy to review and answer questions to get these changes in! 
> 
> 
> > 
> > Christian 
> > -- 
> > Christian Balzer Network/Systems Engineer 
> > ch...@gol.com Rakuten Communications 
> > ___ 
> > ceph-users mailing list 
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com   
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph packages for Debian Stretch?

2017-06-21 Thread Alexandre DERUMIER
Hi,

Proxmox is maintening a ceph-luminous repo for stretch

http://download.proxmox.com/debian/ceph-luminous/


git is here, with patches and modifications to get it work
https://git.proxmox.com/?p=ceph.git;a=summary



- Mail original -
De: "Alfredo Deza" 
À: "Christian Balzer" 
Cc: "ceph-users" 
Envoyé: Mardi 20 Juin 2017 18:54:05
Objet: Re: [ceph-users] Ceph packages for Debian Stretch?

On Mon, Jun 19, 2017 at 8:25 PM, Christian Balzer  wrote: 
> 
> Hello, 
> 
> can we have the status, projected release date of the Ceph packages for 
> Debian Stretch? 

We don't have anything yet as a projected release date. 

The current status is that this has not been prioritized. I anticipate 
that this will not be hard to accommodate in our repositories but 
it will require quite the effort to add in all of our tooling. 

In case anyone would like to help us out before the next stable 
release, these are places that would need to be updated for "stretch" 

https://github.com/ceph/ceph-build/tree/master/ceph-build 
https://github.com/ceph/chacra 

"grepping" for "jessie" should indicate every spot that might need to 
be updated. 

I am happy to review and answer questions to get these changes in! 


> 
> Christian 
> -- 
> Christian Balzer Network/Systems Engineer 
> ch...@gol.com Rakuten Communications 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs

2017-06-21 Thread Piotr Dałek

On 17-06-14 03:44 PM, Sage Weil wrote:

On Wed, 14 Jun 2017, Paweł Sadowski wrote:

On 04/13/2017 04:23 PM, Piotr Dałek wrote:

On 04/06/2017 03:25 PM, Sage Weil wrote:

On Thu, 6 Apr 2017, Piotr Dałek wrote:

[snip]


I think the solution here is to use sparse_read during recovery.  The
PushOp data representation already supports it; it's just a matter of
skipping the zeros.  The recovery code could also have an option to
check
for fully-zero regions of the data and turn those into holes as
well.  For
ReplicatedBackend, see build_push_op().


So far it turns out that there's even easier solution, we just enabled
"filestore seek hole" on some test cluster and that seems to fix the
problem for us. We'll see if fiemap works too.



Is it safe to enable "filestore seek hole", are there any tests that
verifies that everything related to RBD works fine with this enabled?
Can we make this enabled by default?


We would need to enable it in the qa environment first.  The risk here is
that users run a broad range of kernels and we are exposing ourselves to
any bugs in any kernel version they may run.  I'd prefer to leave it off
by default.


Is that a common regression? If not, we could blacklist particular kernels and
call it a day.

 > We can enable it in the qa suite, though, which covers

centos7 (latest kernel) and ubuntu xenial and trusty.


+1. Do you need some particular PR for that?


I tested on few of our production images and it seems that about 30% is
sparse. This will be lost on any cluster wide event (add/remove nodes,
PG grow, recovery).

How this is/will be handled in BlueStore?


BlueStore exposes the same sparseness metadata that enabling the
filestore seek hole or fiemap options does, so it won't be a problem
there.

I think the only thing that we could potentially add is zero detection
on writes (so that explicitly writing zeros consumes no space).  We'd
have to be a bit careful measuring the performance impact of that check on
non-zero writes.


I saw that RBD (librbd) does that - replacing writes with discards when the
buffer contains only zeros. Some code that does the same in librados could
be added and it shouldn't impact performance much; the current implementation
of mem_is_zero is fast and shouldn't be a big problem.


--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com