[ceph-users] Glance client and RBD export checksum mismatch

2019-04-09 Thread Brayan Perera
Dear All,

Ceph Version : 12.2.5-2.ge988fb6.el7

We are facing an issue with glance configured with a Ceph backend: when
we try to create an instance or volume from an image, it throws a
checksum error.
When we use rbd export and run md5sum on the result, the value matches the glance checksum.
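For reference, that check was roughly:

   rbd export images/<image_id>@snap - | md5sum

(the 'images' pool and 'snap' snapshot names match the script below; the
image ID is a placeholder).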

When we use the following script, it produces the same erroneous checksum that glance computes.

We have used the images below for testing.
1. Failing image (checksum mismatch): ffed4088-74e1-4f22-86cb-35e7e97c377c
2. Passing image (checksum identical): c048f0f9-973d-4285-9397-939251c80a84

Output from storage node:

1. Failing image: ffed4088-74e1-4f22-86cb-35e7e97c377c
checksum from glance database: 34da2198ec7941174349712c6d2096d8
[root@storage01moc ~]# python test_rbd_format.py ffed4088-74e1-4f22-86cb-35e7e97c377c admin
Image size: 681181184
checksum from ceph: b82d85ae5160a7b74f52be6b5871f596
Remarks: checksum is different

2. Passing image: c048f0f9-973d-4285-9397-939251c80a84
checksum from glance database: 4f977f748c9ac2989cff32732ef740ed
[root@storage01moc ~]# python test_rbd_format.py c048f0f9-973d-4285-9397-939251c80a84 admin
Image size: 1411121152
checksum from ceph: 4f977f748c9ac2989cff32732ef740ed
Remarks: checksum is identical

We are wondering whether this issue comes from the Ceph Python libraries or from Ceph itself.

Please note that we do not have ceph pool tiering configured.

Please let us know whether anyone has faced a similar issue, and whether there are any known fixes.

test_rbd_format.py
===
import rados, sys, rbd

image_id = sys.argv[1]
try:
    rados_id = sys.argv[2]
except:
    rados_id = 'openstack'


class ImageIterator(object):
    """
    Reads data from an RBD image, one chunk at a time.
    """

    def __init__(self, conn, pool, name, snapshot, store, chunk_size='8'):
        self.pool = pool
        self.conn = conn
        self.name = name
        self.snapshot = snapshot
        self.chunk_size = chunk_size
        self.store = store

    def __iter__(self):
        try:
            with conn.open_ioctx(self.pool) as ioctx:
                with rbd.Image(ioctx, self.name,
                               snapshot=self.snapshot) as image:
                    img_info = image.stat()
                    size = img_info['size']
                    bytes_left = size
                    while bytes_left > 0:
                        length = min(self.chunk_size, bytes_left)
                        data = image.read(size - bytes_left, length)
                        bytes_left -= len(data)
                        yield data
                    raise StopIteration()
        except rbd.ImageNotFound:
            raise exceptions.NotFound(
                _('RBD image %s does not exist') % self.name)


conn = rados.Rados(conffile='/etc/ceph/ceph.conf', rados_id=rados_id)
conn.connect()


with conn.open_ioctx('images') as ioctx:
    try:
        with rbd.Image(ioctx, image_id,
                       snapshot='snap') as image:
            img_info = image.stat()
            print "Image size: %s " % img_info['size']
            iter, size = (ImageIterator(conn, 'images', image_id,
                                        'snap', 'rbd'), img_info['size'])
            import six, hashlib
            md5sum = hashlib.md5()
            for chunk in iter:
                if isinstance(chunk, six.string_types):
                    chunk = six.b(chunk)
                md5sum.update(chunk)
            md5sum = md5sum.hexdigest()
            print "checksum from ceph: " + md5sum
    except:
        raise
===


Thank You !

-- 
Best Regards,
Brayan Perera
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to trigger offline filestore merge

2019-04-09 Thread Dan van der Ster
Hi again,

Thanks to a hint from another user I seem to have gotten past this.

The trick was to restart the OSDs with a positive merge threshold (10)
and then cycle rados bench several hundred times, e.g.

   while true ; do rados bench -p default.rgw.buckets.index 10 write -b 4096 -t 128; sleep 5 ; done
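
For reference, "positive merge threshold" here means something like the
following in ceph.conf before restarting the OSDs:

   [osd]
   filestore merge threshold = 10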

After running that for a while, the PG filestore directory structure merged
down, and now listing the pool and backfilling are back to normal.

Thanks!

Dan


On Tue, Apr 9, 2019 at 7:05 PM Dan van der Ster  wrote:
>
> Hi all,
>
> We have a slight issue while trying to migrate a pool from filestore
> to bluestore.
>
> This pool used to have 20 million objects in filestore -- it now has
> 50,000. During its life, the filestore pgs were internally split
> several times, but never merged. Now the pg _head dirs have mostly
> empty directories.
> This creates some problems:
>
>   1. rados ls -p  hangs a long time, eventually triggering slow
> requests while the filestore_op threads time out. (They time out while
> listing the collections).
>   2. backfilling from these PGs is impossible, similarly because
> listing the objects to backfill eventually leads to the osd flapping.
>
> So I want to merge the filestore pgs.
>
> I tried ceph-objectstore-tool --op apply-layout-settings, but it seems
> that this only splits, not merges?
>
> Does someone have a better idea?
>
> Thanks!
>
> Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS-Ganesha Mounts as a Read-Only Filesystem

2019-04-09 Thread Paul Emmerich
Looks like you are trying to write to the pseudo-root, mount /cephfs
instead of /.
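I.e. something along the lines of:

   mount -t nfs -o nfsvers=4.1,proto=tcp 172.16.32.15:/cephfs /mnt/cephfs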

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Apr 6, 2019 at 1:07 PM  wrote:
>
> Hi all,
>
>
>
> I have recently set up a Ceph cluster and, on request, am using CephFS (MDS 
> version: ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)) as a backend for NFS-Ganesha. I have successfully tested a direct 
> CephFS mount to read/write files; however, I'm perplexed as to why the NFS 
> mount behaves as read-only despite setting the RW flags.
>
>
>
> [root@mon02 mnt]# touch cephfs/test.txt
>
> touch: cannot touch ‘cephfs/test.txt’: Read-only file system
>
>
>
> Configuration of Ganesha is below:
>
>
>
> NFS_CORE_PARAM
>
> {
>
>   Enable_NLM = false;
>
>   Enable_RQUOTA = false;
>
>   Protocols = 4;
>
> }
>
>
>
> NFSv4
>
> {
>
>   Delegations = true;
>
>   RecoveryBackend = rados_ng;
>
>   Minor_Versions =  1,2;
>
> }
>
>
>
> CACHEINODE {
>
>   Dir_Chunk = 0;
>
>   NParts = 1;
>
>   Cache_Size = 1;
>
> }
>
>
>
> EXPORT
>
> {
>
> Export_ID = 15;
>
> Path = "/";
>
> Pseudo = "/cephfs/";
>
> Access_Type = RW;
>
> NFS_Protocols = "4";
>
> Squash = No_Root_Squash;
>
> Transport_Protocols = TCP;
>
> SecType = "none";
>
> Attr_Expiration_Time = 0;
>
> Delegations = R;
>
>
>
> FSAL {
>
> Name = CEPH;
>
>  User_Id = "ganesha";
>
>  Filesystem = "cephfs";
>
>  Secret_Access_Key = "";
>
> }
>
> }
>
>
>
>
>
> Provided mount parameters:
>
>
>
> mount -t nfs -o nfsvers=4.1,proto=tcp,rw,noatime,sync 172.16.32.15:/ /mnt/cephfs
>
>
>
> I have tried stripping much of the config and altering mount options, but so 
> far I have been completely unable to decipher the cause. It also seems I'm not 
> the only one who has been caught by this:
>
>
>
> https://www.spinics.net/lists/ceph-devel/msg41201.html
>
>
>
> Thanks in advance,
>
>
>
> Thomas
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PGs caused by omap_digest mismatch

2019-04-09 Thread Bryan Stillwell

> On Apr 8, 2019, at 5:42 PM, Bryan Stillwell  wrote:
> 
> 
>> On Apr 8, 2019, at 4:38 PM, Gregory Farnum  wrote:
>> 
>> On Mon, Apr 8, 2019 at 3:19 PM Bryan Stillwell  
>> wrote:
>>> 
>>> There doesn't appear to be any correlation between the OSDs which would 
>>> point to a hardware issue, and since it's happening on two different 
>>> clusters I'm wondering if there's a race condition that has been fixed in a 
>>> later version?
>>> 
>>> Also, what exactly is the omap digest?  From what I can tell it appears to 
>>> be some kind of checksum for the omap data.  Is that correct?
>> 
>> Yeah; it's just a crc over the omap key-value data that's checked
>> during deep scrub. Same as the data digest.
>> 
>> I've not noticed any issues around this in Luminous but I probably
>> wouldn't have, so will have to leave it up to others if there are
>> fixes in since 12.2.8.
> 
> Thanks for adding some clarity to that Greg!
> 
> For some added information, this is what the logs reported earlier today:
> 
> 2019-04-08 11:46:15.610169 osd.504 osd.504 10.16.10.30:6804/8874 33 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
> 2019-04-08 11:46:15.610190 osd.504 osd.504 10.16.10.30:6804/8874 34 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x26a1241b != omap_digest 0x4c10ee76 from shard 504
> 
> I then tried deep scrubbing it again to see if the data was fine, but the 
> digest calculation was just having problems.  It came back with the same 
> problem with new digest values:
> 
> 2019-04-08 15:56:21.186291 osd.504 osd.504 10.16.10.30:6804/8874 49 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
> 2019-04-08 15:56:21.186313 osd.504 osd.504 10.16.10.30:6804/8874 50 : cluster 
> [ERR] 7.3 : soid 7:c09d46a1:::.dir.default.22333615.1861352:head omap_digest 
> 0x93bac8f != omap_digest 0xab1b9c6f from shard 504
> 
> Which makes sense, but doesn’t explain why the omap data is getting out of 
> sync across multiple OSDs and clusters…
> 
> I’ll see what I can figure out tomorrow, but if anyone else has some hints I 
> would love to hear them.

I’ve dug into this more today and it appears that the omap data contains an 
extra entry on the OSDs with the mismatched omap digests.  I then searched the 
RGW logs and found that a DELETE happened shortly after the OSD booted, but the 
omap data wasn’t updated on that OSD so it became mismatched.
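
In case it's useful to others: one way to do that comparison is to stop the
OSD, dump the omap keys of the object from each replica, and diff them.
Roughly something like this, with the data path, PG, and object name adjusted:

   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-504 \
       --pgid 7.3 '.dir.default.22333615.1861352' list-omap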

Here’s a timeline of the events which caused PG 7.9 to become inconsistent:

2019-04-04 14:37:34 - osd.492 marked itself down
2019-04-04 14:40:35 - osd.492 boot
2019-04-04 14:41:55 - DELETE call happened
2019-04-08 12:06:14 - omap_digest mismatch detected (pg 7.9 is 
active+clean+inconsistent, acting [492,546,523])

Here’s the timeline for PG 7.2b:

2019-04-03 13:54:17 - osd.488 marked itself down
2019-04-03 13:59:27 - osd.488 boot
2019-04-03 14:00:54 - DELETE call happened
2019-04-08 12:42:21 - omap_digest mismatch detected (pg 7.2b is 
active+clean+inconsistent, acting [488,511,541])

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] showing active config settings

2019-04-09 Thread solarflow99
I noticed that when changing some settings, they appear to stay the same; for
example, when trying to set this higher:

ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'

It gives the usual warning that a restart may be needed, but it still shows the
old value:

# ceph --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3
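
(Checking directly on the daemon via the admin socket, e.g.

# ceph daemon osd.0 config get osd_recovery_max_active

might show the runtime value, but I'd still like to understand the behaviour
of --show-config here.)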


Restarting the OSDs seems fairly intrusive for every configuration change.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] problems with pg down

2019-04-09 Thread ceph
Hi Fabio,
Did you resolve the issue?

A bit late, I know, but did you try to restart OSD 14? If 102 and 121 are
fine, I would also try to crush reweight osd.14 to 0.
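E.g. something like:

ceph osd crush reweight osd.14 0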

Greetings
Mehmet 

Am 10. März 2019 19:26:57 MEZ schrieb Fabio Abreu :
>Hi Darius,
>
>Thanks for your reply !
>
>This happened after a disaster with a SATA storage node; the osds 102 and
>121 are up.
>
>The information below is the osd 14 log. Do you recommend marking this osd
>out of the cluster?
>
>2019-03-10 17:36:17.654134 7f1991163700  0 -- 172.16.184.90:6800/589935
>>>
>:/0 pipe(0x555be7808800 sd=516 :6800 s=0 pgs=0 cs=0 l=0
>c=0x555be6720400).accept failed to getpeername (107) Transport endpoint
>is
>not connected
>2019-03-10 17:36:17.654660 7f1992d7f700  0 -- 172.16.184.90:6800/589935
>>>
>:/0 pipe(0x555be773f400 sd=536 :6800 s=0 pgs=0 cs=0 l=0
>c=0x555be6720700).accept failed to getpeername (107) Transport endpoint
>is
>not connected
>2019-03-10 17:36:17.654720 7f1993a8c700  0 -- 172.16.184.90:6800/589935
>>>
>172.16.184.92:6801/102 pipe(0x555be7807400 sd=542 :6800 s=0 pgs=0
>cs=0
>l=0 c=0x555be6720280).accept connect_seq 0 vs existing 0 state wait
>2019-03-10 17:36:17.654813 7f199095b700  0 -- 172.16.184.90:6800/589935
>>>
>:/0 pipe(0x555be6d8e000 sd=537 :6800 s=0 pgs=0 cs=0 l=0
>c=0x555be671ff80).accept failed to getpeername (107) Transport endpoint
>is
>not connected
>2019-03-10 17:36:17.654847 7f1992476700  0 -- 172.16.184.90:6800/589935
>>>
>172.16.184.95:6840/1537112 pipe(0x555be773e000 sd=533 :6800 s=0 pgs=0
>cs=0
>l=0 c=0x555be671fc80).accept connect_seq 0 vs existing 0 state wait
>2019-03-10 17:36:17.655252 7f1993486700  0 -- 172.16.184.90:6800/589935
>>>
>172.16.184.92:6832/1098862 pipe(0x555be779f400 sd=521 :6800 s=0 pgs=0
>cs=0
>l=0 c=0x555be6242d00).accept connect_seq 0 vs existing 0 state wait
>2019-03-10 17:36:17.655315 7f1993284700  0 -- 172.16.184.90:6800/589935
>>>
>:/0 pipe(0x555be6d90800 sd=523 :6800 s=0 pgs=0 cs=0 l=0
>c=0x555be6720880).accept failed to getpeername (107) Transport endpoint
>is
>not connected
>2019-03-10 17:36:17.655814 7f1992173700  0 -- 172.16.184.90:6800/589935
>>>
>172.16.184.91:6833/316673 pipe(0x555be7740800 sd=527 :6800 s=0 pgs=0
>cs=0
>l=0 c=0x555be6720580).accept connect_seq 0 vs existing 0 state wait
>
>Regards,
>Fabio Abreu
>
>On Sun, Mar 10, 2019 at 3:20 PM Darius Kasparavičius 
>wrote:
>
>> Hi,
>>
>> Check your osd.14 logs for information its currently stuck and not
>> providing io for replication. And what happened to OSD's 102 121?
>>
>> On Sun, Mar 10, 2019 at 7:44 PM Fabio Abreu
>
>> wrote:
>> >
>> > Hi Everybody .
>> >
>> > I have a pg in down+peering state with blocked requests impacting my
>> > pg query. I can't find the osd to apply the lost parameter.
>> >
>> >
>>
>http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure
>> >
>> > Did someone have the same scenario with a pg in the down state?
>> >
>> > Storage :
>> >
>> > 100 ops are blocked > 262.144 sec on osd.14
>> >
>> > root@monitor:~# ceph pg dump_stuck inactive
>> > ok
>> > pg_stat state   up  up_primary  acting  acting_primary
>> > 5.6e0   down+remapped+peering   [102,121,14]102 [14]14
>> >
>> >
>> > root@monitor:~# ceph -s
>> > cluster xxx
>> >  health HEALTH_ERR
>> > 1 pgs are stuck inactive for more than 300 seconds
>> > 223 pgs backfill_wait
>> > 14 pgs backfilling
>> > 215 pgs degraded
>> > 1 pgs down
>> > 1 pgs peering
>> > 1 pgs recovering
>> > 53 pgs recovery_wait
>> > 199 pgs stuck degraded
>> > 1 pgs stuck inactive
>> > 278 pgs stuck unclean
>> > 162 pgs stuck undersized
>> > 162 pgs undersized
>> > 100 requests are blocked > 32 sec
>> > recovery 2767660/317878237 objects degraded (0.871%)
>> > recovery 7484106/317878237 objects misplaced (2.354%)
>> > recovery 29/105009626 unfoun
>> >
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Fabio Abreu Reis
>> > http://fajlinux.com.br
>> > Tel : +55 21 98244-0161
>> > Skype : fabioabreureis
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>-- 
>Atenciosamente,
>Fabio Abreu Reis
>http://fajlinux.com.br
>*Tel : *+55 21 98244-0161
>*Skype : *fabioabreureis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-09 Thread Francois Lafont

On 4/9/19 12:43 PM, Francois Lafont wrote:


2. In my Docker container context, is it possible to put the logs above in the file 
"/var/log/syslog" of my host, in other words is it possible to make sure to log this in 
stdout of the daemon "radosgw"?


In brief, is it possible to log "operations" to a regular file or, better for me,
to stdout?


--
flaf
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
Good point, thanks !

By creating memory pressure (playing with vm.min_free_kbytes), memory
is freed by the kernel.
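(Roughly: temporarily raising it, e.g. `sysctl -w vm.min_free_kbytes=4194304`;
the value here is only an example.)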

So I think I essentially need to update the monitoring rules, to avoid
false positives.

Thanks, I'll continue to read your resources.


Le mardi 09 avril 2019 à 09:30 -0500, Mark Nelson a écrit :
> My understanding is that basically the kernel is either unable or 
> uninterested (maybe due to lack of memory pressure?) in reclaiming
> the 
> memory .  It's possible you might have better behavior if you set 
> /sys/kernel/mm/khugepaged/max_ptes_none to a low value (maybe 0) or 
> maybe disable transparent huge pages entirely.
> 
> 
> Some background:
> 
> https://github.com/gperftools/gperftools/issues/1073
> 
> https://blog.nelhage.com/post/transparent-hugepages/
> 
> https://www.kernel.org/doc/Documentation/vm/transhuge.txt
> 
> 
> Mark
> 
> 
> On 4/9/19 7:31 AM, Olivier Bonvalet wrote:
> > Well, Dan seems to be right :
> > 
> > _tune_cache_size
> >  target: 4294967296
> >heap: 6514409472
> >unmapped: 2267537408
> >  mapped: 4246872064
> > old cache_size: 2845396873
> > new cache size: 2845397085
> > 
> > 
> > So we have 6GB in heap, but "only" 4GB mapped.
> > 
> > But "ceph tell osd.* heap release" should had release that ?
> > 
> > 
> > Thanks,
> > 
> > Olivier
> > 
> > 
> > Le lundi 08 avril 2019 à 16:09 -0500, Mark Nelson a écrit :
> > > One of the difficulties with the osd_memory_target work is that
> > > we
> > > can't
> > > tune based on the RSS memory usage of the process. Ultimately
> > > it's up
> > > to
> > > the kernel to decide to reclaim memory and especially with
> > > transparent
> > > huge pages it's tough to judge what the kernel is going to do
> > > even
> > > if
> > > memory has been unmapped by the process.  Instead the autotuner
> > > looks
> > > at
> > > how much memory has been mapped and tries to balance the caches
> > > based
> > > on
> > > that.
> > > 
> > > 
> > > In addition to Dan's advice, you might also want to enable debug
> > > bluestore at level 5 and look for lines containing "target:" and
> > > "cache_size:".  These will tell you the current target, the
> > > mapped
> > > memory, unmapped memory, heap size, previous aggregate cache
> > > size,
> > > and
> > > new aggregate cache size.  The other line will give you a break
> > > down
> > > of
> > > how much memory was assigned to each of the bluestore caches and
> > > how
> > > much each case is using.  If there is a memory leak, the
> > > autotuner
> > > can
> > > only do so much.  At some point it will reduce the caches to fit
> > > within
> > > cache_min and leave it there.
> > > 
> > > 
> > > Mark
> > > 
> > > 
> > > On 4/8/19 5:18 AM, Dan van der Ster wrote:
> > > > Which OS are you using?
> > > > With CentOS we find that the heap is not always automatically
> > > > released. (You can check the heap freelist with `ceph tell
> > > > osd.0
> > > > heap
> > > > stats`).
> > > > As a workaround we run this hourly:
> > > > 
> > > > ceph tell mon.* heap release
> > > > ceph tell osd.* heap release
> > > > ceph tell mds.* heap release
> > > > 
> > > > -- Dan
> > > > 
> > > > On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
> > > > ceph.l...@daevel.fr> wrote:
> > > > > Hi,
> > > > > 
> > > > > on a Luminous 12.2.11 deploiement, my bluestore OSD exceed
> > > > > the
> > > > > osd_memory_target :
> > > > > 
> > > > > daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
> > > > > ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29
> > > > > 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29
> > > > > 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29
> > > > > 1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29
> > > > > 2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29
> > > > > 1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29
> > > > > 1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29
> > > > > 1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29
> > > > > 1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --
> > > > > setuser
> > > > > ceph --setgroup ceph
> > > > > 
> > > > > daevel-ob@ssdr712h:~$ free -m
> > > > 

[ceph-users] how to trigger offline filestore merge

2019-04-09 Thread Dan van der Ster
Hi all,

We have a slight issue while trying to migrate a pool from filestore
to bluestore.

This pool used to have 20 million objects in filestore -- it now has
50,000. During its life, the filestore pgs were internally split
several times, but never merged. Now the pg _head dirs have mostly
empty directories.
This creates some problems:

  1. rados ls -p  hangs a long time, eventually triggering slow
requests while the filestore_op threads time out. (They time out while
listing the collections).
  2. backfilling from these PGs is impossible, similarly because
listing the objects to backfill eventually leads to the osd flapping.

So I want to merge the filestore pgs.

I tried ceph-objectstore-tool --op apply-layout-settings, but it seems
that this only splits, not merges?

Does someone have a better idea?

Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Remove RBD mirror?

2019-04-09 Thread Jason Dillaman
Can you pastebin the results from running the following on your backup
site rbd-mirror daemon node?

ceph --admin-socket /path/to/asok config set debug_rbd_mirror 15
ceph --admin-socket /path/to/asok rbd mirror restart nova
 (wait a minute to let some logs accumulate ...)
ceph --admin-socket /path/to/asok config set debug_rbd_mirror 0/5

... and collect the rbd-mirror log from /var/log/ceph/ (it should have
lots of "rbd::mirror"-like log entries).


On Tue, Apr 9, 2019 at 12:23 PM Magnus Grönlund  wrote:
>
>
>
> Den tis 9 apr. 2019 kl 17:48 skrev Jason Dillaman :
>>
>> Any chance your rbd-mirror daemon has the admin sockets available
>> (defaults to /var/run/ceph/cephdr-clientasok)? If
>> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".
>
>
> {
> "pool_replayers": [
> {
> "pool": "glance",
> "peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00 cluster: 
> production client: client.productionbackup",
> "instance_id": "869081",
> "leader_instance_id": "869081",
> "leader": true,
> "instances": [],
> "local_cluster_admin_socket": 
> "/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
> "remote_cluster_admin_socket": 
> "/var/run/ceph/client.productionbackup.1936211.production.9422567521.asok",
> "sync_throttler": {
> "max_parallel_syncs": 5,
> "running_syncs": 0,
> "waiting_syncs": 0
> },
> "image_replayers": [
> {
> "name": "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
> "state": "Replaying"
> },
> {
> "name": "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
> "state": "Replaying"
> },
> ---cut--
> {
> "name": 
> "cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
> "state": "Replaying"
> }
> ]
> },
>  {
> "pool": "nova",
> "peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702 cluster: 
> production client: client.productionbackup",
> "instance_id": "889074",
> "leader_instance_id": "889074",
> "leader": true,
> "instances": [],
> "local_cluster_admin_socket": 
> "/var/run/ceph/client.backup.1936211.backup.94225678548048.asok",
> "remote_cluster_admin_socket": 
> "/var/run/ceph/client.productionbackup.1936211.production.94225679621728.asok",
> "sync_throttler": {
> "max_parallel_syncs": 5,
> "running_syncs": 0,
> "waiting_syncs": 0
> },
> "image_replayers": []
> }
> ],
> "image_deleter": {
> "image_deleter_status": {
> "delete_images_queue": [
> {
> "local_pool_id": 3,
> "global_image_id": "ff531159-de6f-4324-a022-50c079dedd45"
> }
> ],
> "failed_deletes_queue": []
> }
>>
>>
>> On Tue, Apr 9, 2019 at 11:26 AM Magnus Grönlund  wrote:
>> >
>> >
>> >
>> > Den tis 9 apr. 2019 kl 17:14 skrev Jason Dillaman :
>> >>
>> >> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund  
>> >> wrote:
>> >> >
>> >> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund  
>> >> > >wrote:
>> >> > >>
>> >> > >> Hi,
>> >> > >> We have configured one-way replication of pools between a production 
>> >> > >> cluster and a backup cluster. But unfortunately the rbd-mirror or 
>> >> > >> the backup cluster is unable to keep up with the production cluster 
>> >> > >> so the replication fails to reach replaying state.
>> >> > >
>> >> > >Hmm, it's odd that they don't at least reach the replaying state. Are
>> >> > >they still performing the initial sync?
>> >> >
>> >> > There are three pools we try to mirror, (glance, cinder, and nova, no 
>> >> > points for guessing what the cluster is used for :) ),
>> >> > the glance and cinder pools are smaller and sees limited write 
>> >> > activity, and the mirroring works, the nova pool which is the largest 
>> >> > and has 90% of the write activity never leaves the "unknown" state.
>> >> >
>> >> > # rbd mirror pool status cinder
>> >> > health: OK
>> >> > images: 892 total
>> >> > 890 replaying
>> >> > 2 stopped
>> >> > #
>> >> > # rbd mirror pool status nova
>> >> > health: WARNING
>> >> > images: 2479 total
>> >> > 2479 unknown
>> >> > #
>> >> > The production clsuter has 5k writes/s on average and the backup 
>> >> > cluster has 1-2k writes/s on average. The production cluster is bigger 
>> >> > and has better specs. I thought that the backup cluster would be able 
>> >> > to keep up but it looks like I was wrong.
>> >>
>> >> The fact that they are in the unknown state just means that 

Re: [ceph-users] Remove RBD mirror?

2019-04-09 Thread Magnus Grönlund
Den tis 9 apr. 2019 kl 17:48 skrev Jason Dillaman :

> Any chance your rbd-mirror daemon has the admin sockets available
> (defaults to /var/run/ceph/cephdr-clientasok)? If
> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".
>

{
"pool_replayers": [
{
"pool": "glance",
"peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00 cluster:
production client: client.productionbackup",
"instance_id": "869081",
"leader_instance_id": "869081",
"leader": true,
"instances": [],
"local_cluster_admin_socket":
"/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
"remote_cluster_admin_socket":
"/var/run/ceph/client.productionbackup.1936211.production.9422567521.asok",
"sync_throttler": {
"max_parallel_syncs": 5,
"running_syncs": 0,
"waiting_syncs": 0
},
"image_replayers": [
{
"name": "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
"state": "Replaying"
},
{
"name": "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
"state": "Replaying"
},
---cut--
{
"name":
"cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
"state": "Replaying"
}
]
},
 {
"pool": "nova",
"peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702 cluster:
production client: client.productionbackup",
"instance_id": "889074",
"leader_instance_id": "889074",
"leader": true,
"instances": [],
"local_cluster_admin_socket":
"/var/run/ceph/client.backup.1936211.backup.94225678548048.asok",
"remote_cluster_admin_socket":
"/var/run/ceph/client.productionbackup.1936211.production.94225679621728.asok",
"sync_throttler": {
"max_parallel_syncs": 5,
"running_syncs": 0,
"waiting_syncs": 0
},
"image_replayers": []
}
],
"image_deleter": {
"image_deleter_status": {
"delete_images_queue": [
{
"local_pool_id": 3,
"global_image_id":
"ff531159-de6f-4324-a022-50c079dedd45"
}
],
"failed_deletes_queue": []
}

>
> On Tue, Apr 9, 2019 at 11:26 AM Magnus Grönlund 
> wrote:
> >
> >
> >
> > Den tis 9 apr. 2019 kl 17:14 skrev Jason Dillaman :
> >>
> >> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund 
> wrote:
> >> >
> >> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund 
> wrote:
> >> > >>
> >> > >> Hi,
> >> > >> We have configured one-way replication of pools between a
> production cluster and a backup cluster. But unfortunately the rbd-mirror
> or the backup cluster is unable to keep up with the production cluster so
> the replication fails to reach replaying state.
> >> > >
> >> > >Hmm, it's odd that they don't at least reach the replaying state. Are
> >> > >they still performing the initial sync?
> >> >
> >> > There are three pools we try to mirror, (glance, cinder, and nova, no
> points for guessing what the cluster is used for :) ),
> >> > the glance and cinder pools are smaller and sees limited write
> activity, and the mirroring works, the nova pool which is the largest and
> has 90% of the write activity never leaves the "unknown" state.
> >> >
> >> > # rbd mirror pool status cinder
> >> > health: OK
> >> > images: 892 total
> >> > 890 replaying
> >> > 2 stopped
> >> > #
> >> > # rbd mirror pool status nova
> >> > health: WARNING
> >> > images: 2479 total
> >> > 2479 unknown
> >> > #
> >> > The production clsuter has 5k writes/s on average and the backup
> cluster has 1-2k writes/s on average. The production cluster is bigger and
> has better specs. I thought that the backup cluster would be able to keep
> up but it looks like I was wrong.
> >>
> >> The fact that they are in the unknown state just means that the remote
> >> "rbd-mirror" daemon hasn't started any journal replayers against the
> >> images. If it couldn't keep up, it would still report a status of
> >> "up+replaying". What Ceph release are you running on your backup
> >> cluster?
> >>
> > The backup cluster is running Luminous 12.2.11 (the production cluster
> 12.2.10)
> >
> >>
> >> > >> And the journals on the rbd volumes keep growing...
> >> > >>
> >> > >> Is it enough to simply disable the mirroring of the pool  (rbd
> mirror pool disable ) and that will remove the lagging reader from
> the journals and shrink them, or is there anything else that has to be done?
> >> > >
> >> > >You can either disable the journaling feature on the image(s) since
> >> > >there is no point to leave it on if you aren't using 

Re: [ceph-users] Remove RBD mirror?

2019-04-09 Thread Jason Dillaman
Any chance your rbd-mirror daemon has the admin sockets available
(defaults to /var/run/ceph/cephdr-clientasok)? If
so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".

On Tue, Apr 9, 2019 at 11:26 AM Magnus Grönlund  wrote:
>
>
>
> Den tis 9 apr. 2019 kl 17:14 skrev Jason Dillaman :
>>
>> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund  wrote:
>> >
>> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund  
>> > >wrote:
>> > >>
>> > >> Hi,
>> > >> We have configured one-way replication of pools between a production 
>> > >> cluster and a backup cluster. But unfortunately the rbd-mirror or the 
>> > >> backup cluster is unable to keep up with the production cluster so the 
>> > >> replication fails to reach replaying state.
>> > >
>> > >Hmm, it's odd that they don't at least reach the replaying state. Are
>> > >they still performing the initial sync?
>> >
>> > There are three pools we try to mirror, (glance, cinder, and nova, no 
>> > points for guessing what the cluster is used for :) ),
>> > the glance and cinder pools are smaller and sees limited write activity, 
>> > and the mirroring works, the nova pool which is the largest and has 90% of 
>> > the write activity never leaves the "unknown" state.
>> >
>> > # rbd mirror pool status cinder
>> > health: OK
>> > images: 892 total
>> > 890 replaying
>> > 2 stopped
>> > #
>> > # rbd mirror pool status nova
>> > health: WARNING
>> > images: 2479 total
>> > 2479 unknown
>> > #
>> > The production clsuter has 5k writes/s on average and the backup cluster 
>> > has 1-2k writes/s on average. The production cluster is bigger and has 
>> > better specs. I thought that the backup cluster would be able to keep up 
>> > but it looks like I was wrong.
>>
>> The fact that they are in the unknown state just means that the remote
>> "rbd-mirror" daemon hasn't started any journal replayers against the
>> images. If it couldn't keep up, it would still report a status of
>> "up+replaying". What Ceph release are you running on your backup
>> cluster?
>>
> The backup cluster is running Luminous 12.2.11 (the production cluster 
> 12.2.10)
>
>>
>> > >> And the journals on the rbd volumes keep growing...
>> > >>
>> > >> Is it enough to simply disable the mirroring of the pool  (rbd mirror 
>> > >> pool disable ) and that will remove the lagging reader from the 
>> > >> journals and shrink them, or is there anything else that has to be done?
>> > >
>> > >You can either disable the journaling feature on the image(s) since
>> > >there is no point to leave it on if you aren't using mirroring, or run
>> > >"rbd mirror pool disable " to purge the journals.
>> >
>> > Thanks for the confirmation.
>> > I will stop the mirror of the nova pool and try to figure out if there is 
>> > anything we can do to get the backup cluster to keep up.
>> >
>> > >> Best regards
>> > >> /Magnus
>> > >> ___
>> > >> ceph-users mailing list
>> > >> ceph-users@lists.ceph.com
>> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >
>> > >--
>> > >Jason
>>
>>
>>
>> --
>> Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Remove RBD mirror?

2019-04-09 Thread Magnus Grönlund
Den tis 9 apr. 2019 kl 17:14 skrev Jason Dillaman :

> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund 
> wrote:
> >
> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund 
> wrote:
> > >>
> > >> Hi,
> > >> We have configured one-way replication of pools between a production
> cluster and a backup cluster. But unfortunately the rbd-mirror or the
> backup cluster is unable to keep up with the production cluster so the
> replication fails to reach replaying state.
> > >
> > >Hmm, it's odd that they don't at least reach the replaying state. Are
> > >they still performing the initial sync?
> >
> > There are three pools we try to mirror, (glance, cinder, and nova, no
> points for guessing what the cluster is used for :) ),
> > the glance and cinder pools are smaller and sees limited write activity,
> and the mirroring works, the nova pool which is the largest and has 90% of
> the write activity never leaves the "unknown" state.
> >
> > # rbd mirror pool status cinder
> > health: OK
> > images: 892 total
> > 890 replaying
> > 2 stopped
> > #
> > # rbd mirror pool status nova
> > health: WARNING
> > images: 2479 total
> > 2479 unknown
> > #
> > The production clsuter has 5k writes/s on average and the backup cluster
> has 1-2k writes/s on average. The production cluster is bigger and has
> better specs. I thought that the backup cluster would be able to keep up
> but it looks like I was wrong.
>
> The fact that they are in the unknown state just means that the remote
> "rbd-mirror" daemon hasn't started any journal replayers against the
> images. If it couldn't keep up, it would still report a status of
> "up+replaying". What Ceph release are you running on your backup
> cluster?
>
> The backup cluster is running Luminous 12.2.11 (the production cluster
12.2.10)


> > >> And the journals on the rbd volumes keep growing...
> > >>
> > >> Is it enough to simply disable the mirroring of the pool  (rbd mirror
> pool disable ) and that will remove the lagging reader from the
> journals and shrink them, or is there anything else that has to be done?
> > >
> > >You can either disable the journaling feature on the image(s) since
> > >there is no point to leave it on if you aren't using mirroring, or run
> > >"rbd mirror pool disable " to purge the journals.
> >
> > Thanks for the confirmation.
> > I will stop the mirror of the nova pool and try to figure out if there
> is anything we can do to get the backup cluster to keep up.
> >
> > >> Best regards
> > >> /Magnus
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >--
> > >Jason
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Remove RBD mirror?

2019-04-09 Thread Jason Dillaman
On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund  wrote:
>
> >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund  wrote:
> >>
> >> Hi,
> >> We have configured one-way replication of pools between a production 
> >> cluster and a backup cluster. But unfortunately the rbd-mirror or the 
> >> backup cluster is unable to keep up with the production cluster so the 
> >> replication fails to reach replaying state.
> >
> >Hmm, it's odd that they don't at least reach the replaying state. Are
> >they still performing the initial sync?
>
> There are three pools we try to mirror, (glance, cinder, and nova, no points 
> for guessing what the cluster is used for :) ),
> the glance and cinder pools are smaller and sees limited write activity, and 
> the mirroring works, the nova pool which is the largest and has 90% of the 
> write activity never leaves the "unknown" state.
>
> # rbd mirror pool status cinder
> health: OK
> images: 892 total
> 890 replaying
> 2 stopped
> #
> # rbd mirror pool status nova
> health: WARNING
> images: 2479 total
> 2479 unknown
> #
> The production clsuter has 5k writes/s on average and the backup cluster has 
> 1-2k writes/s on average. The production cluster is bigger and has better 
> specs. I thought that the backup cluster would be able to keep up but it 
> looks like I was wrong.

The fact that they are in the unknown state just means that the remote
"rbd-mirror" daemon hasn't started any journal replayers against the
images. If it couldn't keep up, it would still report a status of
"up+replaying". What Ceph release are you running on your backup
cluster?

> >> And the journals on the rbd volumes keep growing...
> >>
> >> Is it enough to simply disable the mirroring of the pool  (rbd mirror pool 
> >> disable ) and that will remove the lagging reader from the journals 
> >> and shrink them, or is there anything else that has to be done?
> >
> >You can either disable the journaling feature on the image(s) since
> >there is no point to leave it on if you aren't using mirroring, or run
> >"rbd mirror pool disable " to purge the journals.
>
> Thanks for the confirmation.
> I will stop the mirror of the nova pool and try to figure out if there is 
> anything we can do to get the backup cluster to keep up.
>
> >> Best regards
> >> /Magnus
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >--
> >Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Remove RBD mirror?

2019-04-09 Thread Magnus Grönlund
>On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund  wrote:
>>
>> Hi,
>> We have configured one-way replication of pools between a production
cluster and a backup cluster. But unfortunately the rbd-mirror or the
backup cluster is unable to keep up with the production cluster so the
replication fails to reach replaying state.
>
>Hmm, it's odd that they don't at least reach the replaying state. Are
>they still performing the initial sync?

There are three pools we try to mirror, (glance, cinder, and nova, no
points for guessing what the cluster is used for :) ),
the glance and cinder pools are smaller and sees limited write activity,
and the mirroring works, the nova pool which is the largest and has 90% of
the write activity never leaves the "unknown" state.

# rbd mirror pool status cinder
health: OK
images: 892 total
890 replaying
2 stopped
#
# rbd mirror pool status nova
health: WARNING
images: 2479 total
2479 unknown
#
The production cluster has 5k writes/s on average and the backup cluster
has 1-2k writes/s on average. The production cluster is bigger and has
better specs. I thought that the backup cluster would be able to keep up
but it looks like I was wrong.

>> And the journals on the rbd volumes keep growing...
>>
>> Is it enough to simply disable the mirroring of the pool  (rbd mirror
pool disable ) and that will remove the lagging reader from the
journals and shrink them, or is there anything else that has to be done?
>
>You can either disable the journaling feature on the image(s) since
>there is no point to leave it on if you aren't using mirroring, or run
>"rbd mirror pool disable " to purge the journals.

Thanks for the confirmation.
I will stop the mirror of the nova pool and try to figure out if there is
anything we can do to get the backup cluster to keep up.

>> Best regards
>> /Magnus
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>--
>Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to tune Ceph RBD mirroring parameters to speed up replication

2019-04-09 Thread Jason Dillaman
On Thu, Apr 4, 2019 at 6:27 AM huxia...@horebdata.cn
 wrote:
>
> thanks a lot, Jason.
>
> how much performance loss should i expect by enabling rbd mirroring? I really 
> need to minimize any performance impact while using this disaster recovery 
> feature. Will a dedicated journal on Intel Optane NVMe help? If so, how big 
> the size should be?

The worst-case impact is effectively double the write latency and
bandwidth (since the librbd client needs to journal the IO first
before committing the actual changes to the image). I would definitely
recommend using a separate fast pool for the journal to minimum the
initial journal write latency hit. The librbd in-memory cache in
writeback mode can also help since it can help absorb the additional
latency since the write IO can be (effectively) immediately ACKed if
you have enough space in the cache.

> cheers,
>
> Samuel
>
> 
> huxia...@horebdata.cn
>
>
> From: Jason Dillaman
> Date: 2019-04-03 23:03
> To: huxia...@horebdata.cn
> CC: ceph-users
> Subject: Re: [ceph-users] How to tune Ceph RBD mirroring parameters to speed 
> up replication
> For better or worse, out of the box, librbd and rbd-mirror are
> configured to conserve memory at the expense of performance to support
> the potential case of thousands of images being mirrored and only a
> single "rbd-mirror" daemon attempting to handle the load.
>
> You can optimize writes by adding "rbd_journal_max_payload_bytes =
> 8388608" to the "[client]" section on the librbd client nodes.
> Normally, writes larger than 16KiB are broken into multiple journal
> entries to allow the remote "rbd-mirror" daemon to make forward
> progress w/o using too much memory, so this will ensure large IOs only
> require a single journal entry.
>
> You can also add "rbd_mirror_journal_max_fetch_bytes = 33554432" to
> the "[client]" section on the "rbd-mirror" daemon nodes and restart
> the daemon for the change to take effect. Normally, the daemon tries
> to nibble the per-image journal events to prevent excessive memory use
> in the case where potentially thousands of images are being mirrored.
>
> On Wed, Apr 3, 2019 at 4:34 PM huxia...@horebdata.cn
>  wrote:
> >
> > Hello, folks,
> >
> > I am setting up two ceph clusters to test async replication via RBD 
> > mirroring. The two clusters are very close, just in two buildings about 20m 
> > away, and the networking is very good as well, 10Gb Fiber connection. In 
> > this case, how should i tune the relevant RBD mirroring parameters to 
> > accelerate the replication?
> >
> > thanks in advance,
> >
> > Samuel
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Remove RBD mirror?

2019-04-09 Thread Jason Dillaman
On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund  wrote:
>
> Hi,
> We have configured one-way replication of pools between a production cluster 
> and a backup cluster. But unfortunately the rbd-mirror or the backup cluster 
> is unable to keep up with the production cluster so the replication fails to 
> reach replaying state.

Hmm, it's odd that they don't at least reach the replaying state. Are
they still performing the initial sync?

> And the journals on the rbd volumes keep growing...
>
> Is it enought to simply disable the mirroring of the pool  (rbd mirror pool 
> disable ) and that will remove the lagging reader from the journals and 
> shrink them, or is there any thing else that has to be done?

You can either disable the journaling feature on the image(s) since
there is no point to leave it on if you aren't using mirroring, or run
"rbd mirror pool disable " to purge the journals.

> Best regards
> /Magnus
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Remove RBD mirror?

2019-04-09 Thread Magnus Grönlund
Hi,
We have configured one-way replication of pools between a production
cluster and a backup cluster. But unfortunately the rbd-mirror or the
backup cluster is unable to keep up with the production cluster so the
replication fails to reach replaying state.
And the journals on the rbd volumes keep growing...

Is it enough to simply disable the mirroring of the pool (rbd mirror pool
disable ) so that the lagging reader is removed from the journals
and they shrink, or is there anything else that has to be done?

Best regards
/Magnus
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BADAUTHORIZER in Nautilus

2019-04-09 Thread Shawn Edwards
Update:

I think we have a work-around, but no root cause yet.

What is working is removing the 'v2' bits from the ceph.conf file across
the cluster, and turning off all cephx authentication.  Now everything
seems to be talking correctly other than some odd metrics around the edges.

Here's my current ceph.conf, running on all ceph hosts and clients:

[global]
fsid = 3f390b5e-2b1d-4a2f-ba00-
mon_host = [v1:10.36.9.43:6789/0] [v1:10.36.9.44:6789/0] [v1:10.36.9.45:6789/0]
auth_client_required = none
auth_cluster_required = none
auth_service_required = none

If we get better information as to what's going on, I'll post it here for
future reference.


On Thu, Apr 4, 2019 at 9:16 AM Sage Weil  wrote:

> On Thu, 4 Apr 2019, Shawn Edwards wrote:
> > It was disabled in a fit of genetic debugging.  I've now tried to revert
> > all config settings related to auth and signing to defaults.
> >
> > I can't seem to change the auth_*_required settings.  If I try to remove
> > them, they stay set.  If I try to change them, I get both the old and new
> > settings:
> >
> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
> > globaladvanced auth_client_required   cephx
> > *
> > globaladvanced auth_cluster_required  cephx
> > *
> > globaladvanced auth_service_required  cephx
> > *
> > root@tyr-ceph-mon0:~# ceph config rm global auth_service_required
> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
> > globaladvanced auth_client_required   cephx
> > *
> > globaladvanced auth_cluster_required  cephx
> > *
> > globaladvanced auth_service_required  cephx
> > *
> > root@tyr-ceph-mon0:~# ceph config set global auth_service_required none
> > root@tyr-ceph-mon0:~# ceph config dump | grep -E '(auth|cephx)'
> > globaladvanced auth_client_required   cephx
> > *
> > globaladvanced auth_cluster_required  cephx
> > *
> > globaladvanced auth_service_required  none
> >*
> > globaladvanced auth_service_required  cephx
> > *
> >
> > I know these are set to RO, but according to your blog posts, this means
> > they don't get updated until a daemon restart.  Does this look correct to
> > you?  I'm assuming I need to restart all daemons on all hosts.  Is this
> > correct?
>
> Yeah, that is definitely not behaving properly.  Can you try "ceph
> config-key dump | grep config/" to look at how those keys are stored?  You
> should see something like
>
> "config/auth_cluster_required": "cephx",
> "config/auth_service_required": "cephx",
> "config/auth_service_ticket_ttl": "3600.00",
>
> but maybe those names are formed differently, maybe with ".../global/..."
> in there?  My guess is a subtle naming behavior change between mimic or
> something.  You can remove the keys via the config-key interface and then
> restart the mons (or adjust any random config option) to make the
> mons refresh.  After that config dump should show the right thing.
>
> Maybe a disagreement/confusion about the actual value of
> auth_service_ticket_ttl is the cause of this.  You might try doing 'ceph
> config show osd.0' and/or a mon to see what value for the auth options the
> daemons are actually using and reporting...
>
> sage
>
>
> >
> > On Thu, Apr 4, 2019 at 5:54 AM Sage Weil  wrote:
> >
> > > That log shows
> > >
> > > 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: tick
> > > 2019-04-03 15:39:53.299 7f3733f18700 10 cephx: validate_tickets want 53
> > > have 53 need 0
> > > 2019-04-03 15:39:53.299 7f3733f18700 20 cephx client: need_tickets:
> > > want=53 have=53 need=0
> > > 2019-04-03 15:39:53.299 7f3733f18700 10 monclient: _check_auth_rotating
> > > have uptodate secrets (they expire after 2019-04-03 15:39:23.301595)
> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth: dump_rotating:
> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41691 A4Q== expires
> > > 2019-04-03 14:43:07.042860
> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41692 AD9Q== expires
> > > 2019-04-03 15:43:09.895511
> > > 2019-04-03 15:39:53.299 7f3733f18700 10 auth:  id 41693 ADQ== expires
> > > 2019-04-03 16:43:09.895511
> > >
> > > which is all fine.  It is getting BADAUTHORIZER talking to another OSD,
> > > but I'm guessing it's because that other OSD doesn't have the right
> > > tickets.  It's hard to tell what's wrong without having al the OSD logs
> > > 

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Mark Nelson
My understanding is that basically the kernel is either unable or 
uninterested (maybe due to lack of memory pressure?) in reclaiming the 
memory.  It's possible you might have better behavior if you set 
/sys/kernel/mm/khugepaged/max_ptes_none to a low value (maybe 0) or 
maybe disable transparent huge pages entirely.
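
A sketch of what that looks like (standard sysfs paths; persisting the change
across reboots is not covered here):

echo 0 > /sys/kernel/mm/khugepaged/max_ptes_none
echo never > /sys/kernel/mm/transparent_hugepage/enabled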



Some background:

https://github.com/gperftools/gperftools/issues/1073

https://blog.nelhage.com/post/transparent-hugepages/

https://www.kernel.org/doc/Documentation/vm/transhuge.txt


Mark


On 4/9/19 7:31 AM, Olivier Bonvalet wrote:

Well, Dan seems to be right :

_tune_cache_size
 target: 4294967296
   heap: 6514409472
   unmapped: 2267537408
 mapped: 4246872064
old cache_size: 2845396873
new cache size: 2845397085


So we have 6GB in heap, but "only" 4GB mapped.

But "ceph tell osd.* heap release" should had release that ?


Thanks,

Olivier


Le lundi 08 avril 2019 à 16:09 -0500, Mark Nelson a écrit :

One of the difficulties with the osd_memory_target work is that we
can't
tune based on the RSS memory usage of the process. Ultimately it's up
to
the kernel to decide to reclaim memory and especially with
transparent
huge pages it's tough to judge what the kernel is going to do even
if
memory has been unmapped by the process.  Instead the autotuner looks
at
how much memory has been mapped and tries to balance the caches based
on
that.


In addition to Dan's advice, you might also want to enable debug
bluestore at level 5 and look for lines containing "target:" and
"cache_size:".  These will tell you the current target, the mapped
memory, unmapped memory, heap size, previous aggregate cache size,
and
new aggregate cache size.  The other line will give you a break down
of
how much memory was assigned to each of the bluestore caches and how
much each case is using.  If there is a memory leak, the autotuner
can
only do so much.  At some point it will reduce the caches to fit
within
cache_min and leave it there.


Mark


On 4/8/19 5:18 AM, Dan van der Ster wrote:

Which OS are you using?
With CentOS we find that the heap is not always automatically
released. (You can check the heap freelist with `ceph tell osd.0
heap
stats`).
As a workaround we run this hourly:

ceph tell mon.* heap release
ceph tell osd.* heap release
ceph tell mds.* heap release

-- Dan

On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
ceph.l...@daevel.fr> wrote:

Hi,

on a Luminous 12.2.11 deploiement, my bluestore OSD exceed the
osd_memory_target :

daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29
1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser
ceph --setgroup ceph
ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29
1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser
ceph --setgroup ceph
ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29
1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser
ceph --setgroup ceph
ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29
2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser
ceph --setgroup ceph
ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29
1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser
ceph --setgroup ceph
ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29
1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser
ceph --setgroup ceph
ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29
1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser
ceph --setgroup ceph
ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29
1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser
ceph --setgroup ceph

daevel-ob@ssdr712h:~$ free -m
totalusedfree  shared  buff/ca
che   available
Mem:  47771   452101643  17 9
17   43556
Swap: 0   0   0

# ceph daemon osd.147 config show | grep memory_target
  "osd_memory_target": "4294967296",


And there is no recovery / backfilling, the cluster is fine :

 $ ceph status
   cluster:
 id: de035250-323d-4cf6-8c4b-cf0faf6296b1
 health: HEALTH_OK

   services:
 mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
 mgr: tsyne(active), standbys: olkas, tolriq, lorunde,
amphel
 osd: 120 osds: 116 up, 116 in

   data:
 pools:   20 pools, 12736 pgs
 objects: 15.29M objects, 31.1TiB
 usage:   101TiB used, 75.3TiB / 177TiB avail
 pgs: 12732 active+clean
  4 active+clean+scrubbing+deep

   io:
 client:   72.3MiB/s rd, 26.8MiB/s wr, 2.30kop/s rd,
1.29kop/s wr


 On an other host, in the same pool, I see also high memory
usage :

 daevel-ob@ssdr712g:~$ ps auxw | grep ceph-osd
 ceph6287  6.6 10.6 6027388 5190032 ? Ssl  mars21
1511:07 /usr/bin/ceph-osd -f --cluster ceph --id 131 

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
Well, Dan seems to be right :

_tune_cache_size
target: 4294967296
  heap: 6514409472
  unmapped: 2267537408
mapped: 4246872064
old cache_size: 2845396873
new cache size: 2845397085


So we have 6GB in heap, but "only" 4GB mapped.

But "ceph tell osd.* heap release" should had release that ?


Thanks,

Olivier


Le lundi 08 avril 2019 à 16:09 -0500, Mark Nelson a écrit :
> One of the difficulties with the osd_memory_target work is that we
> can't 
> tune based on the RSS memory usage of the process. Ultimately it's up
> to 
> the kernel to decide to reclaim memory and especially with
> transparent 
> huge pages it's tough to judge what the kernel is going to do even
> if 
> memory has been unmapped by the process.  Instead the autotuner looks
> at 
> how much memory has been mapped and tries to balance the caches based
> on 
> that.
> 
> 
> In addition to Dan's advice, you might also want to enable debug 
> bluestore at level 5 and look for lines containing "target:" and 
> "cache_size:".  These will tell you the current target, the mapped 
> memory, unmapped memory, heap size, previous aggregate cache size,
> and 
> new aggregate cache size.  The other line will give you a break down
> of 
> how much memory was assigned to each of the bluestore caches and how 
> much each case is using.  If there is a memory leak, the autotuner
> can 
> only do so much.  At some point it will reduce the caches to fit
> within 
> cache_min and leave it there.
> 
> 
> Mark
> 
> 
> On 4/8/19 5:18 AM, Dan van der Ster wrote:
> > Which OS are you using?
> > With CentOS we find that the heap is not always automatically
> > released. (You can check the heap freelist with `ceph tell osd.0
> > heap
> > stats`).
> > As a workaround we run this hourly:
> > 
> > ceph tell mon.* heap release
> > ceph tell osd.* heap release
> > ceph tell mds.* heap release
> > 
> > -- Dan
> > 
> > On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet <
> > ceph.l...@daevel.fr> wrote:
> > > Hi,
> > > 
> > > on a Luminous 12.2.11 deployment, my bluestore OSDs exceed the
> > > osd_memory_target :
> > > 
> > > daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
> > > ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29
> > > 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser
> > > ceph --setgroup ceph
> > > ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29
> > > 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser
> > > ceph --setgroup ceph
> > > ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29
> > > 1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser
> > > ceph --setgroup ceph
> > > ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29
> > > 2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser
> > > ceph --setgroup ceph
> > > ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29
> > > 1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser
> > > ceph --setgroup ceph
> > > ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29
> > > 1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser
> > > ceph --setgroup ceph
> > > ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29
> > > 1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser
> > > ceph --setgroup ceph
> > > ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29
> > > 1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser
> > > ceph --setgroup ceph
> > > 
> > > daevel-ob@ssdr712h:~$ free -m
> > >               total        used        free      shared  buff/cache   available
> > > Mem:          47771       45210        1643          17         917       43556
> > > Swap:             0           0           0
> > > 
> > > # ceph daemon osd.147 config show | grep memory_target
> > >  "osd_memory_target": "4294967296",
> > > 
> > > 
> > > And there is no recovery / backfilling, the cluster is fine :
> > > 
> > > $ ceph status
> > >   cluster:
> > > id: de035250-323d-4cf6-8c4b-cf0faf6296b1
> > > health: HEALTH_OK
> > > 
> > >   services:
> > > mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
> > > mgr: tsyne(active), standbys: olkas, tolriq, lorunde,
> > > amphel
> > > osd: 120 osds: 116 up, 116 in
> > > 
> > >   data:
> > > pools:   20 pools, 12736 pgs
> > > objects: 15.29M objects, 31.1TiB
> > > usage:   101TiB used, 75.3TiB / 177TiB avail
> > > pgs: 12732 active+clean
> > >  4 active+clean+scrubbing+deep
> > > 
> > >   io:
> > > client:   72.3MiB/s rd, 26.8MiB/s wr, 2.30kop/s rd,
> > > 1.29kop/s wr
> > > 
> > > 
> > > On an other host, in the same pool, I see also high memory
> > > usage :
> > > 
> > > daevel-ob@ssdr712g:~$ ps auxw | grep ceph-osd
> > > ceph6287  6.6 10.6 6027388 5190032 ? Ssl  mars21
> > > 1511:07 /usr/bin/ceph-osd -f --cluster ceph --id 131 --setuser
> > 

Re: [ceph-users] osd_memory_target exceeding on Luminous OSD BlueStore

2019-04-09 Thread Olivier Bonvalet
Thanks for the advice. We are using Debian 9 (stretch) with a custom
Linux kernel 4.14.

But "heap release" didn't help.


On Monday 08 April 2019 at 12:18 +0200, Dan van der Ster wrote:
> Which OS are you using?
> With CentOS we find that the heap is not always automatically
> released. (You can check the heap freelist with `ceph tell osd.0 heap
> stats`).
> As a workaround we run this hourly:
> 
> ceph tell mon.* heap release
> ceph tell osd.* heap release
> ceph tell mds.* heap release
> 
> -- Dan
> 
> On Sat, Apr 6, 2019 at 1:30 PM Olivier Bonvalet 
> wrote:
> > Hi,
> > 
> > on a Luminous 12.2.11 deployment, my bluestore OSDs exceed the
> > osd_memory_target :
> > 
> > daevel-ob@ssdr712h:~$ ps auxw | grep ceph-osd
> > ceph3646 17.1 12.0 6828916 5893136 ? Ssl  mars29
> > 1903:42 /usr/bin/ceph-osd -f --cluster ceph --id 143 --setuser ceph
> > --setgroup ceph
> > ceph3991 12.9 11.2 6342812 5485356 ? Ssl  mars29
> > 1443:41 /usr/bin/ceph-osd -f --cluster ceph --id 144 --setuser ceph
> > --setgroup ceph
> > ceph4361 16.9 11.8 6718432 5783584 ? Ssl  mars29
> > 1889:41 /usr/bin/ceph-osd -f --cluster ceph --id 145 --setuser ceph
> > --setgroup ceph
> > ceph4731 19.7 12.2 6949584 5982040 ? Ssl  mars29
> > 2198:47 /usr/bin/ceph-osd -f --cluster ceph --id 146 --setuser ceph
> > --setgroup ceph
> > ceph5073 16.7 11.6 6639568 5701368 ? Ssl  mars29
> > 1866:05 /usr/bin/ceph-osd -f --cluster ceph --id 147 --setuser ceph
> > --setgroup ceph
> > ceph5417 14.6 11.2 6386764 5519944 ? Ssl  mars29
> > 1634:30 /usr/bin/ceph-osd -f --cluster ceph --id 148 --setuser ceph
> > --setgroup ceph
> > ceph5760 16.9 12.0 6806448 5879624 ? Ssl  mars29
> > 1882:42 /usr/bin/ceph-osd -f --cluster ceph --id 149 --setuser ceph
> > --setgroup ceph
> > ceph6105 16.0 11.6 6576336 5694556 ? Ssl  mars29
> > 1782:52 /usr/bin/ceph-osd -f --cluster ceph --id 150 --setuser ceph
> > --setgroup ceph
> > 
> > daevel-ob@ssdr712h:~$ free -m
> >               total        used        free      shared  buff/cache   available
> > Mem:          47771       45210        1643          17         917       43556
> > Swap:             0           0           0
> > 
> > # ceph daemon osd.147 config show | grep memory_target
> > "osd_memory_target": "4294967296",
> > 
> > 
> > And there is no recovery / backfilling, the cluster is fine :
> > 
> >$ ceph status
> >  cluster:
> >id: de035250-323d-4cf6-8c4b-cf0faf6296b1
> >health: HEALTH_OK
> > 
> >  services:
> >mon: 5 daemons, quorum tolriq,tsyne,olkas,lorunde,amphel
> >mgr: tsyne(active), standbys: olkas, tolriq, lorunde, amphel
> >osd: 120 osds: 116 up, 116 in
> > 
> >  data:
> >pools:   20 pools, 12736 pgs
> >objects: 15.29M objects, 31.1TiB
> >usage:   101TiB used, 75.3TiB / 177TiB avail
> >pgs: 12732 active+clean
> > 4 active+clean+scrubbing+deep
> > 
> >  io:
> >client:   72.3MiB/s rd, 26.8MiB/s wr, 2.30kop/s rd,
> > 1.29kop/s wr
> > 
> > 
> >On an other host, in the same pool, I see also high memory usage
> > :
> > 
> >daevel-ob@ssdr712g:~$ ps auxw | grep ceph-osd
> >ceph6287  6.6 10.6 6027388 5190032 ? Ssl  mars21
> > 1511:07 /usr/bin/ceph-osd -f --cluster ceph --id 131 --setuser ceph
> > --setgroup ceph
> >ceph6759  7.3 11.2 6299140 5484412 ? Ssl  mars21
> > 1665:22 /usr/bin/ceph-osd -f --cluster ceph --id 132 --setuser ceph
> > --setgroup ceph
> >ceph7114  7.0 11.7 6576168 5756236 ? Ssl  mars21
> > 1612:09 /usr/bin/ceph-osd -f --cluster ceph --id 133 --setuser ceph
> > --setgroup ceph
> >ceph7467  7.4 11.1 6244668 5430512 ? Ssl  mars21
> > 1704:06 /usr/bin/ceph-osd -f --cluster ceph --id 134 --setuser ceph
> > --setgroup ceph
> >ceph7821  7.7 11.1 6309456 5469376 ? Ssl  mars21
> > 1754:35 /usr/bin/ceph-osd -f --cluster ceph --id 135 --setuser ceph
> > --setgroup ceph
> >ceph8174  6.9 11.6 6545224 5705412 ? Ssl  mars21
> > 1590:31 /usr/bin/ceph-osd -f --cluster ceph --id 136 --setuser ceph
> > --setgroup ceph
> >ceph8746  6.6 11.1 6290004 5477204 ? Ssl  mars21
> > 1511:11 /usr/bin/ceph-osd -f --cluster ceph --id 137 --setuser ceph
> > --setgroup ceph
> >ceph9100  7.7 11.6 6552080 5713560 ? Ssl  mars21
> > 1757:22 /usr/bin/ceph-osd -f --cluster ceph --id 138 --setuser ceph
> > --setgroup ceph
> > 
> >But ! On a similar host, in a different pool, the problem is
> > less visible :
> > 
> >daevel-ob@ssdr712i:~$ ps auxw | grep ceph-osd
> >ceph3617  2.8  9.9 5660308 4847444 ? Ssl  mars29
> > 313:05 /usr/bin/ceph-osd -f --cluster ceph --id 151 --setuser ceph
> > --setgroup ceph
> >ceph3958  2.3  9.8 5661936 4834320 ? Ssl  mars29
> > 256:55 /usr/bin/ceph-osd -f --cluster ceph --id 152 --setuser ceph
> > 

Re: [ceph-users] bluefs-bdev-expand experience

2019-04-09 Thread Yury Shevchuk
Igor, thank you. Round 2 is explained now.

The main (aka block, aka slow) device cannot be expanded in Luminous; that
functionality will only become available after an upgrade to Nautilus.
The wal and db devices can be expanded in Luminous.

Now I have recreated osd2 once again to get rid of the paradoxical
ceph osd df output, and tried to test db expansion, 40G -> 60G:

node2:/# ceph-volume lvm zap --destroy --osd-id 2
node2:/# ceph osd lost 2 --yes-i-really-mean-it
node2:/# ceph osd destroy 2 --yes-i-really-mean-it
node2:/# lvcreate -L1G -n osd2wal vg0
node2:/# lvcreate -L40G -n osd2db vg0
node2:/# lvcreate -L400G -n osd2 vg0
node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db vg0/osd2db --block.wal vg0/osd2wal

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
 0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
 1   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
 3   hdd 0.227390 0B  0B 0B00   0
 2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

node2:/# lvextend -L+20G /dev/vg0/osd2db
  Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 extents) to 
60.00 GiB (15360 extents).
  Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
 slot 0 /var/lib/ceph/osd/ceph-2//block.wal
 slot 1 /var/lib/ceph/osd/ceph-2//block.db
 slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0xf : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0xf
1 : size label updated to 64424509440

node2:/# ceph-bluestore-tool show-label --dev /dev/vg0/osd2db | grep size
"size": 64424509440,

The label updated correctly, but ceph osd df has not changed.
I expected to see 391 + 20 = 411GiB in the AVAIL column, but it stays at 391:

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
 0   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
 1   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
 3   hdd 0.227390 0B  0B 0B00   0
 2   hdd 0.22739  1.0 400GiB 9.49GiB 391GiB 2.37 0.72 128
TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

I have restarted the monitors on all three nodes; it still shows 391GiB.

OK, but I used bluefs-bdev-expand on a running OSD... probably not good,
since it seems to work by opening bluefs directly... trying once again:

node2:/# systemctl stop ceph-osd@2

node2:/# lvextend -L+20G /dev/vg0/osd2db
  Size of logical volume vg0/osd2db changed from 60.00 GiB (15360 extents) to 
80.00 GiB (20480 extents).
  Logical volume vg0/osd2db successfully resized.

node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
 slot 0 /var/lib/ceph/osd/ceph-2//block.wal
 slot 1 /var/lib/ceph/osd/ceph-2//block.db
 slot 2 /var/lib/ceph/osd/ceph-2//block
0 : size 0x4000 : own 0x[1000~3000]
1 : size 0x14 : own 0x[2000~9e000]
2 : size 0x64 : own 0x[30~4]
Expanding...
1 : expanding  from 0xa to 0x14
1 : size label updated to 85899345920

node2:/# systemctl start ceph-osd@2
node2:/# systemctl restart ceph-mon@pier42

node2:/# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE VAR  PGS
 0   hdd 0.22739  1.0 233GiB 9.49GiB 223GiB 4.08 1.24 128
 1   hdd 0.22739  1.0 233GiB 9.50GiB 223GiB 4.08 1.24 128
 3   hdd 0.227390 0B  0B 0B00   0
 2   hdd 0.22739  1.0 400GiB 9.50GiB 391GiB 2.37 0.72   0
TOTAL 866GiB 28.5GiB 837GiB 3.29
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83

Something is wrong.  Maybe I was wrong to expect the db change to show up
in the AVAIL column?  From the BlueStore description I understood that db
and slow space should sum up, no?
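
One way to check, independently of ceph osd df, whether the expanded db space
was actually picked up is to look at the bluefs counters on the OSD admin
socket. The sketch below is an illustration only: it assumes this Luminous
build exposes a "bluefs" section in the perf dump with *_total_bytes /
*_used_bytes counters, that "ceph daemon" can reach the local admin socket,
and the script name is made up.

bluefs_space.py (name is illustrative)
===
#!/usr/bin/env python
# Sketch: dump the bluefs space counters for one OSD to check whether
# bluefs-bdev-expand really grew the db device.
import json
import subprocess
import sys

osd_id = sys.argv[1] if len(sys.argv) > 1 else '2'
raw = subprocess.check_output(['ceph', 'daemon', 'osd.%s' % osd_id,
                               'perf', 'dump'])
bluefs = json.loads(raw).get('bluefs', {})
for key in sorted(bluefs):
    if key.endswith('_total_bytes') or key.endswith('_used_bytes'):
        print('%-20s %8.1f GiB' % (key, bluefs[key] / float(1 << 30)))
===

If db_total_bytes follows the new LV size while ceph osd df stays at 400GiB,
the expansion itself worked and the remaining question is only how ceph osd df
accounts for the db device.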

Thanks for your help,


-- Yury

On Mon, Apr 08, 2019 at 10:17:24PM +0300, Igor Fedotov wrote:
> Hi Yuri,
>
> both issues from Round 2 relate to unsupported expansion for main device.
>
> In fact it doesn't work and silently bypasses the operation in your case.
>
> Please try with a different device...
>
>
> Also I've just submitted a PR for mimic to indicate the bypass, will
> backport to Luminous once mimic patch is approved.
>
> See https://github.com/ceph/ceph/pull/27447
>
>
> Thanks,
>
> Igor
>
> On 4/5/2019 4:07 PM, Yury Shevchuk wrote:
> > On Fri, Apr 05, 2019 at 02:42:53PM +0300, Igor Fedotov wrote:
> > > wrt Round 1 - an ability to expand block(main) device has been added to
> > > Nautilus,
> > >
> > > see: https://github.com/ceph/ceph/pull/25308
> > Oh, that's good.  But still separate wal may be good for studying
> > load on each volume (blktrace) or moving db/wal to another 

Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-09 Thread Francois Lafont

Hi,


On 4/9/19 5:02 AM, Pavan Rallabhandi wrote:


Refer "rgw log http headers" under 
http://docs.ceph.com/docs/nautilus/radosgw/config-ref/

Or even better in the code https://github.com/ceph/ceph/pull/7639



OK, thanks for your help Pavan. I have made progress but I still have some
problems. With the help of this comment:

https://github.com/ceph/ceph/pull/7639#issuecomment-266893208

I have tried this config:

-
rgw enable ops log      = true
rgw ops log socket path = /tmp/opslog
rgw log http headers    = http_x_forwarded_for
-

and I have logs in the socket /tmp/opslog like this:

-
{"bucket":"test1","time":"2019-04-09 09:41:18.188350Z","time_local":"2019-04-09 11:41:18.188350","remote_addr":"10.111.222.51","user":"flaf","operation":"GET","uri":"GET /?prefix=toto/=%2F 
HTTP/1.1","http_status":"200","error_code":"","bytes_sent":832,"bytes_received":0,"object_size":0,"total_time":39,"user_agent":"DragonDisk 1.05 ( http://www.dragondisk.com 
)","referrer":"","http_x_headers":[{"HTTP_X_FORWARDED_FOR":"10.111.222.55"}]},
-
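
For what it is worth, once a record like the one above is in hand, the
forwarded client address can be pulled out of the http_x_headers list. A
minimal sketch, assuming one JSON object per file (the record.json name is
just an illustration); the field names are copied from the record above.

===
#!/usr/bin/env python
# Sketch: print the proxy address and the original client address carried in
# X-Forwarded-For for a single ops log record.
import json

with open('record.json') as f:          # hypothetical file holding one entry
    record = json.load(f)

forwarded = ''
for headers in record.get('http_x_headers', []):
    forwarded = headers.get('HTTP_X_FORWARDED_FOR', forwarded)

print('proxy: %s  client: %s' % (record['remote_addr'], forwarded or 'unknown'))
===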

I can see the IP address of the client in the value of HTTP_X_FORWARDED_FOR, 
that's cool.

But I don't understand why there is a separate socket for logging this. I'm running radosgw in a Docker container
(installed via ceph-ansible) and the logs of the "radosgw" daemon already end up in the
"/var/log/syslog" file of my host (I'm using the Docker "syslog" log-driver).

1. Why is there a _separate_ log source for that? In "/var/log/syslog"
I already have some civetweb logs. For instance:

2019-04-09 12:33:45.926 7f02e021c700  1 civetweb: 0x55876dc9c000: 10.111.222.51 - - 
[09/Apr/2019:12:33:45 +0200] "GET /?prefix=toto/=%2F HTTP/1.1" 200 
1014 - DragonDisk 1.05 ( http://www.dragondisk.com )

2. In my Docker container context, is it possible to send the logs above to the
"/var/log/syslog" file of my host? In other words, is it possible to make the
"radosgw" daemon log them to stdout?
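
Regarding question 2, one pragmatic option is a small sidecar process that
reads the ops log from the socket and echoes each record to its own stdout, so
the container's log driver treats it like any other output. This is only a
sketch: it assumes radosgw exposes /tmp/opslog as a listening UNIX stream
socket (similar to the admin socket) and streams the JSON entries shown above;
the exact framing has not been verified here, and the script name is made up,
so treat it as a starting point.

opslog_to_stdout.py (name is illustrative)
===
#!/usr/bin/env python
# Minimal sketch: connect to the radosgw ops log socket and copy every record
# to stdout so the container's log driver (syslog in this setup) picks it up.
import socket
import sys

SOCK_PATH = '/tmp/opslog'   # matches "rgw ops log socket path" above

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(SOCK_PATH)
buf = b''
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break                       # radosgw closed the socket
    buf += chunk
    while b'\n' in buf:
        line, buf = buf.split(b'\n', 1)
        line = line.strip().rstrip(b',')
        if line:
            sys.stdout.write(line.decode('utf-8', 'replace') + '\n')
            sys.stdout.flush()
===

This does not answer whether radosgw itself can be told to write the ops log
to stdout, but it keeps everything flowing through the same syslog pipeline
without changing the container image.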

--
flaf
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com