Re: [ceph-users] Have you ever encountered a similar cephfs deadlock stack?

2018-10-22 Thread Yan, Zheng


> On Oct 22, 2018, at 23:06, ? ?  wrote:
> 
>  
> Hello:
>  Have you ever encountered a similar cephfs deadlock stack?
>  
> [Sat Oct 20 15:11:40 2018] INFO: task nfsd:27191 blocked for more than 120 
> seconds.
> [Sat Oct 20 15:11:40 2018]   Tainted: G   OE     
> 4.14.0-49.el7.centos.x86_64 #1
> [Sat Oct 20 15:11:40 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> disables this message.
> [Sat Oct 20 15:11:40 2018] nfsd            D    0 27191      2 0x8080
> [Sat Oct 20 15:11:40 2018] Call Trace:
> [Sat Oct 20 15:11:40 2018]  __schedule+0x28d/0x880
> [Sat Oct 20 15:11:40 2018]  schedule+0x36/0x80
> [Sat Oct 20 15:11:40 2018]  rwsem_down_write_failed+0x20d/0x380
> [Sat Oct 20 15:11:40 2018]  ? ip_finish_output2+0x15d/0x390
> [Sat Oct 20 15:11:40 2018]  call_rwsem_down_write_failed+0x17/0x30
> [Sat Oct 20 15:11:40 2018]  down_write+0x2d/0x40
> [Sat Oct 20 15:11:40 2018]  ceph_write_iter+0x101/0xf00 [ceph]
> [Sat Oct 20 15:11:40 2018]  ? __ceph_caps_issued_mask+0x1ed/0x200 [ceph]
> [Sat Oct 20 15:11:40 2018]  ? nfsd_acceptable+0xa3/0xe0 [nfsd]
> [Sat Oct 20 15:11:40 2018]  ? exportfs_decode_fh+0xd2/0x3e0
> [Sat Oct 20 15:11:40 2018]  ? nfsd_proc_read+0x1a0/0x1a0 [nfsd]
> [Sat Oct 20 15:11:40 2018]  do_iter_readv_writev+0x10b/0x170
> [Sat Oct 20 15:11:40 2018]  do_iter_write+0x7f/0x190
> [Sat Oct 20 15:11:40 2018]  vfs_iter_write+0x19/0x30
> [Sat Oct 20 15:11:40 2018]  nfsd_vfs_write+0xc6/0x360 [nfsd]
> [Sat Oct 20 15:11:40 2018]  nfsd4_write+0x1b8/0x260 [nfsd]
> [Sat Oct 20 15:11:40 2018]  ? nfsd4_encode_operation+0x13f/0x1c0 [nfsd]
> [Sat Oct 20 15:11:40 2018]  nfsd4_proc_compound+0x3e0/0x810 [nfsd]
> [Sat Oct 20 15:11:40 2018]  nfsd_dispatch+0xc9/0x2f0 [nfsd]
> [Sat Oct 20 15:11:40 2018]  svc_process_common+0x385/0x710 [sunrpc]
> [Sat Oct 20 15:11:40 2018]  svc_process+0xfd/0x1c0 [sunrpc]
> [Sat Oct 20 15:11:40 2018]  nfsd+0xf3/0x190 [nfsd]
> [Sat Oct 20 15:11:40 2018]  kthread+0x109/0x140
> [Sat Oct 20 15:11:40 2018]  ? nfsd_destroy+0x60/0x60 [nfsd]
> [Sat Oct 20 15:11:40 2018]  ? kthread_park+0x60/0x60
> [Sat Oct 20 15:11:40 2018]  ret_from_fork+0x25/0x30

I did see this before. Please run ‘echo t > /proc/sysrq-trigger’ and send the 
kernel log to us if you encounter this again.
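
One way to capture that dump, assuming root access and that sysrq is not
locked down on the host, is roughly:

  echo 1 > /proc/sys/kernel/sysrq      # enable sysrq if it is restricted
  echo t > /proc/sysrq-trigger         # dump all task states to the kernel log
  dmesg -T > /tmp/sysrq-task-dump.txt  # save the log so it can be attached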

Yan, Zheng

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [ceph-ansible]Purging cluster using ceph-ansible stable 3.1/3.2

2018-10-22 Thread Cody
Hi folks,

I tried to purge a ceph cluster using
infrastructure-playbooks/purge-cluster.yml from stable 3.1 and stable 3.2
branches, but kept getting the following error immediately:

ERROR! no action detected in task. This often indicates a misspelled module
name, or incorrect module path.

The error appears to have been in
'/root/ceph-ansible/infrastructure-playbooks/purge-cluster.yml': line 353,
column 5, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


  - name: zap and destroy osds created by ceph-volume with lvm_volumes
^ here

Affected environment:
ceph-ansible stable 3.1 + ansible 2.4.2
ceph-ansible stable 3.2 + ansible 2.6.6

What could be wrong in my case?
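
One thing I plan to try, in case this is a module lookup problem
(ceph-ansible ships its own modules such as ceph_volume in its library/
directory, which Ansible may not find when the playbook is run from the
infrastructure-playbooks/ subdirectory), is to copy the playbook to the
repository root first, roughly:

  cd /root/ceph-ansible
  cp infrastructure-playbooks/purge-cluster.yml .
  ansible-playbook -i <inventory> purge-cluster.yml

Does that sound like the right direction, or is this a version mismatch
between ceph-ansible and Ansible?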

Thank you to all.

Regards,
Cody
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
I don't have enough disk space on the NVMe. The DB would overflow before I
reached 25% utilization in the cluster. The disks are 10TB spinners and
would need a minimum of 100 GB of DB space based on early testing; the
official docs recommend a 400GB DB for a disk this size. I don't have
enough flash space for that on the 2x NVMe disks in those servers.  Hence I
put the WAL on the NVMes and left the DB on the data disk, where it would
have spilled over to almost immediately anyway.
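
If it helps, a rough way to check for that spillover is the bluefs counters
on the OSD admin socket (osd.0 is just a placeholder id; the counter names
are from the bluefs section of perf dump on Luminous-era BlueStore):

  # run on the OSD host; for an OSD with a separate DB device, non-zero
  # slow_used_bytes means RocksDB has spilled over onto the data disk
  ceph daemon osd.0 perf dump | grep -E '"(db|slow)_(total|used)_bytes"'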

On Mon, Oct 22, 2018, 6:55 PM solarflow99  wrote:

> Why didn't you just install the DB + WAL on the NVMe?  Is this "data disk"
> still an ssd?
>
>
>
> On Mon, Oct 22, 2018 at 3:34 PM David Turner 
> wrote:
>
>> And by the data disk I mean that I didn't specify a location for the DB
>> partition.
>>
>> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
>> wrote:
>>
>>> Track down where it says they point to?  Does it match what you expect?
>>> It does for me.  I have my DB on my data disk and my WAL on a separate NVMe.
>>>
>>> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
>>> wrote:
>>>

  David - is it ensured that wal and db both live where the symlink
 block.db points?  I assumed that was a symlink for the db, but necessarily
 for the wal, because it can live in a place different than the db.

 On Mon, Oct 22, 2018 at 2:18 PM David Turner 
 wrote:

> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look
> at where the symlinks for block and block.wal point to.
>
> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
> rstanford8...@gmail.com> wrote:
>
>>
>>  That's what they say, however I did exactly this and my cluster
>> utilization is higher than the total pool utilization by about the number
>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
>> I've asked here and no one seems to know a way to verify this.  Do you?
>>
>>  Thank you, R
>>
>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>> wrote:
>>
>>>
>>> If you specify a db on ssd and data on hdd and not explicitly
>>> specify a
>>> device for wal, wal will be placed on same ssd partition with db.
>>> Placing only wal on ssd or creating separate devices for wal and db
>>> are
>>> less common setups.
>>>
>>> /Maged
>>>
>>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>>> > Hi!
>>> >
>>> > For sharing SSD between WAL and DB what should be placed on SSD?
>>> WAL or DB?
>>> >
>>> > - Original Message -
>>> > From: "Maged Mokhtar" 
>>> > To: "ceph-users" 
>>> > Sent: Saturday, 20 October, 2018 20:05:44
>>> > Subject: Re: [ceph-users] Drive for Wal and Db
>>> >
>>> > On 20/10/18 18:57, Robert Stanford wrote:
>>> >
>>> >
>>> >
>>> >
>>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD
>>> has a partition on an SSD for its DB. Wal is on the regular hard drives.
>>> Should I move the wal to share the SSD with the DB?
>>> >
>>> > Regards
>>> > R
>>> >
>>> >
>>> > ___
>>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>>> ceph-users@lists.ceph.com ] [
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>>> >
>>> > you should put wal on the faster device, wal and db could share
>>> the same ssd partition,
>>> >
>>> > Maged
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread solarflow99
Why didn't you just install the DB + WAL on the NVMe?  Is this "data disk"
still an ssd?



On Mon, Oct 22, 2018 at 3:34 PM David Turner  wrote:

> And by the data disk I mean that I didn't specify a location for the DB
> partition.
>
> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
> wrote:
>
>> Track down where it says they point to?  Does it match what you expect?
>> It does for me.  I have my DB on my data disk and my WAL on a separate NVMe.
>>
>> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
>> wrote:
>>
>>>
>>>  David - is it ensured that wal and db both live where the symlink
>>> block.db points?  I assumed that was a symlink for the db, but necessarily
>>> for the wal, because it can live in a place different than the db.
>>>
>>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>>> wrote:
>>>
 You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
 where the symlinks for block and block.wal point to.

 On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
 rstanford8...@gmail.com> wrote:

>
>  That's what they say, however I did exactly this and my cluster
> utilization is higher than the total pool utilization by about the number
> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
> I've asked here and no one seems to know a way to verify this.  Do you?
>
>  Thank you, R
>
> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
> wrote:
>
>>
>> If you specify a db on ssd and data on hdd and not explicitly specify
>> a
>> device for wal, wal will be placed on same ssd partition with db.
>> Placing only wal on ssd or creating separate devices for wal and db
>> are
>> less common setups.
>>
>> /Maged
>>
>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>> > Hi!
>> >
>> > For sharing SSD between WAL and DB what should be placed on SSD?
>> WAL or DB?
>> >
>> > - Original Message -
>> > From: "Maged Mokhtar" 
>> > To: "ceph-users" 
>> > Sent: Saturday, 20 October, 2018 20:05:44
>> > Subject: Re: [ceph-users] Drive for Wal and Db
>> >
>> > On 20/10/18 18:57, Robert Stanford wrote:
>> >
>> >
>> >
>> >
>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has
>> a partition on an SSD for its DB. Wal is on the regular hard drives. 
>> Should
>> I move the wal to share the SSD with the DB?
>> >
>> > Regards
>> > R
>> >
>> >
>> > ___
>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>> ceph-users@lists.ceph.com ] [
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>> >
>> > you should put wal on the faster device, wal and db could share the
>> same ssd partition,
>> >
>> > Maged
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
 ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW stale buckets

2018-10-22 Thread Robert Stanford
 Someone deleted our RGW data pool to clean up.  They recreated it
afterward.  This is fine in one respect: we don't need the data.  But
listing with radosgw-admin still shows all the buckets.  How can we clean
things up and get RGW to understand what actually exists and what doesn't?
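
 My current guess, which I haven't tested, is that the bucket entries live
in the RGW metadata pools rather than the data pool that was recreated, so
cleanup might look something like this (<name> is a placeholder):

  radosgw-admin metadata list bucket           # stale buckets still listed here
  radosgw-admin bucket stats --bucket=<name>   # probably errors now that the data is gone
  radosgw-admin metadata rm bucket:<name>      # drop the stale bucket entry

Can anyone confirm whether that is the right approach?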
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
No, it's exactly what I told you it was.  "bluestore_bdev_partition_path"
is the data path.  In all of my scenarios my DB and Data are on the same
partition, hence mine are the same.  Your DB and WAL are on a different
partition from your Data... so your DB partition is different... Whatever
your misunderstanding is about where/why your cluster's usage is
higher/different than you think it is, it has nothing to do with where your
DB and WAL partitions are.

There is an overhead just for having a FS on the disk; in this case that FS
is BlueStore.  You can look at [1] this ML thread from a while ago, where I
mentioned that a brand new cluster with no data in it, and with the WAL
partitions on separate disks, was using about 1.1GB of data per OSD.

[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025246.html
On Mon, Oct 22, 2018 at 4:51 PM Robert Stanford 
wrote:

>
>  That's very helpful, thanks.  In your first case above your
> bluefs_db_partition_path and bluestore_bdev_partition path are the same.
> Though I have a different data and db drive, mine are different.  Might
> this explain something?  My root concern is that there is more utilization
> on the cluster than what's in the pools, the excess equal to about wal size
> * number of osds...
>
> On Mon, Oct 22, 2018 at 3:35 PM David Turner 
> wrote:
>
>> My DB doesn't have a specific partition anywhere, but there's still a
>> symlink for it to the data partition.  On my home cluster with all DB, WAL,
>> and Data on the same disk without any partitions specified there is a block
>> symlink but no block.wal symlink.
>>
>> For the cluster with a specific WAL partition, but no DB partition, my
>> OSD paths looks like [1] this.  For my cluster with everything on the same
>> disk, my OSD paths look like [2] this.  Unless you have a specific path for
>> "bluefs_wal_partition_path" then it's going to find itself on the same
>> partition as the db.
>>
>> [1] $ ceph osd metadata 5 | grep path
>> "bluefs_db_partition_path": "/dev/dm-29",
>> "bluefs_wal_partition_path": "/dev/dm-41",
>> "bluestore_bdev_partition_path": "/dev/dm-29",
>>
>> [2] $ ceph osd metadata 5 | grep path
>> "bluefs_db_partition_path": "/dev/dm-5",
>> "bluestore_bdev_partition_path": "/dev/dm-5",
>>
>> On Mon, Oct 22, 2018 at 4:21 PM Robert Stanford 
>> wrote:
>>
>>>
>>>  Let me add, I have no block.wal file (which the docs suggest should be
>>> there).
>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
>>>
>>> On Mon, Oct 22, 2018 at 3:13 PM Robert Stanford 
>>> wrote:
>>>

  We're out of sync, I think.  You have your DB on your data disk so
 your block.db symlink points to that disk, right?  There is however no wal
 symlink?  So how would you verify your WAL actually lived on your NVMe?

 On Mon, Oct 22, 2018 at 3:07 PM David Turner 
 wrote:

> And by the data disk I mean that I didn't specify a location for the
> DB partition.
>
> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
> wrote:
>
>> Track down where it says they point to?  Does it match what you
>> expect?  It does for me.  I have my DB on my data disk and my WAL on a
>> separate NVMe.
>>
>> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford <
>> rstanford8...@gmail.com> wrote:
>>
>>>
>>>  David - is it ensured that wal and db both live where the symlink
>>> block.db points?  I assumed that was a symlink for the db, but 
>>> necessarily
>>> for the wal, because it can live in a place different than the db.
>>>
>>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>>> wrote:
>>>
 You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and
 look at where the symlinks for block and block.wal point to.

 On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
 rstanford8...@gmail.com> wrote:

>
>  That's what they say, however I did exactly this and my cluster
> utilization is higher than the total pool utilization by about the 
> number
> of OSDs * wal size.  I want to verify that the wal is on the SSDs too 
> but
> I've asked here and no one seems to know a way to verify this.  Do 
> you?
>
>  Thank you, R
>
> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar <
> mmokh...@petasan.org> wrote:
>
>>
>> If you specify a db on ssd and data on hdd and not explicitly
>> specify a
>> device for wal, wal will be placed on same ssd partition with db.
>> Placing only wal on ssd or creating separate devices for wal and
>> db are
>> less common setups.
>>
>> /Maged
>>
>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>> > Hi!
>> >
>> > For sharing SSD between WAL and DB what should be placed on
>> SSD? WAL or DB?

Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread Robert Stanford
 That's very helpful, thanks.  In your first case above your
bluefs_db_partition_path and bluestore_bdev_partition path are the same.
Though I have a different data and db drive, mine are different.  Might
this explain something?  My root concern is that there is more utilization
on the cluster than what's in the pools, the excess equal to about wal size
* number of osds...

On Mon, Oct 22, 2018 at 3:35 PM David Turner  wrote:

> My DB doesn't have a specific partition anywhere, but there's still a
> symlink for it to the data partition.  On my home cluster with all DB, WAL,
> and Data on the same disk without any partitions specified there is a block
> symlink but no block.wal symlink.
>
> For the cluster with a specific WAL partition, but no DB partition, my OSD
> paths looks like [1] this.  For my cluster with everything on the same
> disk, my OSD paths look like [2] this.  Unless you have a specific path for
> "bluefs_wal_partition_path" then it's going to find itself on the same
> partition as the db.
>
> [1] $ ceph osd metadata 5 | grep path
> "bluefs_db_partition_path": "/dev/dm-29",
> "bluefs_wal_partition_path": "/dev/dm-41",
> "bluestore_bdev_partition_path": "/dev/dm-29",
>
> [2] $ ceph osd metadata 5 | grep path
> "bluefs_db_partition_path": "/dev/dm-5",
> "bluestore_bdev_partition_path": "/dev/dm-5",
>
> On Mon, Oct 22, 2018 at 4:21 PM Robert Stanford 
> wrote:
>
>>
>>  Let me add, I have no block.wal file (which the docs suggest should be
>> there).
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
>>
>> On Mon, Oct 22, 2018 at 3:13 PM Robert Stanford 
>> wrote:
>>
>>>
>>>  We're out of sync, I think.  You have your DB on your data disk so your
>>> block.db symlink points to that disk, right?  There is however no wal
>>> symlink?  So how would you verify your WAL actually lived on your NVMe?
>>>
>>> On Mon, Oct 22, 2018 at 3:07 PM David Turner 
>>> wrote:
>>>
 And by the data disk I mean that I didn't specify a location for the DB
 partition.

 On Mon, Oct 22, 2018 at 4:06 PM David Turner 
 wrote:

> Track down where it says they point to?  Does it match what you
> expect?  It does for me.  I have my DB on my data disk and my WAL on a
> separate NVMe.
>
> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford <
> rstanford8...@gmail.com> wrote:
>
>>
>>  David - is it ensured that wal and db both live where the symlink
>> block.db points?  I assumed that was a symlink for the db, but 
>> necessarily
>> for the wal, because it can live in a place different than the db.
>>
>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>> wrote:
>>
>>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look
>>> at where the symlinks for block and block.wal point to.
>>>
>>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
>>> rstanford8...@gmail.com> wrote:
>>>

  That's what they say, however I did exactly this and my cluster
 utilization is higher than the total pool utilization by about the 
 number
 of OSDs * wal size.  I want to verify that the wal is on the SSDs too 
 but
 I've asked here and no one seems to know a way to verify this.  Do you?

  Thank you, R

 On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
 wrote:

>
> If you specify a db on ssd and data on hdd and not explicitly
> specify a
> device for wal, wal will be placed on same ssd partition with db.
> Placing only wal on ssd or creating separate devices for wal and
> db are
> less common setups.
>
> /Maged
>
> On 22/10/18 09:03, Fyodor Ustinov wrote:
> > Hi!
> >
> > For sharing SSD between WAL and DB what should be placed on SSD?
> WAL or DB?
> >
> > - Original Message -
> > From: "Maged Mokhtar" 
> > To: "ceph-users" 
> > Sent: Saturday, 20 October, 2018 20:05:44
> > Subject: Re: [ceph-users] Drive for Wal and Db
> >
> > On 20/10/18 18:57, Robert Stanford wrote:
> >
> >
> >
> >
> > Our OSDs are BlueStore and are on regular hard drives. Each OSD
> has a partition on an SSD for its DB. Wal is on the regular hard 
> drives.
> Should I move the wal to share the SSD with the DB?
> >
> > Regards
> > R
> >
> >
> > ___
> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
> ceph-users@lists.ceph.com ] [
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
> >
> > you should put wal on the faster device, wal and db could 

Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
My DB doesn't have a specific partition anywhere, but there's still a
symlink for it to the data partition.  On my home cluster with all DB, WAL,
and Data on the same disk without any partitions specified there is a block
symlink but no block.wal symlink.

For the cluster with a specific WAL partition, but no DB partition, my OSD
paths looks like [1] this.  For my cluster with everything on the same
disk, my OSD paths look like [2] this.  Unless you have a specific path for
"bluefs_wal_partition_path" then it's going to find itself on the same
partition as the db.

[1] $ ceph osd metadata 5 | grep path
"bluefs_db_partition_path": "/dev/dm-29",
"bluefs_wal_partition_path": "/dev/dm-41",
"bluestore_bdev_partition_path": "/dev/dm-29",

[2] $ ceph osd metadata 5 | grep path
"bluefs_db_partition_path": "/dev/dm-5",
"bluestore_bdev_partition_path": "/dev/dm-5",

On Mon, Oct 22, 2018 at 4:21 PM Robert Stanford 
wrote:

>
>  Let me add, I have no block.wal file (which the docs suggest should be
> there).
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
>
> On Mon, Oct 22, 2018 at 3:13 PM Robert Stanford 
> wrote:
>
>>
>>  We're out of sync, I think.  You have your DB on your data disk so your
>> block.db symlink points to that disk, right?  There is however no wal
>> symlink?  So how would you verify your WAL actually lived on your NVMe?
>>
>> On Mon, Oct 22, 2018 at 3:07 PM David Turner 
>> wrote:
>>
>>> And by the data disk I mean that I didn't specify a location for the DB
>>> partition.
>>>
>>> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
>>> wrote:
>>>
 Track down where it says they point to?  Does it match what you
 expect?  It does for me.  I have my DB on my data disk and my WAL on a
 separate NVMe.

 On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford <
 rstanford8...@gmail.com> wrote:

>
>  David - is it ensured that wal and db both live where the symlink
> block.db points?  I assumed that was a symlink for the db, but necessarily
> for the wal, because it can live in a place different than the db.
>
> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
> wrote:
>
>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look
>> at where the symlinks for block and block.wal point to.
>>
>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
>> rstanford8...@gmail.com> wrote:
>>
>>>
>>>  That's what they say, however I did exactly this and my cluster
>>> utilization is higher than the total pool utilization by about the 
>>> number
>>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too 
>>> but
>>> I've asked here and no one seems to know a way to verify this.  Do you?
>>>
>>>  Thank you, R
>>>
>>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>>> wrote:
>>>

 If you specify a db on ssd and data on hdd and not explicitly
 specify a
 device for wal, wal will be placed on same ssd partition with db.
 Placing only wal on ssd or creating separate devices for wal and db
 are
 less common setups.

 /Maged

 On 22/10/18 09:03, Fyodor Ustinov wrote:
 > Hi!
 >
 > For sharing SSD between WAL and DB what should be placed on SSD?
 WAL or DB?
 >
 > - Original Message -
 > From: "Maged Mokhtar" 
 > To: "ceph-users" 
 > Sent: Saturday, 20 October, 2018 20:05:44
 > Subject: Re: [ceph-users] Drive for Wal and Db
 >
 > On 20/10/18 18:57, Robert Stanford wrote:
 >
 >
 >
 >
 > Our OSDs are BlueStore and are on regular hard drives. Each OSD
 has a partition on an SSD for its DB. Wal is on the regular hard 
 drives.
 Should I move the wal to share the SSD with the DB?
 >
 > Regards
 > R
 >
 >
 > ___
 > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
 ceph-users@lists.ceph.com ] [
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
 >
 > you should put wal on the faster device, wal and db could share
 the same ssd partition,
 >
 > Maged
 >
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 

Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread Robert Stanford
 Let me add, I have no block.wal file (which the docs suggest should be
there).
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/

On Mon, Oct 22, 2018 at 3:13 PM Robert Stanford 
wrote:

>
>  We're out of sync, I think.  You have your DB on your data disk so your
> block.db symlink points to that disk, right?  There is however no wal
> symlink?  So how would you verify your WAL actually lived on your NVMe?
>
> On Mon, Oct 22, 2018 at 3:07 PM David Turner 
> wrote:
>
>> And by the data disk I mean that I didn't specify a location for the DB
>> partition.
>>
>> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
>> wrote:
>>
>>> Track down where it says they point to?  Does it match what you expect?
>>> It does for me.  I have my DB on my data disk and my WAL on a separate NVMe.
>>>
>>> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
>>> wrote:
>>>

  David - is it ensured that wal and db both live where the symlink
 block.db points?  I assumed that was a symlink for the db, but necessarily
 for the wal, because it can live in a place different than the db.

 On Mon, Oct 22, 2018 at 2:18 PM David Turner 
 wrote:

> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look
> at where the symlinks for block and block.wal point to.
>
> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
> rstanford8...@gmail.com> wrote:
>
>>
>>  That's what they say, however I did exactly this and my cluster
>> utilization is higher than the total pool utilization by about the number
>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
>> I've asked here and no one seems to know a way to verify this.  Do you?
>>
>>  Thank you, R
>>
>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>> wrote:
>>
>>>
>>> If you specify a db on ssd and data on hdd and not explicitly
>>> specify a
>>> device for wal, wal will be placed on same ssd partition with db.
>>> Placing only wal on ssd or creating separate devices for wal and db
>>> are
>>> less common setups.
>>>
>>> /Maged
>>>
>>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>>> > Hi!
>>> >
>>> > For sharing SSD between WAL and DB what should be placed on SSD?
>>> WAL or DB?
>>> >
>>> > - Original Message -
>>> > From: "Maged Mokhtar" 
>>> > To: "ceph-users" 
>>> > Sent: Saturday, 20 October, 2018 20:05:44
>>> > Subject: Re: [ceph-users] Drive for Wal and Db
>>> >
>>> > On 20/10/18 18:57, Robert Stanford wrote:
>>> >
>>> >
>>> >
>>> >
>>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD
>>> has a partition on an SSD for its DB. Wal is on the regular hard drives.
>>> Should I move the wal to share the SSD with the DB?
>>> >
>>> > Regards
>>> > R
>>> >
>>> >
>>> > ___
>>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>>> ceph-users@lists.ceph.com ] [
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>>> >
>>> > you should put wal on the faster device, wal and db could share
>>> the same ssd partition,
>>> >
>>> > Maged
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread Robert Stanford
 We're out of sync, I think.  You have your DB on your data disk so your
block.db symlink points to that disk, right?  There is however no wal
symlink?  So how would you verify your WAL actually lived on your NVMe?

On Mon, Oct 22, 2018 at 3:07 PM David Turner  wrote:

> And by the data disk I mean that I didn't specify a location for the DB
> partition.
>
> On Mon, Oct 22, 2018 at 4:06 PM David Turner 
> wrote:
>
>> Track down where it says they point to?  Does it match what you expect?
>> It does for me.  I have my DB on my data disk and my WAL on a separate NVMe.
>>
>> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
>> wrote:
>>
>>>
>>>  David - is it ensured that wal and db both live where the symlink
>>> block.db points?  I assumed that was a symlink for the db, but necessarily
>>> for the wal, because it can live in a place different than the db.
>>>
>>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>>> wrote:
>>>
 You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
 where the symlinks for block and block.wal point to.

 On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
 rstanford8...@gmail.com> wrote:

>
>  That's what they say, however I did exactly this and my cluster
> utilization is higher than the total pool utilization by about the number
> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
> I've asked here and no one seems to know a way to verify this.  Do you?
>
>  Thank you, R
>
> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
> wrote:
>
>>
>> If you specify a db on ssd and data on hdd and not explicitly specify
>> a
>> device for wal, wal will be placed on same ssd partition with db.
>> Placing only wal on ssd or creating separate devices for wal and db
>> are
>> less common setups.
>>
>> /Maged
>>
>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>> > Hi!
>> >
>> > For sharing SSD between WAL and DB what should be placed on SSD?
>> WAL or DB?
>> >
>> > - Original Message -
>> > From: "Maged Mokhtar" 
>> > To: "ceph-users" 
>> > Sent: Saturday, 20 October, 2018 20:05:44
>> > Subject: Re: [ceph-users] Drive for Wal and Db
>> >
>> > On 20/10/18 18:57, Robert Stanford wrote:
>> >
>> >
>> >
>> >
>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has
>> a partition on an SSD for its DB. Wal is on the regular hard drives. 
>> Should
>> I move the wal to share the SSD with the DB?
>> >
>> > Regards
>> > R
>> >
>> >
>> > ___
>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>> ceph-users@lists.ceph.com ] [
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>> >
>> > you should put wal on the faster device, wal and db could share the
>> same ssd partition,
>> >
>> > Maged
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
Track down where it says they point to?  Does it match what you expect?  It
does for me.  I have my DB on my data disk and my WAL on a separate NVMe.

On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
wrote:

>
>  David - is it ensured that wal and db both live where the symlink
> block.db points?  I assumed that was a symlink for the db, but necessarily
> for the wal, because it can live in a place different than the db.
>
> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
> wrote:
>
>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
>> where the symlinks for block and block.wal point to.
>>
>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford 
>> wrote:
>>
>>>
>>>  That's what they say, however I did exactly this and my cluster
>>> utilization is higher than the total pool utilization by about the number
>>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
>>> I've asked here and no one seems to know a way to verify this.  Do you?
>>>
>>>  Thank you, R
>>>
>>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>>> wrote:
>>>

 If you specify a db on ssd and data on hdd and not explicitly specify a
 device for wal, wal will be placed on same ssd partition with db.
 Placing only wal on ssd or creating separate devices for wal and db are
 less common setups.

 /Maged

 On 22/10/18 09:03, Fyodor Ustinov wrote:
 > Hi!
 >
 > For sharing SSD between WAL and DB what should be placed on SSD? WAL
 or DB?
 >
 > - Original Message -
 > From: "Maged Mokhtar" 
 > To: "ceph-users" 
 > Sent: Saturday, 20 October, 2018 20:05:44
 > Subject: Re: [ceph-users] Drive for Wal and Db
 >
 > On 20/10/18 18:57, Robert Stanford wrote:
 >
 >
 >
 >
 > Our OSDs are BlueStore and are on regular hard drives. Each OSD has a
 partition on an SSD for its DB. Wal is on the regular hard drives. Should I
 move the wal to share the SSD with the DB?
 >
 > Regards
 > R
 >
 >
 > ___
 > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
 ceph-users@lists.ceph.com ] [
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
 >
 > you should put wal on the faster device, wal and db could share the
 same ssd partition,
 >
 > Maged
 >
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
And by the data disk I mean that I didn't specify a location for the DB
partition.

On Mon, Oct 22, 2018 at 4:06 PM David Turner  wrote:

> Track down where it says they point to?  Does it match what you expect?
> It does for me.  I have my DB on my data disk and my WAL on a separate NVMe.
>
> On Mon, Oct 22, 2018 at 3:21 PM Robert Stanford 
> wrote:
>
>>
>>  David - is it ensured that wal and db both live where the symlink
>> block.db points?  I assumed that was a symlink for the db, but necessarily
>> for the wal, because it can live in a place different than the db.
>>
>> On Mon, Oct 22, 2018 at 2:18 PM David Turner 
>> wrote:
>>
>>> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
>>> where the symlinks for block and block.wal point to.
>>>
>>> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford <
>>> rstanford8...@gmail.com> wrote:
>>>

  That's what they say, however I did exactly this and my cluster
 utilization is higher than the total pool utilization by about the number
 of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
 I've asked here and no one seems to know a way to verify this.  Do you?

  Thank you, R

 On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
 wrote:

>
> If you specify a db on ssd and data on hdd and not explicitly specify
> a
> device for wal, wal will be placed on same ssd partition with db.
> Placing only wal on ssd or creating separate devices for wal and db
> are
> less common setups.
>
> /Maged
>
> On 22/10/18 09:03, Fyodor Ustinov wrote:
> > Hi!
> >
> > For sharing SSD between WAL and DB what should be placed on SSD? WAL
> or DB?
> >
> > - Original Message -
> > From: "Maged Mokhtar" 
> > To: "ceph-users" 
> > Sent: Saturday, 20 October, 2018 20:05:44
> > Subject: Re: [ceph-users] Drive for Wal and Db
> >
> > On 20/10/18 18:57, Robert Stanford wrote:
> >
> >
> >
> >
> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has
> a partition on an SSD for its DB. Wal is on the regular hard drives. 
> Should
> I move the wal to share the SSD with the DB?
> >
> > Regards
> > R
> >
> >
> > ___
> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
> ceph-users@lists.ceph.com ] [
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
> >
> > you should put wal on the faster device, wal and db could share the
> same ssd partition,
> >
> > Maged
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread Robert Stanford
 David - is it ensured that wal and db both live where the symlink block.db
points?  I assumed that was a symlink for the db, but not necessarily for the
wal, because it can live in a place different than the db.

On Mon, Oct 22, 2018 at 2:18 PM David Turner  wrote:

> You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
> where the symlinks for block and block.wal point to.
>
> On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford 
> wrote:
>
>>
>>  That's what they say, however I did exactly this and my cluster
>> utilization is higher than the total pool utilization by about the number
>> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
>> I've asked here and no one seems to know a way to verify this.  Do you?
>>
>>  Thank you, R
>>
>> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
>> wrote:
>>
>>>
>>> If you specify a db on ssd and data on hdd and not explicitly specify a
>>> device for wal, wal will be placed on same ssd partition with db.
>>> Placing only wal on ssd or creating separate devices for wal and db are
>>> less common setups.
>>>
>>> /Maged
>>>
>>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>>> > Hi!
>>> >
>>> > For sharing SSD between WAL and DB what should be placed on SSD? WAL
>>> or DB?
>>> >
>>> > - Original Message -
>>> > From: "Maged Mokhtar" 
>>> > To: "ceph-users" 
>>> > Sent: Saturday, 20 October, 2018 20:05:44
>>> > Subject: Re: [ceph-users] Drive for Wal and Db
>>> >
>>> > On 20/10/18 18:57, Robert Stanford wrote:
>>> >
>>> >
>>> >
>>> >
>>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has a
>>> partition on an SSD for its DB. Wal is on the regular hard drives. Should I
>>> move the wal to share the SSD with the DB?
>>> >
>>> > Regards
>>> > R
>>> >
>>> >
>>> > ___
>>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>>> ceph-users@lists.ceph.com ] [
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>>> >
>>> > you should put wal on the faster device, wal and db could share the
>>> same ssd partition,
>>> >
>>> > Maged
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread David Turner
You can always just go to /var/lib/ceph/osd/ceph-{osd-num}/ and look at
where the symlinks for block and block.wal point to.
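
For example, something like this (osd.0 is just a placeholder; block.db and
block.wal only exist if those were split out onto their own devices):

  ls -l /var/lib/ceph/osd/ceph-0/block* 2>/dev/null
  readlink -f /var/lib/ceph/osd/ceph-0/block*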

On Mon, Oct 22, 2018 at 12:29 PM Robert Stanford 
wrote:

>
>  That's what they say, however I did exactly this and my cluster
> utilization is higher than the total pool utilization by about the number
> of OSDs * wal size.  I want to verify that the wal is on the SSDs too but
> I've asked here and no one seems to know a way to verify this.  Do you?
>
>  Thank you, R
>
> On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar 
> wrote:
>
>>
>> If you specify a db on ssd and data on hdd and not explicitly specify a
>> device for wal, wal will be placed on same ssd partition with db.
>> Placing only wal on ssd or creating separate devices for wal and db are
>> less common setups.
>>
>> /Maged
>>
>> On 22/10/18 09:03, Fyodor Ustinov wrote:
>> > Hi!
>> >
>> > For sharing SSD between WAL and DB what should be placed on SSD? WAL or
>> DB?
>> >
>> > - Original Message -
>> > From: "Maged Mokhtar" 
>> > To: "ceph-users" 
>> > Sent: Saturday, 20 October, 2018 20:05:44
>> > Subject: Re: [ceph-users] Drive for Wal and Db
>> >
>> > On 20/10/18 18:57, Robert Stanford wrote:
>> >
>> >
>> >
>> >
>> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has a
>> partition on an SSD for its DB. Wal is on the regular hard drives. Should I
>> move the wal to share the SSD with the DB?
>> >
>> > Regards
>> > R
>> >
>> >
>> > ___
>> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
>> ceph-users@lists.ceph.com ] [
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
>> >
>> > you should put wal on the faster device, wal and db could share the
>> same ssd partition,
>> >
>> > Maged
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-22 Thread David Turner
I haven't had crush-compat do anything helpful for balancing my clusters.
upmap has been amazing and balanced my clusters far better than anything
else I've ever seen.  I would go so far as to say that upmap can achieve a
perfect balance.

It seems to evenly distribute the PGs for each pool onto all OSDs that pool
is on.  It does that with a maximum difference of 1 PG, depending on how
evenly the number of PGs divides by the number of OSDs you have.  As a side
note, your OSD CRUSH weights should be the default weights for their size
for upmap to be as effective as it can be.
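
For reference, enabling it looks roughly like this (only do so if every
client really is Luminous or newer):

  ceph mgr module enable balancer            # if not already enabled
  ceph osd set-require-min-compat-client luminous
  ceph balancer mode upmap
  ceph balancer on
  ceph balancer status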

On Sat, Oct 20, 2018 at 3:58 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Ok, I'll try out the balancer end of the upcoming week then (after we've
> fixed a HW-issue with one of our mons
> and the cooling system).
>
> Until then, any further advice and whether upmap is recommended over
> crush-compat (all clients are Luminous) are welcome ;-).
>
> Cheers,
> Oliver
>
> Am 20.10.18 um 21:26 schrieb Janne Johansson:
> > Ok, can't say "why" then, I'd reweigh them somewhat to even it out,
> > 1.22 -vs- 0.74 in variance is a lot, so either a balancer plugin for
> > the MGRs, a script or just a few manual tweaks might be in order.
> >
> > Den lör 20 okt. 2018 kl 21:02 skrev Oliver Freyermuth
> > :
> >>
> >> All OSDs are of the very same size. One OSD host has slightly more
> disks (33 instead of 31), though.
> >> So also that that can't explain the hefty difference.
> >>
> >> I attach the output of "ceph osd tree" and "ceph osd df".
> >>
> >> The crush rule for the ceph_data pool is:
> >> rule cephfs_data {
> >> id 2
> >> type erasure
> >> min_size 3
> >> max_size 6
> >> step set_chooseleaf_tries 5
> >> step set_choose_tries 100
> >> step take default class hdd
> >> step chooseleaf indep 0 type host
> >> step emit
> >> }
> >> So that only considers the hdd device class. EC is done with k=4 m=2.
> >>
> >> So I don't see any imbalance on the hardware level, but only a somewhat
> uneven distribution of PGs.
> >> Am I missing something, or is this really just a case for the ceph
> balancer plugin?
> >> I'm just a bit astonished this effect is so huge.
> >> Maybe our 4096 PGs for the ceph_data pool are not enough to get an even
> distribution without balancing?
> >> But it yields about 100 PGs per OSD, as you can see...
> >>
> >> --
> >> # ceph osd tree
> >> ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
> >>  -1   826.26428 root default
> >>  -3 0.43700 host mon001
> >>   0   ssd   0.21799 osd.0   up  1.0 1.0
> >>   1   ssd   0.21799 osd.1   up  1.0 1.0
> >>  -5 0.43700 host mon002
> >>   2   ssd   0.21799 osd.2   up  1.0 1.0
> >>   3   ssd   0.21799 osd.3   up  1.0 1.0
> >> -31 1.81898 host mon003
> >> 230   ssd   0.90999 osd.230 up  1.0 1.0
> >> 231   ssd   0.90999 osd.231 up  1.0 1.0
> >> -10   116.64600 host osd001
> >>   4   hdd   3.64499 osd.4   up  1.0 1.0
> >>   5   hdd   3.64499 osd.5   up  1.0 1.0
> >>   6   hdd   3.64499 osd.6   up  1.0 1.0
> >>   7   hdd   3.64499 osd.7   up  1.0 1.0
> >>   8   hdd   3.64499 osd.8   up  1.0 1.0
> >>   9   hdd   3.64499 osd.9   up  1.0 1.0
> >>  10   hdd   3.64499 osd.10  up  1.0 1.0
> >>  11   hdd   3.64499 osd.11  up  1.0 1.0
> >>  12   hdd   3.64499 osd.12  up  1.0 1.0
> >>  13   hdd   3.64499 osd.13  up  1.0 1.0
> >>  14   hdd   3.64499 osd.14  up  1.0 1.0
> >>  15   hdd   3.64499 osd.15  up  1.0 1.0
> >>  16   hdd   3.64499 osd.16  up  1.0 1.0
> >>  17   hdd   3.64499 osd.17  up  1.0 1.0
> >>  18   hdd   3.64499 osd.18  up  1.0 1.0
> >>  19   hdd   3.64499 osd.19  up  1.0 1.0
> >>  20   hdd   3.64499 osd.20  up  1.0 1.0
> >>  21   hdd   3.64499 osd.21  up  1.0 1.0
> >>  22   hdd   3.64499 osd.22  up  1.0 1.0
> >>  23   hdd   3.64499 osd.23  up  1.0 1.0
> >>  24   hdd   3.64499 osd.24  up  1.0 1.0
> >>  25   hdd   3.64499 osd.25  up  1.0 1.0
> >>  26   hdd   3.64499 osd.26  up  1.0 1.0
> >>  27   hdd   3.64499 osd.27  up  1.0 1.0
> >>  28   hdd   3.64499 osd.28  up  1.0 1.0
> >>  29   hdd   3.64499 osd.29  up  1.0 1.0
> >>  30   hdd   3.64499 osd.30  up  1.0 1.0
> >>  31   hdd   3.64499 osd.31  up  1.0 1.0
> >>  32   hdd   

Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread Robert Stanford
 That's what they say; however, I did exactly this and my cluster
utilization is higher than the total pool utilization by about the number
of OSDs * wal size.  I want to verify that the wal is on the SSDs too, but
I've asked here and no one seems to know a way to verify this.  Do you?
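
 If anyone can confirm it, my best guess for checking would be to compare
the reported partition paths with the bluefs wal counters per OSD (osd.0 is
only an example; I can't say this is authoritative):

  ceph osd metadata 0 | grep partition_path
  # wal_total_bytes of 0 would mean there is no dedicated WAL device,
  # i.e. the WAL shares the DB partition
  ceph daemon osd.0 perf dump | grep -E '"wal_(total|used)_bytes"'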

 Thank you, R

On Mon, Oct 22, 2018 at 5:22 AM Maged Mokhtar  wrote:

>
> If you specify a db on ssd and data on hdd and not explicitly specify a
> device for wal, wal will be placed on same ssd partition with db.
> Placing only wal on ssd or creating separate devices for wal and db are
> less common setups.
>
> /Maged
>
> On 22/10/18 09:03, Fyodor Ustinov wrote:
> > Hi!
> >
> > For sharing SSD between WAL and DB what should be placed on SSD? WAL or
> DB?
> >
> > - Original Message -
> > From: "Maged Mokhtar" 
> > To: "ceph-users" 
> > Sent: Saturday, 20 October, 2018 20:05:44
> > Subject: Re: [ceph-users] Drive for Wal and Db
> >
> > On 20/10/18 18:57, Robert Stanford wrote:
> >
> >
> >
> >
> > Our OSDs are BlueStore and are on regular hard drives. Each OSD has a
> partition on an SSD for its DB. Wal is on the regular hard drives. Should I
> move the wal to share the SSD with the DB?
> >
> > Regards
> > R
> >
> >
> > ___
> > ceph-users mailing list [ mailto:ceph-users@lists.ceph.com |
> ceph-users@lists.ceph.com ] [
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]
> >
> > you should put wal on the faster device, wal and db could share the same
> ssd partition,
> >
> > Maged
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Have you ever encountered a similar cephfs deadlock stack?

2018-10-22 Thread ? ?

Hello:
 Have you ever encountered a similar cephfs deadlock stack?

[Sat Oct 20 15:11:40 2018] INFO: task nfsd:27191 blocked for more than 120 
seconds.
[Sat Oct 20 15:11:40 2018]   Tainted: G   OE     
4.14.0-49.el7.centos.x86_64 #1
[Sat Oct 20 15:11:40 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[Sat Oct 20 15:11:40 2018] nfsd            D    0 27191      2 0x8080
[Sat Oct 20 15:11:40 2018] Call Trace:
[Sat Oct 20 15:11:40 2018]  __schedule+0x28d/0x880
[Sat Oct 20 15:11:40 2018]  schedule+0x36/0x80
[Sat Oct 20 15:11:40 2018]  rwsem_down_write_failed+0x20d/0x380
[Sat Oct 20 15:11:40 2018]  ? ip_finish_output2+0x15d/0x390
[Sat Oct 20 15:11:40 2018]  call_rwsem_down_write_failed+0x17/0x30
[Sat Oct 20 15:11:40 2018]  down_write+0x2d/0x40
[Sat Oct 20 15:11:40 2018]  ceph_write_iter+0x101/0xf00 [ceph]
[Sat Oct 20 15:11:40 2018]  ? __ceph_caps_issued_mask+0x1ed/0x200 [ceph]
[Sat Oct 20 15:11:40 2018]  ? nfsd_acceptable+0xa3/0xe0 [nfsd]
[Sat Oct 20 15:11:40 2018]  ? exportfs_decode_fh+0xd2/0x3e0
[Sat Oct 20 15:11:40 2018]  ? nfsd_proc_read+0x1a0/0x1a0 [nfsd]
[Sat Oct 20 15:11:40 2018]  do_iter_readv_writev+0x10b/0x170
[Sat Oct 20 15:11:40 2018]  do_iter_write+0x7f/0x190
[Sat Oct 20 15:11:40 2018]  vfs_iter_write+0x19/0x30
[Sat Oct 20 15:11:40 2018]  nfsd_vfs_write+0xc6/0x360 [nfsd]
[Sat Oct 20 15:11:40 2018]  nfsd4_write+0x1b8/0x260 [nfsd]
[Sat Oct 20 15:11:40 2018]  ? nfsd4_encode_operation+0x13f/0x1c0 [nfsd]
[Sat Oct 20 15:11:40 2018]  nfsd4_proc_compound+0x3e0/0x810 [nfsd]
[Sat Oct 20 15:11:40 2018]  nfsd_dispatch+0xc9/0x2f0 [nfsd]
[Sat Oct 20 15:11:40 2018]  svc_process_common+0x385/0x710 [sunrpc]
[Sat Oct 20 15:11:40 2018]  svc_process+0xfd/0x1c0 [sunrpc]
[Sat Oct 20 15:11:40 2018]  nfsd+0xf3/0x190 [nfsd]
[Sat Oct 20 15:11:40 2018]  kthread+0x109/0x140
[Sat Oct 20 15:11:40 2018]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Sat Oct 20 15:11:40 2018]  ? kthread_park+0x60/0x60
[Sat Oct 20 15:11:40 2018]  ret_from_fork+0x25/0x30
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ubuntu 16.04 failed to connect to socket /com/ubuntu/upstart connection refused

2018-10-22 Thread Liu, Changcheng
Hi all,
 I followed the guide below to deploy Ceph Mimic packages on an Ubuntu 16.04 host.
 http://docs.ceph.com/docs/master/start/quick-ceph-deploy/

I always hit below problem:
[DEBUG ] Setting up ceph-mon (13.2.2-1xenial) ...
[DEBUG ] start: Unable to connect to Upstart: Failed to connect to socket 
/com/ubuntu/upstart: Connection refused

Does anyone know how to resolve the problem?
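
In case it matters, I am also going to check what init PID 1 actually is on
this host, since the ceph-mon postinst appears to be falling back to the
Upstart 'start' tool:

  ps -p 1 -o comm=               # expect 'systemd' on a stock Ubuntu 16.04 install
  systemctl is-system-running
  dpkg -l upstart systemd-sysv 2>/dev/null | grep '^ii'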

B.R.
Changcheng
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client blocks when removing large files

2018-10-22 Thread Yan, Zheng
On Mon, Oct 22, 2018 at 7:47 PM Dylan McCulloch  wrote:
>
> > On Mon, Oct 22, 2018 at 2:37 PM Dylan McCulloch  
> > wrote:
> > >
> > > >
> > > > On Mon, Oct 22, 2018 at 9:46 AM Dylan McCulloch  
> > > > wrote:
> > > > >
> > > > > On Mon, Oct 8, 2018 at 2:57 PM Dylan McCulloch <...@unimelb.edu.au> wrote:
> > > > > >>
> > > > > >> Hi all,
> > > > > >>
> > > > > >>
> > > > > >> We have identified some unexpected blocking behaviour by the 
> > > > > >> ceph-fs kernel client.
> > > > > >>
> > > > > >>
> > > > > >> When performing 'rm' on large files (100+GB), there appears to be 
> > > > > >> a significant delay of 10 seconds or more, before a 'stat' 
> > > > > >> operation can be performed on the same directory on the filesystem.
> > > > > >>
> > > > > >>
> > > > > >> Looking at the kernel client's mds inflight-ops, we observe that 
> > > > > >> there are pending
> > > > > >>
> > > > > >> UNLINK operations corresponding to the deleted files.
> > > > > >>
> > > > > >>
> > > > > >> We have noted some correlation between files being in the client 
> > > > > >> page cache and the blocking behaviour. For example, if the cache 
> > > > > >> is dropped or the filesystem remounted the blocking will not occur.
> > > > > >>
> > > > > >>
> > > > > >> Test scenario below:
> > > > > >>
> > > > > >>
> > > > > >> /mnt/cephfs_mountpoint type ceph 
> > > > > >> (rw,relatime,name=ceph_filesystem,secret=>,noshare,acl,wsize=16777216,rasize=268439552,caps_wanted_delay_min=1,caps_wanted_delay_max=1)
> > > > > >>
> > > > > >>
> > > > > >> Test1:
> > > > > >>
> > > > > >> 1) unmount & remount:
> > > > > >>
> > > > > >>
> > > > > >> 2) Add 10 x 100GB files to a directory:
> > > > > >>
> > > > > >>
> > > > > >> for i in {1..10}; do dd if=/dev/zero 
> > > > > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > > > >>
> > > > > >>
> > > > > >> 3) Delete all files in directory:
> > > > > >>
> > > > > >>
> > > > > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > > > >>
> > > > > >>
> > > > > >> 4) Immediately perform ls on directory:
> > > > > >>
> > > > > >>
> > > > > >> time ls /mnt/cephfs_mountpoint/test1
> > > > > >>
> > > > > >>
> > > > > >> Result: delay ~16 seconds
> > > > > >>
> > > > > >> real0m16.818s
> > > > > >>
> > > > > >> user0m0.000s
> > > > > >>
> > > > > >> sys 0m0.002s
> > > > > >>
> > > > > >>
> > > > >
> > > > > > Are cephfs metadata pool and data pool on the same set of OSD. Is is
> > > > > > possible that heavy data IO slowed down metadata IO?
> > > > >
> > > > > Test results are from a new pre-production cluster that does not have 
> > > > > any significant data IO. We've also confirmed the same behaviour on 
> > > > > another cluster with similar configuration. Both clusters have 
> > > > > separate device-class/crush rule for metadata pool using NVME OSDs, 
> > > > > while the data pool uses HDD OSDs.
> > > > > Most metadata operations are unaffected. It appears that it is only 
> > > > > metadata operations on files that exist in client page cache prior to 
> > > > > rm that are delayed.
> > > > >
> > > >
> > > > Ok. Please enable kernel debug when running 'ls' and send kernel log to 
> > > > us.
> > > >
> > > > echo module ceph +p > /sys/kernel/debug/dynamic_debug/control;
> > > > time /mnt/cephfs_mountpoint/test1
> > > > echo module ceph -p > /sys/kernel/debug/dynamic_debug/control;
> > > >
> > > > Yan, Zheng
> > >
> > > Thanks Yan, Zheng
> > > I've attached two logfiles as I ran the test twice.
> > > The first time as previously described Test1 - cephfs_kern.log
> > > The second time I dropped caches prior to rm as in previous Test2 - 
> > > cephfs_drop_caches_kern.log
> > >
> >
> > The log shows that client waited 16 seconds for readdir reply. please
> > try again with debug mds/ms enabled and send both kerne log and mds
> > log to us.
> >
> > before writing data to files, enable debug_mds and debug_ms (On the
> > machine where mds.0 runs, run 'ceph daemon mds.x config set debug_mds
> > 10; ceph daemon mds.x config set debug_ms 1')
> > ...
> > echo module ceph +p > /sys/kernel/debug/dynamic_debug/control
> > time ls /mnt/cephfs_mountpoint/test1
> > echo module ceph -p > /sys/kernel/debug/dynamic_debug/control
> > disable debug_mds and debug_ms
> >
> > Yan, Zheng
>
> tarball of kernel log and mds debug log uploaded:
> https://swift.rc.nectar.org.au:/v1/AUTH_42/cephfs_issue/mds_debug_kern_logs_20181022_2141.tar.gz?temp_url_sig=51f74f07c77346138a164ed229dc8a92f18bed8d_url_expires=1545046086
>
> Thanks,
> Dylan
>

The log shows that the mds sent the reply immediately after receiving the
readdir request, but the reply message was delayed for 16 seconds. (The mds
sent 5 messages at 2018-10-22 21:39:12, the last one being the readdir reply.
The kclient received the first message at 18739.612013 and received the
readdir reply at 18755.894441.) The delay pattern is that the
kclient received a message, nothing happened for 4 seconds, then it received
another one or two messages, 

Re: [ceph-users] cephfs kernel client blocks when removing large files

2018-10-22 Thread Dylan McCulloch
> On Mon, Oct 22, 2018 at 2:37 PM Dylan McCulloch  wrote:
> >
> > >
> > > On Mon, Oct 22, 2018 at 9:46 AM Dylan McCulloch  
> > > wrote:
> > > >
> > > > On Mon, Oct 8, 2018 at 2:57 PM Dylan McCulloch > 
> > > > wrote:
> > > > >>
> > > > >> Hi all,
> > > > >>
> > > > >>
> > > > >> We have identified some unexpected blocking behaviour by the ceph-fs 
> > > > >> kernel client.
> > > > >>
> > > > >>
> > > > >> When performing 'rm' on large files (100+GB), there appears to be a 
> > > > >> significant delay of 10 seconds or more, before a 'stat' operation 
> > > > >> can be performed on the same directory on the filesystem.
> > > > >>
> > > > >>
> > > > >> Looking at the kernel client's mds inflight-ops, we observe that 
> > > > >> there are pending
> > > > >>
> > > > >> UNLINK operations corresponding to the deleted files.
> > > > >>
> > > > >>
> > > > >> We have noted some correlation between files being in the client 
> > > > >> page cache and the blocking behaviour. For example, if the cache is 
> > > > >> dropped or the filesystem remounted the blocking will not occur.
> > > > >>
> > > > >>
> > > > >> Test scenario below:
> > > > >>
> > > > >>
> > > > >> /mnt/cephfs_mountpoint type ceph 
> > > > >> (rw,relatime,name=ceph_filesystem,secret=>,noshare,acl,wsize=16777216,rasize=268439552,caps_wanted_delay_min=1,caps_wanted_delay_max=1)
> > > > >>
> > > > >>
> > > > >> Test1:
> > > > >>
> > > > >> 1) unmount & remount:
> > > > >>
> > > > >>
> > > > >> 2) Add 10 x 100GB files to a directory:
> > > > >>
> > > > >>
> > > > >> for i in {1..10}; do dd if=/dev/zero 
> > > > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > > >>
> > > > >>
> > > > >> 3) Delete all files in directory:
> > > > >>
> > > > >>
> > > > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > > >>
> > > > >>
> > > > >> 4) Immediately perform ls on directory:
> > > > >>
> > > > >>
> > > > >> time ls /mnt/cephfs_mountpoint/test1
> > > > >>
> > > > >>
> > > > >> Result: delay ~16 seconds
> > > > >>
> > > > >> real0m16.818s
> > > > >>
> > > > >> user0m0.000s
> > > > >>
> > > > >> sys 0m0.002s
> > > > >>
> > > > >>
> > > >
> > > > > Are the cephfs metadata pool and data pool on the same set of OSDs? Is it
> > > > > possible that heavy data IO slowed down metadata IO?
> > > >
> > > > Test results are from a new pre-production cluster that does not have 
> > > > any significant data IO. We've also confirmed the same behaviour on 
> > > > another cluster with similar configuration. Both clusters have separate 
> > > > device-class/crush rule for metadata pool using NVME OSDs, while the 
> > > > data pool uses HDD OSDs.
> > > > Most metadata operations are unaffected. It appears that it is only 
> > > > metadata operations on files that exist in client page cache prior to 
> > > > rm that are delayed.
> > > >
> > >
> > > Ok. Please enable kernel debug when running 'ls' and send kernel log to 
> > > us.
> > >
> > > echo module ceph +p > /sys/kernel/debug/dynamic_debug/control;
> > > time ls /mnt/cephfs_mountpoint/test1
> > > echo module ceph -p > /sys/kernel/debug/dynamic_debug/control;
> > >
> > > Yan, Zheng
> >
> > Thanks Yan, Zheng
> > I've attached two logfiles as I ran the test twice.
> > The first time as previously described Test1 - cephfs_kern.log
> > The second time I dropped caches prior to rm as in previous Test2 - 
> > cephfs_drop_caches_kern.log
> >
> 
> The log shows that the client waited 16 seconds for the readdir reply. Please
> try again with debug mds/ms enabled and send both the kernel log and the mds
> log to us.
> 
> before writing data to files, enable debug_mds and debug_ms (On the
> machine where mds.0 runs, run 'ceph daemon mds.x config set debug_mds
> 10; ceph daemon mds.x config set debug_ms 1')
> ...
> echo module ceph +p > /sys/kernel/debug/dynamic_debug/control
> time ls /mnt/cephfs_mountpoint/test1
> echo module ceph -p > /sys/kernel/debug/dynamic_debug/control
> disable debug_mds and debug_ms
> 
> Yan, Zheng

tarball of kernel log and mds debug log uploaded:
https://swift.rc.nectar.org.au:/v1/AUTH_42/cephfs_issue/mds_debug_kern_logs_20181022_2141.tar.gz?temp_url_sig=51f74f07c77346138a164ed229dc8a92f18bed8d_url_expires=1545046086

Thanks,
Dylan

> >
> > > > >>
> > > > >> Test2:
> > > > >>
> > > > >>
> > > > >> 1) unmount & remount
> > > > >>
> > > > >>
> > > > >> 2) Add 10 x 100GB files to a directory
> > > > >>
> > > > >> for i in {1..10}; do dd if=/dev/zero 
> > > > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > > >>
> > > > >>
> > > > >> 3) Either a) unmount & remount; or b) drop caches
> > > > >>
> > > > >>
> > > > >> echo 3 >>/proc/sys/vm/drop_caches
> > > > >>
> > > > >>
> > > > >> 4) Delete files in directory:
> > > > >>
> > > > >>
> > > > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > > >>
> > > > >>
> > > > >> 5) Immediately perform ls on directory:
> > > > >>
> > > > >>
> > > > >> time ls /mnt/cephfs_mountpoint/test1
> 

Re: [ceph-users] safe to remove leftover bucket index objects

2018-10-22 Thread Luis Periquito
It may be related to http://tracker.ceph.com/issues/34307 - I have a
cluster whose OMAP size is larger than the stored data...
On Mon, Oct 22, 2018 at 11:09 AM Wido den Hollander  wrote:
>
>
>
> On 8/31/18 5:31 PM, Dan van der Ster wrote:
> > So it sounds like you tried what I was going to do, and it broke
> > things. Good to know... thanks.
> >
> > In our case, what triggered the extra index objects was a user running
> > PUT /bucketname/ around 20 million times -- this apparently recreates
> > the index objects.
> >
>
> I'm asking the same!
>
> Large omap object found. Object:
> 6:199f36b7:::.dir.ea087a7e-cb26-420f-9717-a98080b0623c.134167.15.1:head
> Key count: 5374754 Size (bytes): 1366279268
>
> In this case I can't find '134167.15.1' in any of the buckets when I do:
>
> for BUCKET in $(radosgw-admin metadata bucket list|jq -r '.[]'); do
> radosgw-admin metadata get bucket:$BUCKET > bucket.$BUCKET
> done
>
> If I grep through all the bucket.* files this object isn't showing up
> anywhere.
>
> Before I remove the object I want to make sure that it's safe to delete it.
>
> A garbage collector for the bucket index pools would be great to have.
>
> Wido
>
> > -- dan
> >
> > On Thu, Aug 30, 2018 at 7:20 PM David Turner  wrote:
> >>
> >> I'm glad you asked this, because it was on my to-do list. I know that an
> >> object not matching an existing bucket marker does not necessarily mean
> >> it's safe to delete. I have an index pool with 22k objects in it. 70 objects
> >> match existing bucket markers. I was having a problem on the cluster and
> >> started deleting the objects in the index pool, and after going through 200
> >> objects I stopped it, tested, and found that I had lost access to 3 buckets.
> >> Luckily for me they were all buckets I've been working on deleting, so no
> >> need for recovery.
> >>
> >> I then compared bucket IDs to the objects in that pool, but still only found
> >> a couple hundred more matching objects. I have no idea what the other 22k
> >> objects are in the index pool that don't match bucket markers or bucket IDs.
> >> I did confirm there was no resharding happening, both in the reshard list
> >> and in all bucket reshard statuses.
> >>
> >> Does anyone know how to parse the names of these objects and how to tell
> >> what can be deleted? This is of particular interest as I have another
> >> cluster with 1M objects in its index pool.
> >>
> >> On Thu, Aug 30, 2018, 7:29 AM Dan van der Ster  wrote:
> >>>
> >>> Replying to self...
> >>>
> >>> On Wed, Aug 1, 2018 at 11:56 AM Dan van der Ster  
> >>> wrote:
> 
>  Dear rgw friends,
> 
>  Somehow we have more than 20 million objects in our
>  default.rgw.buckets.index pool.
>  They are probably leftover from this issue we had last year:
>  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018565.html
>  and we want to clean the leftover / unused index objects
> 
>  To do this, I would rados ls the pool, get a list of all existing
>  buckets and their current marker, then delete any objects with an
>  unused marker.
>  Does that sound correct?
> >>>
> >>> More precisely, for example, there is an object
> >>> .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 in the index
> >>> pool.
> >>> I run `radosgw-admin bucket stats` to get the marker for all current
> >>> existing buckets.
> >>> The marker 61c59385-085d-4caa-9070-63a3868dccb6.2978181.59 is not
> >>> mentioned in the bucket stats output.
> >>> Is it safe to rados rm 
> >>> .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 ??
> >>>
> >>> Thanks in advance!
> >>>
> >>> -- dan
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
>  Can someone suggest a better way?
> 
>  Cheers, Dan
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread Maged Mokhtar



If you specify a DB on SSD and data on HDD, and do not explicitly specify a
device for the WAL, the WAL will be placed on the same SSD partition as the DB.
Placing only the WAL on SSD, or creating separate devices for WAL and DB, are
less common setups.
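
For example (just a minimal sketch; the device names below are placeholders,
not taken from this thread), an OSD created like this gets its WAL on the same
SSD partition as the DB, because no separate WAL device is given:

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

Only if you also passed --block.wal (e.g. --block.wal /dev/nvme1n1p1) would the
WAL live on its own device, which as noted above is the less common setup.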


/Maged

On 22/10/18 09:03, Fyodor Ustinov wrote:

Hi!

When sharing an SSD between the WAL and the DB, what should be placed on the SSD? The WAL or the DB?

- Original Message -
From: "Maged Mokhtar" 
To: "ceph-users" 
Sent: Saturday, 20 October, 2018 20:05:44
Subject: Re: [ceph-users] Drive for Wal and Db

On 20/10/18 18:57, Robert Stanford wrote:




Our OSDs are BlueStore and are on regular hard drives. Each OSD has a partition 
on an SSD for its DB. Wal is on the regular hard drives. Should I move the wal 
to share the SSD with the DB?

Regards
R


___
ceph-users mailing list [ mailto:ceph-users@lists.ceph.com | 
ceph-users@lists.ceph.com ] [ 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]

You should put the WAL on the faster device; the WAL and DB can share the same SSD 
partition.

Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] safe to remove leftover bucket index objects

2018-10-22 Thread Wido den Hollander



On 8/31/18 5:31 PM, Dan van der Ster wrote:
> So it sounds like you tried what I was going to do, and it broke
> things. Good to know... thanks.
> 
> In our case, what triggered the extra index objects was a user running
> PUT /bucketname/ around 20 million times -- this apparently recreates
> the index objects.
> 

I'm asking the same!

Large omap object found. Object:
6:199f36b7:::.dir.ea087a7e-cb26-420f-9717-a98080b0623c.134167.15.1:head
Key count: 5374754 Size (bytes): 1366279268

In this case I can't find '134167.15.1' in any of the buckets when I do:

for BUCKET in $(radosgw-admin metadata bucket list|jq -r '.[]'); do
radosgw-admin metadata get bucket:$BUCKET > bucket.$BUCKET
done

If I grep through all the bucket.* files this object isn't showing up
anywhere.
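
A rough way to do that cross-check in one go (only a sketch: it assumes the
default index pool name and the usual .dir.<bucket marker>[.<shard>] object
naming, and anything it flags should still be verified by hand before removal)
is to list the index objects whose names contain no current bucket marker or
bucket id:

rados -p default.rgw.buckets.index ls > index-objects.txt
radosgw-admin bucket stats | jq -r '.[] | .marker, .id' | sort -u > known-ids.txt
# index objects matching none of the known markers/ids are candidates for
# being leftovers
grep -v -F -f <(sed 's/^/.dir./' known-ids.txt) index-objects.txt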

Before I remove the object I want to make sure that it's safe to delete it.

A garbage collector for the bucket index pools would be great to have.

Wido

> -- dan
> 
> On Thu, Aug 30, 2018 at 7:20 PM David Turner  wrote:
>>
>> I'm glad you asked this, because it was on my to-do list. I know that an
>> object not matching an existing bucket marker does not necessarily mean it's
>> safe to delete. I have an index pool with 22k objects in it. 70 objects match
>> existing bucket markers. I was having a problem on the cluster and started
>> deleting the objects in the index pool, and after going through 200 objects I
>> stopped it, tested, and found that I had lost access to 3 buckets. Luckily
>> for me they were all buckets I've been working on deleting, so no need for
>> recovery.
>>
>> I then compared bucket IDs to the objects in that pool, but still only found
>> a couple hundred more matching objects. I have no idea what the other 22k
>> objects are in the index pool that don't match bucket markers or bucket IDs.
>> I did confirm there was no resharding happening, both in the reshard list and
>> in all bucket reshard statuses.
>>
>> Does anyone know how to parse the names of these objects and how to tell
>> what can be deleted? This is of particular interest as I have another cluster
>> with 1M objects in its index pool.
>>
>> On Thu, Aug 30, 2018, 7:29 AM Dan van der Ster  wrote:
>>>
>>> Replying to self...
>>>
>>> On Wed, Aug 1, 2018 at 11:56 AM Dan van der Ster  
>>> wrote:

 Dear rgw friends,

 Somehow we have more than 20 million objects in our
 default.rgw.buckets.index pool.
 They are probably leftover from this issue we had last year:
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018565.html
 and we want to clean the leftover / unused index objects

 To do this, I would rados ls the pool, get a list of all existing
 buckets and their current marker, then delete any objects with an
 unused marker.
 Does that sound correct?
>>>
>>> More precisely, for example, there is an object
>>> .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 in the index
>>> pool.
>>> I run `radosgw-admin bucket stats` to get the marker for all current
>>> existing buckets.
>>> The marker 61c59385-085d-4caa-9070-63a3868dccb6.2978181.59 is not
>>> mentioned in the bucket stats output.
>>> Is it safe to rados rm 
>>> .dir.61c59385-085d-4caa-9070-63a3868dccb6.2978181.59.8 ??
>>>
>>> Thanks in advance!
>>>
>>> -- dan
>>>
>>>
>>>
>>>
>>>
>>>
>>>
 Can someone suggest a better way?

 Cheers, Dan
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client blocks when removing large files

2018-10-22 Thread Yan, Zheng
On Mon, Oct 22, 2018 at 2:37 PM Dylan McCulloch  wrote:
>
> >
> > On Mon, Oct 22, 2018 at 9:46 AM Dylan McCulloch  wrote:
> > >
> > > On Mon, Oct 8, 2018 at 2:57 PM Dylan McCulloch > 
> > > wrote:
> > > >>
> > > >> Hi all,
> > > >>
> > > >>
> > > >> We have identified some unexpected blocking behaviour by the ceph-fs 
> > > >> kernel client.
> > > >>
> > > >>
> > > >> When performing 'rm' on large files (100+GB), there appears to be a 
> > > >> significant delay of 10 seconds or more, before a 'stat' operation can 
> > > >> be performed on the same directory on the filesystem.
> > > >>
> > > >>
> > > >> Looking at the kernel client's mds inflight-ops, we observe that there 
> > > >> are pending
> > > >>
> > > >> UNLINK operations corresponding to the deleted files.
> > > >>
> > > >>
> > > >> We have noted some correlation between files being in the client page 
> > > >> cache and the blocking behaviour. For example, if the cache is dropped 
> > > >> or the filesystem remounted the blocking will not occur.
> > > >>
> > > >>
> > > >> Test scenario below:
> > > >>
> > > >>
> > > >> /mnt/cephfs_mountpoint type ceph 
> > > >> (rw,relatime,name=ceph_filesystem,secret=>,noshare,acl,wsize=16777216,rasize=268439552,caps_wanted_delay_min=1,caps_wanted_delay_max=1)
> > > >>
> > > >>
> > > >> Test1:
> > > >>
> > > >> 1) unmount & remount:
> > > >>
> > > >>
> > > >> 2) Add 10 x 100GB files to a directory:
> > > >>
> > > >>
> > > >> for i in {1..10}; do dd if=/dev/zero 
> > > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > >>
> > > >>
> > > >> 3) Delete all files in directory:
> > > >>
> > > >>
> > > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > >>
> > > >>
> > > >> 4) Immediately perform ls on directory:
> > > >>
> > > >>
> > > >> time ls /mnt/cephfs_mountpoint/test1
> > > >>
> > > >>
> > > >> Result: delay ~16 seconds
> > > >>
> > > >> real0m16.818s
> > > >>
> > > >> user0m0.000s
> > > >>
> > > >> sys 0m0.002s
> > > >>
> > > >>
> > >
> > > > Are the cephfs metadata pool and data pool on the same set of OSDs? Is it
> > > > possible that heavy data IO slowed down metadata IO?
> > >
> > > Test results are from a new pre-production cluster that does not have any 
> > > significant data IO. We've also confirmed the same behaviour on another 
> > > cluster with similar configuration. Both clusters have separate 
> > > device-class/crush rule for metadata pool using NVME OSDs, while the data 
> > > pool uses HDD OSDs.
> > > Most metadata operations are unaffected. It appears that it is only 
> > > metadata operations on files that exist in client page cache prior to rm 
> > > that are delayed.
> > >
> >
> > Ok. Please enable kernel debug when running 'ls' and send kernel log to us.
> >
> > echo module ceph +p > /sys/kernel/debug/dynamic_debug/control;
> > time ls /mnt/cephfs_mountpoint/test1
> > echo module ceph -p > /sys/kernel/debug/dynamic_debug/control;
> >
> > Yan, Zheng
>
> Thanks Yan, Zheng
> I've attached two logfiles as I ran the test twice.
> The first time as previously described Test1 - cephfs_kern.log
> The second time I dropped caches prior to rm as in previous Test2 - 
> cephfs_drop_caches_kern.log
>

The log shows that the client waited 16 seconds for the readdir reply. Please
try again with debug mds/ms enabled and send both the kernel log and the mds
log to us.

before writing data to files, enable debug_mds and debug_ms (On the
machine where mds.0 runs, run 'ceph daemon mds.x config set debug_mds
10; ceph daemon mds.x config set debug_ms 1')
...
echo module ceph +p > /sys/kernel/debug/dynamic_debug/control
time ls /mnt/cephfs_mountpoint/test1
echo module ceph -p > /sys/kernel/debug/dynamic_debug/control
disable debug_mds and debug_ms
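
(As a side note, and only a sketch assuming debugfs is mounted at the usual
location: the kernel client's in-flight MDS requests mentioned earlier can be
dumped while the 'ls' is blocked, to confirm whether the UNLINK requests are
still pending.)

cat /sys/kernel/debug/ceph/*/mdsc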

Yan, Zheng

>
> > > >>
> > > >> Test2:
> > > >>
> > > >>
> > > >> 1) unmount & remount
> > > >>
> > > >>
> > > >> 2) Add 10 x 100GB files to a directory
> > > >>
> > > >> for i in {1..10}; do dd if=/dev/zero 
> > > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > > >>
> > > >>
> > > >> 3) Either a) unmount & remount; or b) drop caches
> > > >>
> > > >>
> > > >> echo 3 >>/proc/sys/vm/drop_caches
> > > >>
> > > >>
> > > >> 4) Delete files in directory:
> > > >>
> > > >>
> > > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > > >>
> > > >>
> > > >> 5) Immediately perform ls on directory:
> > > >>
> > > >>
> > > >> time ls /mnt/cephfs_mountpoint/test1
> > > >>
> > > >>
> > > >> Result: no delay
> > > >>
> > > >> real0m0.010s
> > > >>
> > > >> user0m0.000s
> > > >>
> > > >> sys 0m0.001s
> > > >>
> > > >>
> > > >> Our understanding of ceph-fs’ file deletion mechanism is that there 
> > > >> should be no blocking observed on the client. 
> > > >> http://docs.ceph.com/docs/mimic/dev/delayed-delete/ .
> > > >>
> > > >> It appears that if files are cached on the client, either by being 
> > > >> created or accessed recently, it will cause the kernel client to block 
> > > >> for 

Re: [ceph-users] Drive for Wal and Db

2018-10-22 Thread Fyodor Ustinov
Hi!

When sharing an SSD between the WAL and the DB, what should be placed on the SSD? The WAL or the DB?

- Original Message -
From: "Maged Mokhtar" 
To: "ceph-users" 
Sent: Saturday, 20 October, 2018 20:05:44
Subject: Re: [ceph-users] Drive for Wal and Db

On 20/10/18 18:57, Robert Stanford wrote: 




Our OSDs are BlueStore and are on regular hard drives. Each OSD has a partition 
on an SSD for its DB. Wal is on the regular hard drives. Should I move the wal 
to share the SSD with the DB? 

Regards 
R 


___
ceph-users mailing list [ mailto:ceph-users@lists.ceph.com | 
ceph-users@lists.ceph.com ] [ 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 

You should put the WAL on the faster device; the WAL and DB can share the same SSD 
partition. 

Maged 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is rgw.none

2018-10-22 Thread Janne Johansson
On Mon, 6 Aug 2018 at 12:58, Tomasz Płaza wrote:

> Hi all,
>
> I have a bucket with a vary big num_objects in rgw.none:
>
> {
> "bucket": "dyna",
>
> "usage": {
> "rgw.none": {
>
> "num_objects": 18446744073709551615
> }
>
> What is rgw.none and is this big number OK?
>
That number is exactly what -1 looks like when stored in an unsigned 64-bit
integer, so it might either be some kind of "we accidentally subtracted 1 from
0" underflow bug or just a sentinel: an "impossible" value telling some other
code path that this entry is special.


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client blocks when removing large files

2018-10-22 Thread Dylan McCulloch
>
> On Mon, Oct 22, 2018 at 9:46 AM Dylan McCulloch  wrote:
> >
> > On Mon, Oct 8, 2018 at 2:57 PM Dylan McCulloch > wrote:
> > >>
> > >> Hi all,
> > >>
> > >>
> > >> We have identified some unexpected blocking behaviour by the ceph-fs 
> > >> kernel client.
> > >>
> > >>
> > >> When performing 'rm' on large files (100+GB), there appears to be a 
> > >> significant delay of 10 seconds or more, before a 'stat' operation can 
> > >> be performed on the same directory on the filesystem.
> > >>
> > >>
> > >> Looking at the kernel client's mds inflight-ops, we observe that there 
> > >> are pending
> > >>
> > >> UNLINK operations corresponding to the deleted files.
> > >>
> > >>
> > >> We have noted some correlation between files being in the client page 
> > >> cache and the blocking behaviour. For example, if the cache is dropped 
> > >> or the filesystem remounted the blocking will not occur.
> > >>
> > >>
> > >> Test scenario below:
> > >>
> > >>
> > >> /mnt/cephfs_mountpoint type ceph 
> > >> (rw,relatime,name=ceph_filesystem,secret=>,noshare,acl,wsize=16777216,rasize=268439552,caps_wanted_delay_min=1,caps_wanted_delay_max=1)
> > >>
> > >>
> > >> Test1:
> > >>
> > >> 1) unmount & remount:
> > >>
> > >>
> > >> 2) Add 10 x 100GB files to a directory:
> > >>
> > >>
> > >> for i in {1..10}; do dd if=/dev/zero 
> > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > >>
> > >>
> > >> 3) Delete all files in directory:
> > >>
> > >>
> > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > >>
> > >>
> > >> 4) Immediately perform ls on directory:
> > >>
> > >>
> > >> time ls /mnt/cephfs_mountpoint/test1
> > >>
> > >>
> > >> Result: delay ~16 seconds
> > >>
> > >> real0m16.818s
> > >>
> > >> user0m0.000s
> > >>
> > >> sys 0m0.002s
> > >>
> > >>
> >
> > > Are the cephfs metadata pool and data pool on the same set of OSDs? Is it
> > > possible that heavy data IO slowed down metadata IO?
> >
> > Test results are from a new pre-production cluster that does not have any 
> > significant data IO. We've also confirmed the same behaviour on another 
> > cluster with similar configuration. Both clusters have separate 
> > device-class/crush rule for metadata pool using NVME OSDs, while the data 
> > pool uses HDD OSDs.
> > Most metadata operations are unaffected. It appears that it is only 
> > metadata operations on files that exist in client page cache prior to rm 
> > that are delayed.
> >
>
> Ok. Please enable kernel debug when running 'ls' and send kernel log to us.
>
> echo module ceph +p > /sys/kernel/debug/dynamic_debug/control;
> time ls /mnt/cephfs_mountpoint/test1
> echo module ceph -p > /sys/kernel/debug/dynamic_debug/control;
>
> Yan, Zheng

Thanks Yan, Zheng
I've attached two logfiles as I ran the test twice.
The first time as previously described Test1 - cephfs_kern.log
The second time I dropped caches prior to rm as in previous Test2 - 
cephfs_drop_caches_kern.log


> > >>
> > >> Test2:
> > >>
> > >>
> > >> 1) unmount & remount
> > >>
> > >>
> > >> 2) Add 10 x 100GB files to a directory
> > >>
> > >> for i in {1..10}; do dd if=/dev/zero 
> > >> of=/mnt/cephfs_mountpoint/file$i.txt count=102400 bs=1048576; done
> > >>
> > >>
> > >> 3) Either a) unmount & remount; or b) drop caches
> > >>
> > >>
> > >> echo 3 >>/proc/sys/vm/drop_caches
> > >>
> > >>
> > >> 4) Delete files in directory:
> > >>
> > >>
> > >> for i in {1..10};do rm -f /mnt/cephfs_mountpoint/file$i.txt; done
> > >>
> > >>
> > >> 5) Immediately perform ls on directory:
> > >>
> > >>
> > >> time ls /mnt/cephfs_mountpoint/test1
> > >>
> > >>
> > >> Result: no delay
> > >>
> > >> real0m0.010s
> > >>
> > >> user0m0.000s
> > >>
> > >> sys 0m0.001s
> > >>
> > >>
> > >> Our understanding of ceph-fs’ file deletion mechanism is that there 
> > >> should be no blocking observed on the client. 
> > >> http://docs.ceph.com/docs/mimic/dev/delayed-delete/ .
> > >>
> > >> It appears that if files are cached on the client, either by being 
> > >> created or accessed recently, it will cause the kernel client to block 
> > >> for reasons we have not identified.
> > >>
> > >>
> > >> Is this a known issue, and are there any ways to mitigate this behaviour?
> > >>
> > >> Our production system relies on our client’s processes having concurrent 
> > >> access to the file system, and access contention must be avoided.
> > >>
> > >>
> > >> An old mailing list post that discusses changes to client’s page cache 
> > >> behaviour may be relevant.
> > >>
> > >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005692.html
> > >>
> > >>
> > >> Client System:
> > >>
> > >>
> > >> OS: RHEL7
> > >>
> > >> Kernel: 4.15.15-1
> > >>
> > >>
> > >> Cluster: Ceph: Luminous 12.2.8
> > >>
> > >>
> >
> >
> >
> > >> Thanks,
> > >>
> > >> Dylan
> > >>
> > >>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >>