[ceph-users] _setup_block_symlink_or_file failed to create block symlink to spdk:5780A001A5KD: (17) File exists

2018-04-27 Thread Yang, Liang
hi,

I am trying to enable SPDK on Ceph and got the error below. Could someone
help me? Thank you very much.

1. The SPDK code is compiled by default:
if(CMAKE_SYSTEM_PROCESSOR MATCHES "i386|i686|amd64|x86_64|AMD64|aarch64")
  option(WITH_SPDK "Enable SPDK" ON)
else()
  option(WITH_SPDK "Enable SPDK" OFF)
endif()

2. bluestore_block_path = spdk:5780A001A5KD


3.

ceph-disk prepare --zap-disk --cluster ceph --cluster-uuid $ceph_fsid --bluestore /dev/nvme0n1

ceph-disk activate /dev/nvme0n1p1    (this step failed; the error information is below)

[root@ceph-rep-05 ceph-ansible-hxt-0417]# ceph-disk activate /dev/nvme0n1p1

/usr/lib/python2.7/site-packages/ceph_disk/main.py:5689: UserWarning:

***

This tool is now deprecated in favor of ceph-volume.

It is recommended to use ceph-volume for OSD deployments. For details see:



http://docs.ceph.com/docs/master/ceph-volume/#migrating



***



  warnings.warn(DEPRECATION_WARNING)

got monmap epoch 1

2018-04-26 17:57:21.897 a409 -1 bluestore(/var/lib/ceph/tmp/mnt.5lt4X5) 
_setup_block_symlink_or_file failed to create block symlink to 
spdk:5780A001A5KD: (17) File exists

2018-04-26 17:57:21.897 a409 -1 bluestore(/var/lib/ceph/tmp/mnt.5lt4X5) 
mkfs failed, (17) File exists

2018-04-26 17:57:21.897 a409 -1 OSD::mkfs: ObjectStore::mkfs failed 
with error (17) File exists

2018-04-26 17:57:21.897 a409 -1  ** ERROR: error creating empty object 
store in /var/lib/ceph/tmp/mnt.5lt4X5: (17) File exists

mount_activate: Failed to activate

Traceback (most recent call last):

  File "/sbin/ceph-disk", line 11, in 

load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()

  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5772, in run

main(sys.argv[1:])

  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5718, in main

main_catch(args.func, args)

  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5746, in 
main_catch

func(args)

  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3796, in 
main_activate

reactivate=args.reactivate,

  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3559, in 
mount_activate

(osd_id, cluster) = activate(path, activate_key_template, init)

  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3736, in 
activate

keyring=keyring,

  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3187, in mkfs

'--setgroup', get_ceph_group(),

  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 577, in 
command_check_call

return subprocess.check_call(arguments)

  File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call

raise CalledProcessError(retcode, cmd)

subprocess.CalledProcessError: Command '['/usr/bin/ceph-osd', 
'--no-mon-config', '--cluster', 'ceph', '--mkfs', '-i', u'0', '--monmap', 
'/var/lib/ceph/tmp/mnt.5lt4X5/activate.monmap', '--osd-data', 
'/var/lib/ceph/tmp/mnt.5lt4X5', '--osd-uuid', 
u'8683718d-0734-4043-827c-3d1ec4f65422', '--setuser', 'ceph', '--setgroup', 
'ceph']' returned non-zero exit status 250
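
A likely cause, judging from the EEXIST above: ceph-disk prepare already created a
"block" symlink (pointing at the block partition it made on the NVMe) inside the OSD
data directory, and bluestore mkfs then fails when it tries to create its own "block"
entry for the spdk: path from bluestore_block_path. A rough way to confirm, assuming
the temporary mount from the log is still reachable (the mnt.5lt4X5 name is taken from
the output above):

ls -l /var/lib/ceph/tmp/mnt.5lt4X5/
readlink /var/lib/ceph/tmp/mnt.5lt4X5/block
# if "block" already points at an nvme partition, it conflicts with
# bluestore_block_path = spdk:...; either drop that setting for ceph-disk-managed
# OSDs or remove the stale symlink before re-running activate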

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent metadata seen by CephFS-fuse clients

2018-04-27 Thread Yan, Zheng
On Sat, Apr 28, 2018 at 10:25 AM, Oliver Freyermuth
 wrote:
> Am 28.04.2018 um 03:55 schrieb Yan, Zheng:
>> On Fri, Apr 27, 2018 at 11:49 PM, Oliver Freyermuth
>>  wrote:
>>> Dear Yan Zheng,
>>>
>>> Am 27.04.2018 um 15:32 schrieb Yan, Zheng:
 On Fri, Apr 27, 2018 at 7:10 PM, Oliver Freyermuth
  wrote:
> Dear Yan Zheng,
>
> Am 27.04.2018 um 02:58 schrieb Yan, Zheng:
>> On Thu, Apr 26, 2018 at 10:00 PM, Oliver Freyermuth
>>  wrote:
>>> Dear Cephalopodians,
>>>
>>> just now that our Ceph cluster is under high I/O load, we get user 
>>> reports of files not being seen on some clients,
>>> but somehow showing up after forcing a stat() syscall.
>>>
>>> For example, one user had added several files to a directory via an NFS 
>>> client attached to nfs-ganesha (which uses libcephfs),
>>> and afterwards, all other nfs-ganesha servers saw it, and 44 of our 
>>> Fuse-clients -
>>> but one single client still saw the old contents of the directory, i.e. 
>>> the files seemed missing(!).
>>> This happened both when using "ls" on the directory or when trying to 
>>> access the non-existent files directly.
>>>
>>> I could confirm this observation also in a fresh login shell on the 
>>> machine.
>>>
>>> Then, on the "broken" client, I entered in the directory which seemed 
>>> to contain only the "old" content, and I created a new file in there.
>>> This worked fine, and all other clients saw the file immediately.
>>> Also on the broken client, metadata was now updated and all other files 
>>> appeared - i.e. everything was "in sync" again.
>>>
>>> There's nothing in the ceph-logs of our MDS, or in the syslogs of the 
>>> client machine / MDS.
>>>
>>>
>>> Another user observed the same, but not explicitly limited to one 
>>> machine (it seems random).
>>> He now uses a "stat" on the file he expects to exist (but which is not 
>>> seen with "ls").
>>> The stat returns "No such file", a subsequent "ls" then however lists 
>>> the file, and it can be accessed normally.
>>>
>>> This feels like something is messed up concerning the client caps - 
>>> these are all 12.2.4 Fuse clients.
>>>
>>> Any ideas how to find the cause?
>>> It only happens since recently, and under high I/O load with many 
>>> metadata operations.
>>>
>>
>> Sounds like bug in readdir cache. Could you try the attached patch.
>
> Many thanks for the quick response and patch!
> The problem is to try it out. We only observe this issue on our 
> production cluster, randomly, especially during high load, and only after 
> is has been running for a few days.
> We don't have a test Ceph cluster available of similar size and with 
> similar load. I would not like to try out the patch on our production 
> system.
>
> Can you extrapolate from the bugfix / patch what's the minimal setup 
> needed to reproduce / trigger the issue?
> Then we may look into setting up a minimal test setup to check whether 
> the issue is resolved.
>
> All the best and many thanks,
> Oliver
>

 I think this is libcephfs version of
 http://tracker.ceph.com/issues/20467. I forgot to write patch for
 libcephfs, Sorry. To reproduce this,  write a program that call
 getdents(2) in a loop. Add artificially delay to the loop, make the
 program iterates whole directory in about ten seconds. Run several
 instance of the program simultaneously on a large directory. Also make
 client_cache_size a little smaller than the size of directory.
>>>
>>> This is strange - in case 1 where our users observed the issue,
>>> the affected directory contained exactly 1 file, which some clients saw and 
>>> others did not.
>>> In case 2, the affected directory contained about 5 files only.
>>>
>>> Of course, we also have directories with many (thousands) of files in our 
>>> CephFS, and they may be accessed in parallel.
>>> Also, we run a massive number of parallel programs (about 2000) accessing 
>>> the FS via about 40 clients.
>>>
>>> 1. Could this still be the same issue?
>>> 2. Many thanks for the repro-instructions. It seems, however, this would 
>>> require quite an amount of time,
>>>since we don't have a separate "test" instance at hand (yet) and are not 
>>> experts on the field.
>>>We could try, but it won't be fast... And meybe it's nicer to have 
>>> something like this in the test suite, if possible.
>>>
>>> Potentially, it's even faster to get the fix in the next patch release, if 
>>> it's clear this can not have bad side effects.
>>>
>>> Also, should we transfer this information to a ticket?
>>>
>>> Cheers and many thanks,
>>> Oliver
>>>
>>
>> I found an issue in the code that handles session stale messages. Steps
>> to reproduce are at http://tracker.ceph.com/issues/23894.

Re: [ceph-users] Inconsistent metadata seen by CephFS-fuse clients

2018-04-27 Thread Oliver Freyermuth
On 28.04.2018 at 03:55, Yan, Zheng wrote:
> On Fri, Apr 27, 2018 at 11:49 PM, Oliver Freyermuth
>  wrote:
>> Dear Yan Zheng,
>>
>> Am 27.04.2018 um 15:32 schrieb Yan, Zheng:
>>> On Fri, Apr 27, 2018 at 7:10 PM, Oliver Freyermuth
>>>  wrote:
 Dear Yan Zheng,

 Am 27.04.2018 um 02:58 schrieb Yan, Zheng:
> On Thu, Apr 26, 2018 at 10:00 PM, Oliver Freyermuth
>  wrote:
>> Dear Cephalopodians,
>>
>> just now that our Ceph cluster is under high I/O load, we get user 
>> reports of files not being seen on some clients,
>> but somehow showing up after forcing a stat() syscall.
>>
>> For example, one user had added several files to a directory via an NFS 
>> client attached to nfs-ganesha (which uses libcephfs),
>> and afterwards, all other nfs-ganesha servers saw it, and 44 of our 
>> Fuse-clients -
>> but one single client still saw the old contents of the directory, i.e. 
>> the files seemed missing(!).
>> This happened both when using "ls" on the directory or when trying to 
>> access the non-existent files directly.
>>
>> I could confirm this observation also in a fresh login shell on the 
>> machine.
>>
>> Then, on the "broken" client, I entered in the directory which seemed to 
>> contain only the "old" content, and I created a new file in there.
>> This worked fine, and all other clients saw the file immediately.
>> Also on the broken client, metadata was now updated and all other files 
>> appeared - i.e. everything was "in sync" again.
>>
>> There's nothing in the ceph-logs of our MDS, or in the syslogs of the 
>> client machine / MDS.
>>
>>
>> Another user observed the same, but not explicitly limited to one 
>> machine (it seems random).
>> He now uses a "stat" on the file he expects to exist (but which is not 
>> seen with "ls").
>> The stat returns "No such file", a subsequent "ls" then however lists 
>> the file, and it can be accessed normally.
>>
>> This feels like something is messed up concerning the client caps - 
>> these are all 12.2.4 Fuse clients.
>>
>> Any ideas how to find the cause?
>> It only happens since recently, and under high I/O load with many 
>> metadata operations.
>>
>
> Sounds like bug in readdir cache. Could you try the attached patch.

 Many thanks for the quick response and patch!
 The problem is to try it out. We only observe this issue on our production 
 cluster, randomly, especially during high load, and only after is has been 
 running for a few days.
 We don't have a test Ceph cluster available of similar size and with 
 similar load. I would not like to try out the patch on our production 
 system.

 Can you extrapolate from the bugfix / patch what's the minimal setup 
 needed to reproduce / trigger the issue?
 Then we may look into setting up a minimal test setup to check whether the 
 issue is resolved.

 All the best and many thanks,
 Oliver

>>>
>>> I think this is libcephfs version of
>>> http://tracker.ceph.com/issues/20467. I forgot to write patch for
>>> libcephfs, Sorry. To reproduce this,  write a program that call
>>> getdents(2) in a loop. Add artificially delay to the loop, make the
>>> program iterates whole directory in about ten seconds. Run several
>>> instance of the program simultaneously on a large directory. Also make
>>> client_cache_size a little smaller than the size of directory.
>>
>> This is strange - in case 1 where our users observed the issue,
>> the affected directory contained exactly 1 file, which some clients saw and 
>> others did not.
>> In case 2, the affected directory contained about 5 files only.
>>
>> Of course, we also have directories with many (thousands) of files in our 
>> CephFS, and they may be accessed in parallel.
>> Also, we run a massive number of parallel programs (about 2000) accessing 
>> the FS via about 40 clients.
>>
>> 1. Could this still be the same issue?
>> 2. Many thanks for the repro-instructions. It seems, however, this would 
>> require quite an amount of time,
>>since we don't have a separate "test" instance at hand (yet) and are not 
>> experts on the field.
>>We could try, but it won't be fast... And meybe it's nicer to have 
>> something like this in the test suite, if possible.
>>
>> Potentially, it's even faster to get the fix in the next patch release, if 
>> it's clear this can not have bad side effects.
>>
>> Also, should we transfer this information to a ticket?
>>
>> Cheers and many thanks,
>> Oliver
>>
> 
> I found an issue in the code that handle session stale message. Steps
> to reproduce are at http://tracker.ceph.com/issues/23894.

Thanks, yes, this seems a lot more likely to be our issue - especially, those 

Re: [ceph-users] Inconsistent metadata seen by CephFS-fuse clients

2018-04-27 Thread Yan, Zheng
On Fri, Apr 27, 2018 at 11:49 PM, Oliver Freyermuth
 wrote:
> Dear Yan Zheng,
>
> Am 27.04.2018 um 15:32 schrieb Yan, Zheng:
>> On Fri, Apr 27, 2018 at 7:10 PM, Oliver Freyermuth
>>  wrote:
>>> Dear Yan Zheng,
>>>
>>> Am 27.04.2018 um 02:58 schrieb Yan, Zheng:
 On Thu, Apr 26, 2018 at 10:00 PM, Oliver Freyermuth
  wrote:
> Dear Cephalopodians,
>
> just now that our Ceph cluster is under high I/O load, we get user 
> reports of files not being seen on some clients,
> but somehow showing up after forcing a stat() syscall.
>
> For example, one user had added several files to a directory via an NFS 
> client attached to nfs-ganesha (which uses libcephfs),
> and afterwards, all other nfs-ganesha servers saw it, and 44 of our 
> Fuse-clients -
> but one single client still saw the old contents of the directory, i.e. 
> the files seemed missing(!).
> This happened both when using "ls" on the directory or when trying to 
> access the non-existent files directly.
>
> I could confirm this observation also in a fresh login shell on the 
> machine.
>
> Then, on the "broken" client, I entered in the directory which seemed to 
> contain only the "old" content, and I created a new file in there.
> This worked fine, and all other clients saw the file immediately.
> Also on the broken client, metadata was now updated and all other files 
> appeared - i.e. everything was "in sync" again.
>
> There's nothing in the ceph-logs of our MDS, or in the syslogs of the 
> client machine / MDS.
>
>
> Another user observed the same, but not explicitly limited to one machine 
> (it seems random).
> He now uses a "stat" on the file he expects to exist (but which is not 
> seen with "ls").
> The stat returns "No such file", a subsequent "ls" then however lists the 
> file, and it can be accessed normally.
>
> This feels like something is messed up concerning the client caps - these 
> are all 12.2.4 Fuse clients.
>
> Any ideas how to find the cause?
> It only happens since recently, and under high I/O load with many 
> metadata operations.
>

 Sounds like bug in readdir cache. Could you try the attached patch.
>>>
>>> Many thanks for the quick response and patch!
>>> The problem is to try it out. We only observe this issue on our production 
>>> cluster, randomly, especially during high load, and only after is has been 
>>> running for a few days.
>>> We don't have a test Ceph cluster available of similar size and with 
>>> similar load. I would not like to try out the patch on our production 
>>> system.
>>>
>>> Can you extrapolate from the bugfix / patch what's the minimal setup needed 
>>> to reproduce / trigger the issue?
>>> Then we may look into setting up a minimal test setup to check whether the 
>>> issue is resolved.
>>>
>>> All the best and many thanks,
>>> Oliver
>>>
>>
>> I think this is libcephfs version of
>> http://tracker.ceph.com/issues/20467. I forgot to write patch for
>> libcephfs, Sorry. To reproduce this,  write a program that call
>> getdents(2) in a loop. Add artificially delay to the loop, make the
>> program iterates whole directory in about ten seconds. Run several
>> instance of the program simultaneously on a large directory. Also make
>> client_cache_size a little smaller than the size of directory.
>
> This is strange - in case 1 where our users observed the issue,
> the affected directory contained exactly 1 file, which some clients saw and 
> others did not.
> In case 2, the affected directory contained about 5 files only.
>
> Of course, we also have directories with many (thousands) of files in our 
> CephFS, and they may be accessed in parallel.
> Also, we run a massive number of parallel programs (about 2000) accessing the 
> FS via about 40 clients.
>
> 1. Could this still be the same issue?
> 2. Many thanks for the repro-instructions. It seems, however, this would 
> require quite an amount of time,
>since we don't have a separate "test" instance at hand (yet) and are not 
> experts on the field.
>We could try, but it won't be fast... And meybe it's nicer to have 
> something like this in the test suite, if possible.
>
> Potentially, it's even faster to get the fix in the next patch release, if 
> it's clear this can not have bad side effects.
>
> Also, should we transfer this information to a ticket?
>
> Cheers and many thanks,
> Oliver
>

I found an issue in the code that handles session stale messages. Steps
to reproduce are at http://tracker.ceph.com/issues/23894.

Regards
Yan, Zheng

>>
>> Regards
>> Yan, Zheng
>>
>>>

 Regards
 Yan, Zheng


> Cheers,
> Oliver
>
>

> ___
> ceph-users mailing list
> 

Re: [ceph-users] Inconsistent metadata seen by CephFS-fuse clients

2018-04-27 Thread Yan, Zheng
On Fri, Apr 27, 2018 at 11:49 PM, Oliver Freyermuth
 wrote:
> Dear Yan Zheng,
>
> Am 27.04.2018 um 15:32 schrieb Yan, Zheng:
>> On Fri, Apr 27, 2018 at 7:10 PM, Oliver Freyermuth
>>  wrote:
>>> Dear Yan Zheng,
>>>
>>> Am 27.04.2018 um 02:58 schrieb Yan, Zheng:
 On Thu, Apr 26, 2018 at 10:00 PM, Oliver Freyermuth
  wrote:
> Dear Cephalopodians,
>
> just now that our Ceph cluster is under high I/O load, we get user 
> reports of files not being seen on some clients,
> but somehow showing up after forcing a stat() syscall.
>
> For example, one user had added several files to a directory via an NFS 
> client attached to nfs-ganesha (which uses libcephfs),
> and afterwards, all other nfs-ganesha servers saw it, and 44 of our 
> Fuse-clients -
> but one single client still saw the old contents of the directory, i.e. 
> the files seemed missing(!).
> This happened both when using "ls" on the directory or when trying to 
> access the non-existent files directly.
>
> I could confirm this observation also in a fresh login shell on the 
> machine.
>
> Then, on the "broken" client, I entered in the directory which seemed to 
> contain only the "old" content, and I created a new file in there.
> This worked fine, and all other clients saw the file immediately.
> Also on the broken client, metadata was now updated and all other files 
> appeared - i.e. everything was "in sync" again.
>
> There's nothing in the ceph-logs of our MDS, or in the syslogs of the 
> client machine / MDS.
>
>
> Another user observed the same, but not explicitly limited to one machine 
> (it seems random).
> He now uses a "stat" on the file he expects to exist (but which is not 
> seen with "ls").
> The stat returns "No such file", a subsequent "ls" then however lists the 
> file, and it can be accessed normally.
>
> This feels like something is messed up concerning the client caps - these 
> are all 12.2.4 Fuse clients.
>
> Any ideas how to find the cause?
> It only happens since recently, and under high I/O load with many 
> metadata operations.
>

 Sounds like bug in readdir cache. Could you try the attached patch.
>>>
>>> Many thanks for the quick response and patch!
>>> The problem is to try it out. We only observe this issue on our production 
>>> cluster, randomly, especially during high load, and only after is has been 
>>> running for a few days.
>>> We don't have a test Ceph cluster available of similar size and with 
>>> similar load. I would not like to try out the patch on our production 
>>> system.
>>>
>>> Can you extrapolate from the bugfix / patch what's the minimal setup needed 
>>> to reproduce / trigger the issue?
>>> Then we may look into setting up a minimal test setup to check whether the 
>>> issue is resolved.
>>>
>>> All the best and many thanks,
>>> Oliver
>>>
>>
>> I think this is libcephfs version of
>> http://tracker.ceph.com/issues/20467. I forgot to write patch for
>> libcephfs, Sorry. To reproduce this,  write a program that call
>> getdents(2) in a loop. Add artificially delay to the loop, make the
>> program iterates whole directory in about ten seconds. Run several
>> instance of the program simultaneously on a large directory. Also make
>> client_cache_size a little smaller than the size of directory.
>
> This is strange - in case 1 where our users observed the issue,
> the affected directory contained exactly 1 file, which some clients saw and 
> others did not.
> In case 2, the affected directory contained about 5 files only.
>
> Of course, we also have directories with many (thousands) of files in our 
> CephFS, and they may be accessed in parallel.
> Also, we run a massive number of parallel programs (about 2000) accessing the 
> FS via about 40 clients.
>
> 1. Could this still be the same issue?
> 2. Many thanks for the repro-instructions. It seems, however, this would 
> require quite an amount of time,
>since we don't have a separate "test" instance at hand (yet) and are not 
> experts on the field.
>We could try, but it won't be fast... And meybe it's nicer to have 
> something like this in the test suite, if possible.
>

This issue should be caused by a different bug. I will check the code.

Regards
Yan, Zheng


> Potentially, it's even faster to get the fix in the next patch release, if 
> it's clear this can not have bad side effects.
>
> Also, should we transfer this information to a ticket?
>
> Cheers and many thanks,
> Oliver
>
>>
>> Regards
>> Yan, Zheng
>>
>>>

 Regards
 Yan, Zheng


> Cheers,
> Oliver
>
>

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> 

Re: [ceph-users] Deleting an rbd image hangs

2018-04-27 Thread Jason Dillaman
Do you have any idea why the OSDs crash? Anything in the logs? Can
you provide an "rbd info noc_tobedeleted"?

On Thu, Apr 26, 2018 at 9:24 AM, Jan Marquardt  wrote:
> Hi,
>
> I am currently trying to delete an rbd image which is seemingly causing
> our OSDs to crash, but it always gets stuck at 3%.
>
> root@ceph4:~# rbd rm noc_tobedeleted
> Removing image: 3% complete...
>
> Is there any way to force the deletion? Any other advices?
>
> Best Regards
>
> Jan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Where to place Block-DB?

2018-04-27 Thread David Turner
With filestore, if the NVMe actually died and you were unable to flush the
journal to the data part of the OSD, then you lost the full OSD as well.
That part hasn't changed at all from filestore to bluestore.  There have
been some other tickets on the ML here that talk about using `dd` to
replace a block-db disk before it fails.  This procedure works fine.
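
For reference, the dd-based swap discussed in those threads looks roughly like this;
a sketch only, assuming the old DB partition is still readable, the new partition is
at least as large, and osd.3 / the device names are placeholders:

systemctl stop ceph-osd@3
# copy the RocksDB partition bit-for-bit to the new SSD partition
dd if=/dev/old-ssd-part of=/dev/new-ssd-part bs=1M oflag=direct status=progress
# repoint the OSD's block.db symlink at the new partition and fix ownership
ln -sfn /dev/new-ssd-part /var/lib/ceph/osd/ceph-3/block.db
chown -h ceph:ceph /var/lib/ceph/osd/ceph-3/block.db
systemctl start ceph-osd@3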

On Thu, Apr 26, 2018 at 11:28 PM Konstantin Shalygin  wrote:

> > With data located on the OSD (recovery) or as fresh formatted OSD?
> > Thank you.
>
>
> With bluestore NVMe frontend is a part of osd. When frontend dies -
> backend without db is a junk of bytes.
>
>
>
> k
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deleting an rbd image hangs

2018-04-27 Thread David Turner
This old [1] blog post about removing super large RBDs is not relevant if
you're using object map on the RBDs; however, its method for manually deleting
an RBD is still valid.  You can see if this works for you to manually
remove the problem RBD you're having.

[1] http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image
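
Roughly, the manual approach from that post (the prefix shown is a made-up example;
take it from your own rbd info output):

# find the image's data object prefix, e.g. rbd_data.1234567890ab
rbd info noc_tobedeleted | grep block_name_prefix
# delete the data objects directly, in batches, instead of letting rbd rm walk them
rados -p rbd ls | grep '^rbd_data\.1234567890ab\.' | xargs -n 200 rados -p rbd rm
# the remaining header/metadata cleanup should then be quick
rbd rm noc_tobedeleted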

On Thu, Apr 26, 2018 at 9:25 AM Jan Marquardt  wrote:

> Hi,
>
> I am currently trying to delete an rbd image which is seemingly causing
> our OSDs to crash, but it always gets stuck at 3%.
>
> root@ceph4:~# rbd rm noc_tobedeleted
> Removing image: 3% complete...
>
> Is there any way to force the deletion? Any other advices?
>
> Best Regards
>
> Jan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] trimming the MON level db

2018-04-27 Thread David Turner
I'm assuming that the "very bad move" means that you have some PGs not in
active+clean.  Any non-active+clean PG will prevent your mons from being
able to compact their db store.  This is by design so that if something
were to happen where the data on some of the copies of the PG were lost and
gone forever the mons could do their best to enable the cluster to
reconstruct the PG knowing when OSDs went down/up, when PGs moved to new
locations, etc.

Thankfully there isn't a way around this.  Something you can do is stop a
mon, move the /var/lib/ceph/mon/ceph-$(hostname -s)/ folder to a new disk with more
space, set it to mount in the proper location, and start it back up.  You
would want to do this for each mon to give them more room for the mon store
to grow.  Make sure to give the mon plenty of time to get back up into the
quorum before moving on to the next one.
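
For what it's worth, a sketch of that procedure, one mon at a time, assuming the
default store path and a placeholder device /dev/sdX1:

systemctl stop ceph-mon@$(hostname -s)
mkfs.xfs /dev/sdX1                          # new, larger disk
mount /dev/sdX1 /mnt/newmon
rsync -a /var/lib/ceph/mon/ceph-$(hostname -s)/ /mnt/newmon/
umount /mnt/newmon
mv /var/lib/ceph/mon/ceph-$(hostname -s) /var/lib/ceph/mon/ceph-$(hostname -s).old
mkdir /var/lib/ceph/mon/ceph-$(hostname -s)
echo "/dev/sdX1 /var/lib/ceph/mon/ceph-$(hostname -s) xfs defaults 0 0" >> /etc/fstab
mount /var/lib/ceph/mon/ceph-$(hostname -s)
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$(hostname -s)
systemctl start ceph-mon@$(hostname -s)
# wait for it to rejoin quorum (ceph quorum_status) before touching the next mon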

On Wed, Apr 25, 2018 at 10:25 AM Luis Periquito  wrote:

> Hi all,
>
> we have a (really) big cluster that's ongoing a very bad move and the
> monitor database is growing at an alarming rate.
>
> The cluster is running jewel (10.2.7) and is there any way to trim the
> monitor database before it gets HEALTH_OK?
>
> I've searched and so far only found people saying not really, but just
> wanted a final sanity check...
>
> thanks,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backup LUKS/Dmcrypt keys

2018-04-27 Thread David Turner
IIRC the dmcrypt keys in Jewel were moved to a partition on the OSD.  You
should be able to find the keys by mounting those partitions.  That is
assuming filestore.  I don't know where they are for bluestore.
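
If the OSDs were created with ceph-disk's lockbox scheme, the dm-crypt keys may also
be kept in the monitor config-key store rather than on disk; a quick hedged check
(the dm-crypt/osd/<osd-uuid>/luks naming and the example uuid are assumptions to
verify on your own cluster):

# list any dm-crypt keys the mons are holding
ceph config-key dump | grep dm-crypt
# dump one of them to a file for backup (uuid is a placeholder)
ceph config-key get dm-crypt/osd/11111111-2222-3333-4444-555555555555/luks > osd-luks.key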

On Wed, Apr 25, 2018 at 4:29 PM Kevin Olbrich  wrote:

> Hi,
>
> how can I backup the dmcrypt keys on luminous?
> The folder under /etc/ceph does not exist anymore.
>
> Kind regards
> Kevin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The mystery of sync modules

2018-04-27 Thread Janne Johansson
2018-04-27 17:33 GMT+02:00 Sean Purdy :
>
> Mimic has a new feature, a cloud sync module for radosgw to sync objects
> to some other S3-compatible destination.
>
> This would be a lovely thing to have here, and ties in nicely with object
> versioning and DR.  But I am put off by confusion and complexity with the
> whole multisite/realm/zone group/zone thing, and the docs aren't very
> forgiving, including a recommendation to delete all your data!
>
> Is there a straightforward way to set up the additional zone for a sync
> module with a preexisting bucket?  Whether it's the elasticsearch metadata
> search or the cloud replication, setting up sync modules on your *current*
> buckets must be a FAQ or at least frequently desired option.
>
>
Amen to "docs are very much lacking".

I tried to make buckets end up on different pools (== diff. disks) and had
to fumble around the docs to (try to) understand at which level to make the
split that makes some data end up at X and some at Y. The docs seemed to
me to be focused just on syncing, and not on "place client X's data here"
or "let client choose location" or something which would suit me. When we
upgraded from Hammer/Infernalis to Jewel we just made Zonegroups and stuff
to make S3 work at all, but no docs really tell me what the new separation
levels are for so that I may edit the right place.

The blogs that did somewhat point to my use case were from when it still had
realms. 8-(


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-04-27 Thread Patrick Donnelly
On Thu, Apr 26, 2018 at 7:04 PM, Scottix  wrote:
> Ok let me try to explain this better, we are doing this back and forth and
> its not going anywhere. I'll just be as genuine as I can and explain the
> issue.
>
> What we are testing is a critical failure scenario and actually more of a
> real world scenario. Basically just what happens when it is 1AM and the shit
> hits the fan, half of your servers are down and 1 of the 3 MDS boxes are
> still alive.
> There is one very important fact that happens with CephFS and when the
> single Active MDS server fails. It is guaranteed 100% all IO is blocked. No
> split-brain, no corrupted data, 100% guaranteed ever since we started using
> CephFS
>
>
> Now with multi_mds, I understand this changes the logic and I understand how
> difficult and how hard this problem is, trust me I would not be able to
> tackle this. Basically I need to answer the question; what happens when 1 of
> 2 multi_mds fails with no standbys ready to come save them?
> What I have tested is not the same of a single active MDS; this absolutely
> changes the logic of what happens and how we troubleshoot. The CephFS is
> still alive and it does allow operations and does allow resources to go
> through. How, why and what is affected are very relevant questions if this
> is what the failure looks like since it is not 100% blocking.

Okay, so now I understand what your real question is: what is the state
of CephFS when one or more ranks have failed but no standbys exist to
take over? The answer is that there may be partial availability from
the up:active ranks, which may hand out capabilities for the subtrees
they manage, or no availability if that's not possible because they
cannot obtain the necessary locks.  No metadata is lost. No
inconsistency is created between clients. Full availability will be
restored when the lost ranks come back online.
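
To see which ranks are up, which have failed, and whether any standbys are
available, something like the following should work on Luminous (the output
format in the comment is approximate):

ceph fs status          # per-rank states plus the list of standby daemons
ceph mds stat           # compact view, roughly: cephfs-2/2/2 up {0=a=up:active,1=b=up:active}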

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent metadata seen by CephFS-fuse clients

2018-04-27 Thread Oliver Freyermuth
Dear Yan Zheng,

On 27.04.2018 at 15:32, Yan, Zheng wrote:
> On Fri, Apr 27, 2018 at 7:10 PM, Oliver Freyermuth
>  wrote:
>> Dear Yan Zheng,
>>
>> Am 27.04.2018 um 02:58 schrieb Yan, Zheng:
>>> On Thu, Apr 26, 2018 at 10:00 PM, Oliver Freyermuth
>>>  wrote:
 Dear Cephalopodians,

 just now that our Ceph cluster is under high I/O load, we get user reports 
 of files not being seen on some clients,
 but somehow showing up after forcing a stat() syscall.

 For example, one user had added several files to a directory via an NFS 
 client attached to nfs-ganesha (which uses libcephfs),
 and afterwards, all other nfs-ganesha servers saw it, and 44 of our 
 Fuse-clients -
 but one single client still saw the old contents of the directory, i.e. 
 the files seemed missing(!).
 This happened both when using "ls" on the directory or when trying to 
 access the non-existent files directly.

 I could confirm this observation also in a fresh login shell on the 
 machine.

 Then, on the "broken" client, I entered in the directory which seemed to 
 contain only the "old" content, and I created a new file in there.
 This worked fine, and all other clients saw the file immediately.
 Also on the broken client, metadata was now updated and all other files 
 appeared - i.e. everything was "in sync" again.

 There's nothing in the ceph-logs of our MDS, or in the syslogs of the 
 client machine / MDS.


 Another user observed the same, but not explicitly limited to one machine 
 (it seems random).
 He now uses a "stat" on the file he expects to exist (but which is not 
 seen with "ls").
 The stat returns "No such file", a subsequent "ls" then however lists the 
 file, and it can be accessed normally.

 This feels like something is messed up concerning the client caps - these 
 are all 12.2.4 Fuse clients.

 Any ideas how to find the cause?
 It only happens since recently, and under high I/O load with many metadata 
 operations.

>>>
>>> Sounds like bug in readdir cache. Could you try the attached patch.
>>
>> Many thanks for the quick response and patch!
>> The problem is to try it out. We only observe this issue on our production 
>> cluster, randomly, especially during high load, and only after is has been 
>> running for a few days.
>> We don't have a test Ceph cluster available of similar size and with similar 
>> load. I would not like to try out the patch on our production system.
>>
>> Can you extrapolate from the bugfix / patch what's the minimal setup needed 
>> to reproduce / trigger the issue?
>> Then we may look into setting up a minimal test setup to check whether the 
>> issue is resolved.
>>
>> All the best and many thanks,
>> Oliver
>>
> 
> I think this is libcephfs version of
> http://tracker.ceph.com/issues/20467. I forgot to write patch for
> libcephfs, Sorry. To reproduce this,  write a program that call
> getdents(2) in a loop. Add artificially delay to the loop, make the
> program iterates whole directory in about ten seconds. Run several
> instance of the program simultaneously on a large directory. Also make
> client_cache_size a little smaller than the size of directory.

This is strange - in case 1 where our users observed the issue,
the affected directory contained exactly 1 file, which some clients saw and 
others did not. 
In case 2, the affected directory contained about 5 files only. 

Of course, we also have directories with many (thousands) of files in our 
CephFS, and they may be accessed in parallel. 
Also, we run a massive number of parallel programs (about 2000) accessing the 
FS via about 40 clients. 

1. Could this still be the same issue? 
2. Many thanks for the repro instructions. It seems, however, this would 
require quite an amount of time,
   since we don't have a separate "test" instance at hand (yet) and are not 
experts in the field. 
   We could try, but it won't be fast... And maybe it's nicer to have something 
like this in the test suite, if possible. 

Potentially, it's even faster to get the fix in the next patch release, if it's 
clear this cannot have bad side effects. 

Also, should we transfer this information to a ticket? 

Cheers and many thanks,
Oliver

> 
> Regards
> Yan, Zheng
> 
>>
>>>
>>> Regards
>>> Yan, Zheng
>>>
>>>
 Cheers,
 Oliver


>>>
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] The mystery of sync modules

2018-04-27 Thread Sean Purdy
Hi,


Mimic has a new feature, a cloud sync module for radosgw to sync objects to 
some other S3-compatible destination.

This would be a lovely thing to have here, and ties in nicely with object 
versioning and DR.  But I am put off by confusion and complexity with the whole 
multisite/realm/zone group/zone thing, and the docs aren't very forgiving, 
including a recommendation to delete all your data!

Is there a straightforward way to set up the additional zone for a sync module 
with a preexisting bucket?  Whether it's the elasticsearch metadata search or 
the cloud replication, setting up sync modules on your *current* buckets must 
be a FAQ or at least frequently desired option.

Do I need a top-level realm?  I'm not actually using multisite for two 
clusters, I just want to use sync modules.  If I do, how do I transition my 
current default realm and RGW buckets?

Any blog posts to recommend?

It's not a huge cluster, but it does include production data.


Thanks,

Sean
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 12.2.5 - atop DB/WAL SSD usage 0%

2018-04-27 Thread Alan Johnson
Could we infer from this if the usage model is large object sizes  rather than 
small I/Os the benefit of offloading WAL/DB is questionable given that the 
failure of the SSD (assuming shared amongst HDDs) could take down a number of 
OSDs and in this case a best practice would be to collocate?

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Serkan 
Çoban
Sent: Friday, April 27, 2018 10:05 AM
To: Steven Vacaroaia 
Cc: ceph-users 
Subject: Re: [ceph-users] ceph 12.2.5 - atop DB/WAL SSD usage 0%

rados bench is using 4MB block size for io. Try with with io size 4KB, you will 
see ssd will be used for write operations.

On Fri, Apr 27, 2018 at 4:54 PM, Steven Vacaroaia  wrote:
> Hi
>
> During rados bench tests, I noticed that HDD usage goes to 100% but 
> SSD stays at ( or very close to 0)
>
> Since I created OSD with BLOCK/WAL on SSD, shouldnt  I see some "activity'
> on SSD ?
>
> How can I be sure CEPH is actually using SSD for WAL /DB ?
>
>
> Note
> I only have 2 HDD and one SSD per server for now
>
>
> Comands used
>
> rados bench -p rbd 50 write -t 32 --no-cleanup && rados bench -p rbd 
> -t 32
> 50 rand
>
>
> /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore --data 
> /dev/sdc --block.wal 
> /dev/disk/by-partuuid/32ffde6f-7249-40b9-9bc5-2b70f0c3f7ad
> --block.db /dev/disk/by-partuuid/2d9ab913-7553-46fc-8f96-5ffee028098a
>
> ( partitions are on SSD ...see below)
>
>  sgdisk -p /dev/sda
> Disk /dev/sda: 780140544 sectors, 372.0 GiB Logical sector size: 512 
> bytes Disk identifier (GUID): 5FE0EA74-7E65-45B8-A356-62240333491E
> Partition table holds up to 128 entries First usable sector is 34, 
> last usable sector is 780140510 Partitions will be aligned on 
> 2048-sector boundaries Total free space is 520093629 sectors (248.0 
> GiB)
>
> Number  Start (sector)End (sector)  Size   Code  Name
>1   251660288   253757439   1024.0 MiB    ceph WAL
>2204862916607   30.0 GiB  ceph DB
>3   253757440   255854591   1024.0 MiB    ceph WAL
>462916608   125831167   30.0 GiB  ceph DB
>5   255854592   257951743   1024.0 MiB    ceph WAL
>6   125831168   188745727   30.0 GiB  ceph DB
>7   257951744   260048895   1024.0 MiB    ceph WAL
>8   188745728   251660287   30.0 GiB  ceph DB
> [root@osd04 ~]# ls -al /dev/disk/by-partuuid/ total 0 drwxr-xr-x 2 
> root root 200 Apr 26 15:39 .
> drwxr-xr-x 8 root root 160 Apr 27 08:45 ..
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 
> 0baf986d-f786-4c1a-8962-834743b33e3a
> -> ../../sda8
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 
> 2d9ab913-7553-46fc-8f96-5ffee028098a
> -> ../../sda2
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 
> 32ffde6f-7249-40b9-9bc5-2b70f0c3f7ad
> -> ../../sda3
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 
> 3f4e2d47-d553-4809-9d4e-06ba37b4c384
> -> ../../sda6
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 
> 3fc98512-a92e-4e3b-9de7-556b8e206786
> -> ../../sda1
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 
> 64b8ae66-cf37-4676-bf9f-9c4894788a7f
> -> ../../sda7
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 
> 96254af9-7fe4-4ce0-886e-2e25356eff81
> -> ../../sda5
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 
> ae616b82-35ab-4f7f-9e6f-3c65326d76a8
> -> ../../sda4
>
>
>
>
>
>
>  dm-0 |  busy 90% |  | read2516  | write  0 |
> |  KiB/r512 | KiB/w  0 |   | MBr/s  125.8 | MBw/s0.0
> |   | avq10.65 | avio 3.57 ms  |  |
> LVM | dm-1 |  busy 80% |  | read2406  | write
> 0 |  |  KiB/r512 | KiB/w  0 |   | MBr/s
> 120.3 | MBw/s0.0 |   | avq12.59 | avio 3.30 ms  |
> |
> DSK |  sdc |  busy 90% |  | read5044  | write
> 0 |  |  KiB/r256 | KiB/w  0 |   | MBr/s
> 126.1 | MBw/s0.0 |   | avq19.53 | avio 1.78 ms  |
> |
> DSK |  sdd |  busy 80% |  | read4805  | write
> 0 |  |  KiB/r256 | KiB/w  0 |   | MBr/s
> 120.1 | MBw/s0.0 |   | avq23.97 | avio 1.65 ms  |
> |
> DSK |  sda |  busy  0% |  | read   0  | write
> 7 |  |  KiB/r  0 | KiB/w 10 |   | MBr/s
> 0.0 | MBw/s0.0 |   | avq 0.00 | avio 0.00 ms  |
> |
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.ceph.com_lis
> tinfo.cgi_ceph-2Dusers-2Dceph.com=DwICAg=4DxX-JX0i28X6V65hK0ft5M-1
> rZQeWgdMry9v8-eNr4=eqMv5yFFe6-lAM9jJfUusNFzzcFAGwmoAez_acfPOtw=Gkb
> AzUQpHU6F0PQW4cXglhdQN00DLmI75Ge2zPFqeeQ=R5UDTadunkDZPcYZfMoWS_0Vead
> oXB5jfcy-FKfJYPM=
>

Re: [ceph-users] ceph-mgr not able to modify max_misplaced in 12.2.4

2018-04-27 Thread John Spray
On Fri, Apr 27, 2018 at 7:03 AM, nokia ceph  wrote:
> Hi Team,
>
> I was trying to modify the max_misplaced parameter in 12.2.4 as per
> documentation , however not able to modify it with following error,
>
> #ceph config set mgr mgr/balancer/max_misplaced .06
> Invalid command:  unused arguments: [u'.06']
> config set   :  Set a configuration option at runtime (not
> persistent)
> Error EINVAL: invalid command

Oops - the docs were added recently for the master branch, and there
isn't a luminous version online.  I suspect you won't be the last
person to be caught out by this, so I've created a backport of the
luminous-era commands here that will pop up on
docs.ceph.com/docs/luminous when it's merged --
https://github.com/ceph/ceph/pull/21699/files

Anyway: the command in 12.x is "ceph config-key set
mgr/balancer/max_misplaced ..."
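
So, concretely, something along these lines (the 0.06 value just mirrors the
original attempt):

ceph config-key set mgr/balancer/max_misplaced 0.06
ceph config-key get mgr/balancer/max_misplaced   # verify it stuck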

> Also, where I can find the balancer module configuration file , not
> available in /var/lib/ceph/mgr

ceph-mgr module config is not stored in local files -- it is
stored inside the monitors and accessed with commands.

The module config in mimic is mostly unified with the main ceph store
of configuration, so setting module config via ceph.conf may become
possible soon, but commands are always preferable because they give us
a chance to validate the values on the way in and give feedback.

John

>
> cn6.chn6m1c1ru1c1.cdn ~# cd /var/lib/ceph/mgr/
> cn6.chn6m1c1ru1c1.cdn /var/lib/ceph/mgr# ls
> cn6.chn6m1c1ru1c1.cdn /var/lib/ceph/mgr#
>
> Thanks,
> Muthu
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 12.2.5 - atop DB/WAL SSD usage 0%

2018-04-27 Thread Serkan Çoban
rados bench uses a 4MB block size for I/O. Try with an I/O size of 4KB and
you will see the SSD being used for write operations.
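
For example (the -b flag sets the per-op size; 4 KB here to generate small writes
that go through the WAL/DB):

rados bench -p rbd 50 write -b 4096 -t 32 --no-cleanup

Another hedged way to confirm which devices an OSD is really using for its DB/WAL
(osd.0 is just an example id):

ceph osd metadata 0 | grep -E 'bluefs|partition_path'
# bluefs_db_partition_path / bluefs_wal_partition_path should point at the SSD
# partitions, bluestore_bdev_partition_path at the HDD
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db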

On Fri, Apr 27, 2018 at 4:54 PM, Steven Vacaroaia  wrote:
> Hi
>
> During rados bench tests, I noticed that HDD usage goes to 100% but SSD
> stays at ( or very close to 0)
>
> Since I created OSD with BLOCK/WAL on SSD, shouldnt  I see some "activity'
> on SSD ?
>
> How can I be sure CEPH is actually using SSD for WAL /DB ?
>
>
> Note
> I only have 2 HDD and one SSD per server for now
>
>
> Comands used
>
> rados bench -p rbd 50 write -t 32 --no-cleanup && rados bench -p rbd -t 32
> 50 rand
>
>
> /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore --data /dev/sdc
> --block.wal /dev/disk/by-partuuid/32ffde6f-7249-40b9-9bc5-2b70f0c3f7ad
> --block.db /dev/disk/by-partuuid/2d9ab913-7553-46fc-8f96-5ffee028098a
>
> ( partitions are on SSD ...see below)
>
>  sgdisk -p /dev/sda
> Disk /dev/sda: 780140544 sectors, 372.0 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 5FE0EA74-7E65-45B8-A356-62240333491E
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 780140510
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 520093629 sectors (248.0 GiB)
>
> Number  Start (sector)End (sector)  Size   Code  Name
>1   251660288   253757439   1024.0 MiB    ceph WAL
>2204862916607   30.0 GiB  ceph DB
>3   253757440   255854591   1024.0 MiB    ceph WAL
>462916608   125831167   30.0 GiB  ceph DB
>5   255854592   257951743   1024.0 MiB    ceph WAL
>6   125831168   188745727   30.0 GiB  ceph DB
>7   257951744   260048895   1024.0 MiB    ceph WAL
>8   188745728   251660287   30.0 GiB  ceph DB
> [root@osd04 ~]# ls -al /dev/disk/by-partuuid/
> total 0
> drwxr-xr-x 2 root root 200 Apr 26 15:39 .
> drwxr-xr-x 8 root root 160 Apr 27 08:45 ..
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 0baf986d-f786-4c1a-8962-834743b33e3a
> -> ../../sda8
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 2d9ab913-7553-46fc-8f96-5ffee028098a
> -> ../../sda2
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 32ffde6f-7249-40b9-9bc5-2b70f0c3f7ad
> -> ../../sda3
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 3f4e2d47-d553-4809-9d4e-06ba37b4c384
> -> ../../sda6
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 3fc98512-a92e-4e3b-9de7-556b8e206786
> -> ../../sda1
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 64b8ae66-cf37-4676-bf9f-9c4894788a7f
> -> ../../sda7
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 96254af9-7fe4-4ce0-886e-2e25356eff81
> -> ../../sda5
> lrwxrwxrwx 1 root root  10 Apr 27 09:38 ae616b82-35ab-4f7f-9e6f-3c65326d76a8
> -> ../../sda4
>
>
>
>
>
>
>  dm-0 |  busy 90% |  | read2516  | write  0 |
> |  KiB/r512 | KiB/w  0 |   | MBr/s  125.8 | MBw/s0.0
> |   | avq10.65 | avio 3.57 ms  |  |
> LVM | dm-1 |  busy 80% |  | read2406  | write
> 0 |  |  KiB/r512 | KiB/w  0 |   | MBr/s
> 120.3 | MBw/s0.0 |   | avq12.59 | avio 3.30 ms  |
> |
> DSK |  sdc |  busy 90% |  | read5044  | write
> 0 |  |  KiB/r256 | KiB/w  0 |   | MBr/s
> 126.1 | MBw/s0.0 |   | avq19.53 | avio 1.78 ms  |
> |
> DSK |  sdd |  busy 80% |  | read4805  | write
> 0 |  |  KiB/r256 | KiB/w  0 |   | MBr/s
> 120.1 | MBw/s0.0 |   | avq23.97 | avio 1.65 ms  |
> |
> DSK |  sda |  busy  0% |  | read   0  | write
> 7 |  |  KiB/r  0 | KiB/w 10 |   | MBr/s
> 0.0 | MBw/s0.0 |   | avq 0.00 | avio 0.00 ms  |
> |
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph 12.2.5 - atop DB/WAL SSD usage 0%

2018-04-27 Thread Steven Vacaroaia
Hi

During rados bench tests, I noticed that HDD usage goes to 100% but SSD
usage stays at (or very close to) 0.

Since I created the OSD with block DB/WAL on SSD, shouldn't I see some "activity"
on the SSD?

How can I be sure Ceph is actually using the SSD for the WAL/DB?


Note
I only have 2 HDD and one SSD per server for now


Commands used

rados bench -p rbd 50 write -t 32 --no-cleanup && rados bench -p rbd -t 32
50 rand


/usr/sbin/ceph-volume --cluster ceph lvm create --bluestore --data /dev/sdc
--block.wal /dev/disk/by-partuuid/32ffde6f-7249-40b9-9bc5-2b70f0c3f7ad
--block.db /dev/disk/by-partuuid/2d9ab913-7553-46fc-8f96-5ffee028098a

( partitions are on SSD ...see below)

 sgdisk -p /dev/sda
Disk /dev/sda: 780140544 sectors, 372.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 5FE0EA74-7E65-45B8-A356-62240333491E
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 780140510
Partitions will be aligned on 2048-sector boundaries
Total free space is 520093629 sectors (248.0 GiB)

Number  Start (sector)End (sector)  Size   Code  Name
   1   251660288   253757439   1024.0 MiB    ceph WAL
   2204862916607   30.0 GiB  ceph DB
   3   253757440   255854591   1024.0 MiB    ceph WAL
   462916608   125831167   30.0 GiB  ceph DB
   5   255854592   257951743   1024.0 MiB    ceph WAL
   6   125831168   188745727   30.0 GiB  ceph DB
   7   257951744   260048895   1024.0 MiB    ceph WAL
   8   188745728   251660287   30.0 GiB  ceph DB
[root@osd04 ~]# ls -al /dev/disk/by-partuuid/
total 0
drwxr-xr-x 2 root root 200 Apr 26 15:39 .
drwxr-xr-x 8 root root 160 Apr 27 08:45 ..
lrwxrwxrwx 1 root root  10 Apr 27 09:38
0baf986d-f786-4c1a-8962-834743b33e3a -> ../../sda8
lrwxrwxrwx 1 root root  10 Apr 27 09:38
2d9ab913-7553-46fc-8f96-5ffee028098a -> ../../sda2
lrwxrwxrwx 1 root root  10 Apr 27 09:38
32ffde6f-7249-40b9-9bc5-2b70f0c3f7ad -> ../../sda3
lrwxrwxrwx 1 root root  10 Apr 27 09:38
3f4e2d47-d553-4809-9d4e-06ba37b4c384 -> ../../sda6
lrwxrwxrwx 1 root root  10 Apr 27 09:38
3fc98512-a92e-4e3b-9de7-556b8e206786 -> ../../sda1
lrwxrwxrwx 1 root root  10 Apr 27 09:38
64b8ae66-cf37-4676-bf9f-9c4894788a7f -> ../../sda7
lrwxrwxrwx 1 root root  10 Apr 27 09:38
96254af9-7fe4-4ce0-886e-2e25356eff81 -> ../../sda5
lrwxrwxrwx 1 root root  10 Apr 27 09:38
ae616b82-35ab-4f7f-9e6f-3c65326d76a8 -> ../../sda4






 dm-0 |  busy 90% |  | read2516  | write  0 |
|  KiB/r512 | KiB/w  0 |   | MBr/s  125.8 |
MBw/s0.0 |   | avq10.65 | avio 3.57 ms  |  |
LVM | dm-1 |  busy 80% |  | read2406  | write
0 |  |  KiB/r512 | KiB/w  0 |   |
MBr/s  120.3 | MBw/s0.0 |   | avq12.59 | avio 3.30 ms
|  |
DSK |  sdc |  busy 90% |  | read5044  | write
0 |  |  KiB/r256 | KiB/w  0 |   |
MBr/s  126.1 | MBw/s0.0 |   | avq19.53 | avio 1.78 ms
|  |
DSK |  sdd |  busy 80% |  | read4805  | write
0 |  |  KiB/r256 | KiB/w  0 |   |
MBr/s  120.1 | MBw/s0.0 |   | avq23.97 | avio 1.65 ms
|  |
DSK |  sda |  busy  0% |  | read   0  | write
7 |  |  KiB/r  0 | KiB/w 10 |   |
MBr/s0.0 | MBw/s0.0 |   | avq 0.00 | avio 0.00 ms
|  |
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] *** SPAM *** Re: Multi-MDS Failover

2018-04-27 Thread Scottix
Hey Dan,

Thank you for the response; the namespace methodology makes more sense and
I think that explains what would be up or not.

Regarding number 4 in my original email, about listing 0 files: I will
try to recreate it with debug on and submit an issue if that turns out to be a
bug.

I am sorry if I have offended anyone with my attitude, I am just trying to
get information and understand what is going on. I want Ceph and CephFS to
be the best out there.

Thank you all

On Fri, Apr 27, 2018 at 12:14 AM Dan van der Ster 
wrote:

> Hi Scott,
>
> Multi MDS just assigns different parts of the namespace to different
> "ranks". Each rank (0, 1, 2, ...) is handled by one of the active
> MDSs. (You can query which parts of the name space are assigned to
> each rank using the jq tricks in [1]). If a rank is down and there are
> no more standby's, then you need to bring up a new MDS to handle that
> down rank. In the meantime, part of the namespace will have IO
> blocked.
>
> To handle these failures, you need to configure sufficient standby
> MDSs to handle the failure scenarios you foresee in your environment.
> A strictly "standby" MDS can takeover from *any* of the failed ranks,
> and you can have several "standby" MDSs to cover multiple failures. So
> just run 2 or 3 standby's if you want to be on the safe side.
>
> You can also configure "standby-for-rank" MDSs -- that is, a given
> standby MDS can be watching a specific rank then taking over it that
> specific MDS fails. Those standby-for-rank MDS's can even be "hot"
> standby's to speed up the failover process.
>
> An active MDS for a given rank does not act as a standby for the other
> ranks. I'm not sure if it *could* following some code changes, but
> anyway that just not how it works today.
>
> Does that clarify things?
>
> Cheers, Dan
>
> [1] https://ceph.com/community/new-luminous-cephfs-subtree-pinning/
>
>
> On Fri, Apr 27, 2018 at 4:04 AM, Scottix  wrote:
> > Ok let me try to explain this better, we are doing this back and forth
> and
> > its not going anywhere. I'll just be as genuine as I can and explain the
> > issue.
> >
> > What we are testing is a critical failure scenario and actually more of a
> > real world scenario. Basically just what happens when it is 1AM and the
> shit
> > hits the fan, half of your servers are down and 1 of the 3 MDS boxes are
> > still alive.
> > There is one very important fact that happens with CephFS and when the
> > single Active MDS server fails. It is guaranteed 100% all IO is blocked.
> No
> > split-brain, no corrupted data, 100% guaranteed ever since we started
> using
> > CephFS
> >
> > Now with multi_mds, I understand this changes the logic and I understand
> how
> > difficult and how hard this problem is, trust me I would not be able to
> > tackle this. Basically I need to answer the question; what happens when
> 1 of
> > 2 multi_mds fails with no standbys ready to come save them?
> > What I have tested is not the same of a single active MDS; this
> absolutely
> > changes the logic of what happens and how we troubleshoot. The CephFS is
> > still alive and it does allow operations and does allow resources to go
> > through. How, why and what is affected are very relevant questions if
> this
> > is what the failure looks like since it is not 100% blocking.
> >
> > This is the problem, I have programs writing a massive amount of data
> and I
> > don't want it corrupted or lost. I need to know what happens and I need
> to
> > have guarantees.
> >
> > Best
> >
> >
> > On Thu, Apr 26, 2018 at 5:03 PM Patrick Donnelly 
> > wrote:
> >>
> >> On Thu, Apr 26, 2018 at 4:40 PM, Scottix  wrote:
> >> >> Of course -- the mons can't tell the difference!
> >> > That is really unfortunate, it would be nice to know if the filesystem
> >> > has
> >> > been degraded and to what degree.
> >>
> >> If a rank is laggy/crashed, the file system as a whole is generally
> >> unavailable. The span between partial outage and full is small and not
> >> worth quantifying.
> >>
> >> >> You must have standbys for high availability. This is the docs.
> >> > Ok but what if you have your standby go down and a master go down.
> This
> >> > could happen in the real world and is a valid error scenario.
> >> >Also there is
> >> > a period between when the standby becomes active what happens
> in-between
> >> > that time?
> >>
> >> The standby MDS goes through a series of states where it recovers the
> >> lost state and connections with clients. Finally, it goes active.
> >>
> >> >> It depends(tm) on how the metadata is distributed and what locks are
> >> > held by each MDS.
> >> > Your saying depending on which mds had a lock on a resource it will
> >> > block
> >> > that particular POSIX operation? Can you clarify a little bit?
> >> >
> >> >> Standbys are not optional in any production cluster.
> >> > Of course in production I would hope people have standbys but in

Re: [ceph-users] Inconsistent metadata seen by CephFS-fuse clients

2018-04-27 Thread Oliver Freyermuth
Dear Yan Zheng,

On 27.04.2018 at 02:58, Yan, Zheng wrote:
> On Thu, Apr 26, 2018 at 10:00 PM, Oliver Freyermuth
>  wrote:
>> Dear Cephalopodians,
>>
>> just now that our Ceph cluster is under high I/O load, we get user reports 
>> of files not being seen on some clients,
>> but somehow showing up after forcing a stat() syscall.
>>
>> For example, one user had added several files to a directory via an NFS 
>> client attached to nfs-ganesha (which uses libcephfs),
>> and afterwards, all other nfs-ganesha servers saw it, and 44 of our 
>> Fuse-clients -
>> but one single client still saw the old contents of the directory, i.e. the 
>> files seemed missing(!).
>> This happened both when using "ls" on the directory or when trying to access 
>> the non-existent files directly.
>>
>> I could confirm this observation also in a fresh login shell on the machine.
>>
>> Then, on the "broken" client, I entered in the directory which seemed to 
>> contain only the "old" content, and I created a new file in there.
>> This worked fine, and all other clients saw the file immediately.
>> Also on the broken client, metadata was now updated and all other files 
>> appeared - i.e. everything was "in sync" again.
>>
>> There's nothing in the ceph-logs of our MDS, or in the syslogs of the client 
>> machine / MDS.
>>
>>
>> Another user observed the same, but not explicitly limited to one machine 
>> (it seems random).
>> He now uses a "stat" on the file he expects to exist (but which is not seen 
>> with "ls").
>> The stat returns "No such file", a subsequent "ls" then however lists the 
>> file, and it can be accessed normally.
>>
>> This feels like something is messed up concerning the client caps - these 
>> are all 12.2.4 Fuse clients.
>>
>> Any ideas how to find the cause?
>> It only happens since recently, and under high I/O load with many metadata 
>> operations.
>>
> 
> Sounds like bug in readdir cache. Could you try the attached patch.

Many thanks for the quick response and patch! 
The problem is to try it out. We only observe this issue on our production 
cluster, randomly, especially during high load, and only after it has been 
running for a few days. 
We don't have a test Ceph cluster available of similar size and with similar 
load. I would not like to try out the patch on our production system. 

Can you extrapolate from the bugfix / patch what's the minimal setup needed to 
reproduce / trigger the issue? 
Then we may look into setting up a minimal test setup to check whether the 
issue is resolved. 

All the best and many thanks,
Oliver


> 
> Regards
> Yan, Zheng
> 
> 
>> Cheers,
>> Oliver
>>
>>
> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster can't remapped objects after change crush tree

2018-04-27 Thread Igor Gajsin
Thanks a lot for your help.

Konstantin Shalygin writes:

> On 04/27/2018 05:05 PM, Igor Gajsin wrote:
>> I have a crush rule like
>
>
> You still can use device classes!
>
>
>> * host0 has a piece of data on osd.0
> Not a piece, a full object. If we talk about non-EC pools.
>> * host1 has pieces of data on osd.1 and osd.2
> host1 has copy on osd.1 *or* osd.2
>> * host2 has no data
> host2 will also have one copy of the object.
>
> Also do not forget - hosts with half the OSDs of host1 (i.e. host0 and
> host2) will do "double work" in comparison.
> You can minimize this impact via decreasing osd crush weights for host1.
>
>
>
>
>
> k


--
With best regards,
Igor Gajsin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster can't remapped objects after change crush tree

2018-04-27 Thread Konstantin Shalygin

On 04/27/2018 05:05 PM, Igor Gajsin wrote:

I have a crush rule like



You still can use device classes!



* host0 has a piece of data on osd.0

Not a piece, a full object. If we talk about non-EC pools.

* host1 has pieces of data on osd.1 and osd.2

host1 has copy on osd.1 *or* osd.2

* host2 has no data

host2 will also have one copy of the object.

Also do not forget - hosts with half the OSDs of host1 (i.e. host0 and 
host2) will do "double work" in comparison.

You can minimize this impact via decreasing osd crush weights for host1.





k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster can't remapped objects after change crush tree

2018-04-27 Thread Igor Gajsin
Thanks, man. Thanks a lot. Now I understand. So, to be sure: if I have 3 hosts,
the replication factor is also 3, and I have a crush rule like:
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

My data is replicated across hosts, not across osds, all hosts have
pieces of data and a situation like:

* host0 has a piece of data on osd.0
* host1 has pieces of data on osd.1 and osd.2
* host2 has no data

is completely excluded?
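
(I guess I can double-check that myself on a test object, e.g. with something
like the following, where 'some-test-object' is just a made-up name:)

ceph osd map rbd some-test-object   # shows the PG and its acting set of OSDs
ceph osd tree                       # shows which host each of those OSDs sits under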

Konstantin Shalygin writes:

> On 04/27/2018 04:37 PM, Igor Gajsin wrote:
>> pool 7 'rbd' replicated size 3 min_size 2 crush_rule 0
>
>
> Your pool has the proper size setting - it is 3. But your crush tree has only 2
> buckets for this rule (i.e. your pods).
> For making this rule work you should have a minimum of 3 'pod' buckets.
>
>
>
>
> k


--
With best regards,
Igor Gajsin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster can't remapped objects after change crush tree

2018-04-27 Thread Konstantin Shalygin

On 04/27/2018 04:37 PM, Igor Gajsin wrote:

pool 7 'rbd' replicated size 3 min_size 2 crush_rule 0



Your pool has the proper size setting - it is 3. But your crush tree has only 2 
buckets for this rule (i.e. your pods).

For making this rule work you should have a minimum of 3 'pod' buckets.
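
For example (the 'pod2' and 'host2' names below are only placeholders, adjust
to your crush map), a third pod bucket can be created and a host moved under it:

ceph osd crush add-bucket pod2 pod
ceph osd crush move pod2 root=default
ceph osd crush move host2 root=default pod=pod2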




k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster can't remapped objects after change crush tree

2018-04-27 Thread Igor Gajsin
# ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 last_change 958 lfor 0/909 flags hashpspool 
stripe_width 0 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 954 flags hashpspool stripe_width 0 
application cephfs
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 22 owner 18446744073709551615 flags 
hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 8 pgp_num 8 last_change 24 owner 
18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 26 owner 18446744073709551615 flags 
hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 28 owner 18446744073709551615 flags 
hashpspool stripe_width 0 application rgw
pool 7 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1161 flags hashpspool stripe_width 0 
application rbd
removed_snaps [1~3]
pool 8 'kube' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins 
pg_num 128 pgp_num 128 last_change 1241 lfor 0/537 flags hashpspool 
stripe_width 0 application cephfs
removed_snaps [1~5,7~2]

crush rule 3 is
ceph osd crush rule dump podshdd
{
    "rule_id": 3,
    "rule_name": "podshdd",
    "ruleset": 3,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "pod"
        },
        {
            "op": "emit"
        }
    ]
}

Konstantin Shalygin writes:

> On 04/26/2018 11:30 PM, Igor Gajsin wrote:
>> after assigning this rule to a pool it stucks in the same state:
>
>
> `ceph osd pool ls detail` please
>
>
>
>
> k


--
With best regards,
Igor Gajsin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to deploy ceph with spdk step by step?

2018-04-27 Thread Yang, Liang
Hi Nathan Cutler, Orlando Moreno, Loic Dachary and Sage Weil,

I am trying to enable SPDK on Ceph, but I failed. My steps are listed below. 
Could you help check whether all the steps are right, and help me enable SPDK 
on Ceph? I know it's very rude, but I need your help. The Ceph version is 
13.0.2. Thank you very much.

First step: I have run src/spdk/scripts/setup.sh as below:

[root@ceph-rep-05 ceph-ansible]# ../ceph/src/spdk/scripts/setup.sh
0005:01:00.0 (1179 010e): nvme -> vfio-pci

Second step: the ceph.conf section for the OSD is:
[osd]
bluestore = true
[osd.0]
host = ceph-rep-05
osd data = /var/lib/ceph/osd/ceph-0/
bluestore_block_path = spdk:55cd2e404c7e1063


Third step:
ceph osd create
mkdir /var/lib/ceph/osd/ceph-0/
chown ceph:ceph /var/lib/ceph/osd/ceph-0/
ceph-osd -i 0 --mkfs --osd-data=/var/lib/ceph/osd/ceph-0 -c /etc/ceph/ceph.conf 
--debug_osd 20 -mkkey
ceph-osd -i 0


[root@ceph-rep-05 ceph-ansible-0417]# ceph-osd -i 0 --mkfs 
--osd-data=/var/lib/ceph/osd/ceph-0 -c /etc/ceph/ceph.conf --debug_osd 20
2018-04-27 17:14:24.674 9b5a -1 journal FileJournal::_open: disabling 
aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2018-04-27 17:14:24.804 9b5a -1 journal FileJournal::_open: disabling 
aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2018-04-27 17:14:24.804 9b5a -1 journal do_read_entry(4096): bad header 
magic
2018-04-27 17:14:24.804 9b5a -1 journal do_read_entry(4096): bad header 
magic
[root@ceph-rep-05 ceph-ansible -0417]# ceph-osd -i 0
starting osd.0 at - osd_data /var/lib/ceph/osd/ceph-0/ 
/var/lib/ceph/osd/ceph-0/journal
2018-04-27 17:14:44.852 83b2 -1 journal FileJournal::_open: disabling 
aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2018-04-27 17:14:44.852 83b2 -1 journal do_read_entry(8192): bad header 
magic
2018-04-27 17:14:44.852 83b2 -1 journal do_read_entry(8192): bad header 
magic
2018-04-27 17:14:44.872 83b2 -1 osd.0 0 log_to_monitors {default=true}

Last step:
[root@ceph-rep-05 ceph-ansible-0417]# ceph -s
  cluster:
id: e05d6376-6965-4c48-9b36-b8f5c518e3b9
health: HEALTH_WARN
Reduced data availability: 256 pgs inactive
too many PGs per OSD (256 > max 200)

  services:
mon: 1 daemons, quorum ceph-rep-05
mgr: ceph-rep-05(active)
osd: 1 osds: 1 up, 1 in

  data:
pools:   3 pools, 256 pgs
objects: 0 objects, 0
usage:   0 used, 0 / 0 avail
pgs: 100.000% pgs unknown
 256 unknown

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Collecting BlueStore per Object DB overhead

2018-04-27 Thread Konstantin Shalygin

I've written a piece of Python code which can be run on a server
running OSDs and will print the overhead.

https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f

Feedback on this script is welcome, but also the output of what people
are observing.



For a mixed (filestore / bluestore) OSD host (ignores non-BlueStore OSDs):


--- ceph-bluestore-overhead.py_orig 2018-04-27 16:28:42.063312979 +0700
+++ ceph-bluestore-overhead.py  2018-04-27 16:28:01.035236995 +0700
@@ -37,10 +37,11 @@ def get_osd_perf():
 if __name__ == '__main__':
     perf = get_osd_perf()
     for osd_id, perf in perf.items():
-        onodes = perf['bluestore']['bluestore_onodes']
-        stat_bytes_used = perf['osd']['stat_bytes_used']
-        db_used_bytes = perf['bluefs']['db_used_bytes']
-        overhead = db_used_bytes / onodes
-        avg_obj_size = stat_bytes_used / onodes
-
-        print('osd.{0} onodes={1} db_used_bytes={2} avg_obj_size={3} overhead_per_obj={4}'.format(osd_id, onodes, db_used_bytes, avg_obj_size, overhead))
\ No newline at end of file
+        if 'bluestore' in perf:
+            onodes = perf['bluestore']['bluestore_onodes']
+            stat_bytes_used = perf['osd']['stat_bytes_used']
+            db_used_bytes = perf['bluefs']['db_used_bytes']
+            overhead = db_used_bytes / onodes
+            avg_obj_size = stat_bytes_used / onodes
+
+            print('osd.{0} onodes={1} db_used_bytes={2} avg_obj_size={3} overhead_per_obj={4}'.format(osd_id, onodes, db_used_bytes, avg_obj_size, overhead))
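
For a quick one-off check of a single OSD without the script, the same counters 
can be pulled by hand (a sketch, assuming jq is installed; these are the counter 
names the script reads, so the numbers should match):

ceph daemon osd.0 perf dump | jq \
  '{onodes: .bluestore.bluestore_onodes,
    db_used_bytes: .bluefs.db_used_bytes,
    overhead_per_obj: (.bluefs.db_used_bytes / .bluestore.bluestore_onodes)}'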





k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster can't remapped objects after change crush tree

2018-04-27 Thread Konstantin Shalygin

On 04/26/2018 11:30 PM, Igor Gajsin wrote:

after assigning this rule to a pool it stucks in the same state:



`ceph osd pool ls detail` please




k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd reweight (doing -1 or actually -0.0001)

2018-04-27 Thread Marc Roos
 
Thanks Paul for the explanation, sounds very logical now.




-Original Message-
From: Paul Emmerich [mailto:paul.emmer...@croit.io] 
Sent: woensdag 25 april 2018 20:28
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] ceph osd reweight (doing -1 or actually 
-0.0001)

Hi,



the reweight is internally a number between 0 and 0x10000 for the range 0 to 1.
0.8 is not representable in this number system.

Having an actual floating point number in there would be annoying 
because CRUSH needs to be 100% deterministic on all clients (also, no 
floating point in the kernel).


osd reweight apparently just echoes whatever you entered here if it 
can't be mapped to a whole number:

(Or it rounds differently, not sure and doesn't matter.)


reweight 0.8 --> 0.8 * 0x10000 = 52428.8; is cut off to 52428 == 0xcccc.
So it just prints out 0.8 but stores 0xcccc.


The logic to read 0xcccc back and convert it back to the 0 - 1 range is:
0xcccc / 0x10000 = 0.79998779296875, which is printed as 0.79999.
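
Just to spell the arithmetic out, a small sketch you can paste into any shell
(nothing ceph-specific here, 0x10000 == 65536):

awk 'BEGIN { v = int(0.8 * 65536); printf "%d = 0x%x\n", v, v }'   # -> 52428 = 0xcccc (truncated, not rounded)
awk 'BEGIN { printf "%.14f\n", 52428 / 65536 }'                    # -> 0.79998779296875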


Anyways, the important value is what is actually stored and that's 
0xcccc.

It could be argued that "osd reweight" should convert 0.8 to 0xcccd 
(0.8000030517578125), i.e. to round instead of cut off.



Paul


2018-04-25 12:33 GMT+02:00 Marc Roos :


 
Makes me also wonder what is actually being used by ceph? And thus which 
one is wrong: the 'ceph osd reweight' output or the 'ceph osd df' output. 


-Original Message-
From: Marc Roos 
Sent: woensdag 25 april 2018 11:58
To: ceph-users
Subject: [ceph-users] ceph osd reweight (doing -1 or actually 
-0.0001)


Is there some logic behind why ceph is doing this -1, or is this 
some 
coding error?

0.8 gives 0.79999, and 0.80001 gives 0.80000

(ceph 12.2.4)


[@~]# ceph osd reweight 11 0.8
reweighted osd.11 to 0.8 (cccc)

[@~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
11   hdd 3.00000  0.79999 2794G  2503G   290G 89.59 1.33  38



[@~]# ceph osd reweight 11 0.80001
reweighted osd.11 to 0.80001 (cccd)

[@~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
11   hdd 3.00000  0.80000 2794G  2503G   290G 89.59 1.33  38




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
 





-- 

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Collecting BlueStore per Object DB overhead

2018-04-27 Thread Dietmar Rieder
Hi Wido,

thanks for the tool. Here are some stats from our cluster:

Ceph 12.2.4, 240 OSDs, CephFS only

        onodes    db_used_bytes   avg_obj_size   overhead_per_obj

Mean    214871    1574830080      2082298        7607
Max     309855    3018850304      3349799        17753
Min     61390     203423744       285059         3219
STDEV   63324     561838256       726776         2990

See the attached plot as well.

HTH
   Dietmar

On 04/26/2018 08:35 PM, Wido den Hollander wrote:
> Hi,
> 
> I've been investigating the per object overhead for BlueStore as I've
> seen this has become a topic for a lot of people who want to store a lot
> of small objects in Ceph using BlueStore.
> 
> I've written a piece of Python code which can be run on a server
> running OSDs and will print the overhead.
> 
> https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
> 
> Feedback on this script is welcome, but also the output of what people
> are observing.
> 
> The results from my tests are below, but what I see is that the overhead
> seems to range from 10kB to 30kB per object.
> 
> On RBD-only clusters the overhead seems to be around 11kB, but on
> clusters with a RGW workload the overhead goes higher to 20kB.
> 
> I know that partial overwrites and appends contribute to higher overhead
> on objects and I'm trying to investigate this and share my information
> with the community.
> 
> I have two use-cases who want to store >2 billion objects with a avg
> object size of 50kB (8 - 80kB) and the RocksDB overhead is likely to
> become a big problem.
> 
> Anybody willing to share the overhead they are seeing with what use-case?
> 
> The more data we have on this the better we can estimate how DBs need to
> be sized for BlueStore deployments.
> 
> Wido
> 
> # Cluster #1
> osd.25 onodes=178572 db_used_bytes=2188378112 avg_obj_size=6196529
> overhead=12254
> osd.20 onodes=209871 db_used_bytes=2307915776 avg_obj_size=5452002
> overhead=10996
> osd.10 onodes=195502 db_used_bytes=2395996160 avg_obj_size=6013645
> overhead=12255
> osd.30 onodes=186172 db_used_bytes=2393899008 avg_obj_size=6359453
> overhead=12858
> osd.1 onodes=169911 db_used_bytes=1799356416 avg_obj_size=4890883
> overhead=10589
> osd.0 onodes=199658 db_used_bytes=2028994560 avg_obj_size=4835928
> overhead=10162
> osd.15 onodes=204015 db_used_bytes=2384461824 avg_obj_size=5722715
> overhead=11687
> 
> # Cluster #2
> osd.1 onodes=221735 db_used_bytes=2773483520 avg_obj_size=5742992
> overhead_per_obj=12508
> osd.0 onodes=196817 db_used_bytes=2651848704 avg_obj_size=6454248
> overhead_per_obj=13473
> osd.3 onodes=212401 db_used_bytes=2745171968 avg_obj_size=6004150
> overhead_per_obj=12924
> osd.2 onodes=185757 db_used_bytes=356722 avg_obj_size=5359974
> overhead_per_obj=19203
> osd.5 onodes=198822 db_used_bytes=3033530368 avg_obj_size=6765679
> overhead_per_obj=15257
> osd.4 onodes=161142 db_used_bytes=2136997888 avg_obj_size=6377323
> overhead_per_obj=13261
> osd.7 onodes=158951 db_used_bytes=1836056576 avg_obj_size=5247527
> overhead_per_obj=11551
> osd.6 onodes=178874 db_used_bytes=2542796800 avg_obj_size=6539688
> overhead_per_obj=14215
> osd.9 onodes=195166 db_used_bytes=2538602496 avg_obj_size=6237672
> overhead_per_obj=13007
> osd.8 onodes=203946 db_used_bytes=3279945728 avg_obj_size=6523555
> overhead_per_obj=16082
> 
> # Cluster 3
> osd.133 onodes=68558 db_used_bytes=15868100608 avg_obj_size=14743206
> overhead_per_obj=231455
> osd.132 onodes=60164 db_used_bytes=13911457792 avg_obj_size=14539445
> overhead_per_obj=231225
> osd.137 onodes=62259 db_used_bytes=15597568000 avg_obj_size=15138484
> overhead_per_obj=250527
> osd.136 onodes=70361 db_used_bytes=14540603392 avg_obj_size=13729154
> overhead_per_obj=206657
> osd.135 onodes=68003 db_used_bytes=12285116416 avg_obj_size=12877744
> overhead_per_obj=180655
> osd.134 onodes=64962 db_used_bytes=14056161280 avg_obj_size=15923550
> overhead_per_obj=216375
> osd.139 onodes=68016 db_used_bytes=20782776320 avg_obj_size=13619345
> overhead_per_obj=305557
> osd.138 onodes=66209 db_used_bytes=12850298880 avg_obj_size=14593418
> overhead_per_obj=194086
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] *** SPAM *** Re: Multi-MDS Failover

2018-04-27 Thread Dan van der Ster
Hi Scott,

Multi MDS just assigns different parts of the namespace to different
"ranks". Each rank (0, 1, 2, ...) is handled by one of the active
MDSs. (You can query which parts of the name space are assigned to
each rank using the jq tricks in [1]). If a rank is down and there are
no more standbys, then you need to bring up a new MDS to handle that
down rank. In the meantime, part of the namespace will have IO
blocked.
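
The jq trick boils down to something like this (a sketch from memory; the
admin-socket command is "get subtrees", but the field names may differ
slightly between versions, so check the output on your MDS first):

ceph daemon mds.<name> get subtrees | jq '.[] | [.dir.path, .auth_first]'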

To handle these failures, you need to configure sufficient standby
MDSs to handle the failure scenarios you foresee in your environment.
A strictly "standby" MDS can take over from *any* of the failed ranks,
and you can have several "standby" MDSs to cover multiple failures. So
just run 2 or 3 standbys if you want to be on the safe side.

You can also configure "standby-for-rank" MDSs -- that is, a given
standby MDS can watch a specific rank and take over if that specific
MDS fails. Those standby-for-rank MDSs can even be "hot" standbys to
speed up the failover process.
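
For reference, a minimal ceph.conf sketch of a dedicated standby-for-rank
(the daemon name "c" and the rank are made-up values for illustration; see the
docs for the exact option spelling in your release):

[mds.c]
mds_standby_for_rank = 0
mds_standby_replay = true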

An active MDS for a given rank does not act as a standby for the other
ranks. I'm not sure if it *could* following some code changes, but
anyway that is just not how it works today.

Does that clarify things?

Cheers, Dan

[1] https://ceph.com/community/new-luminous-cephfs-subtree-pinning/


On Fri, Apr 27, 2018 at 4:04 AM, Scottix  wrote:
> Ok let me try to explain this better, we are doing this back and forth and
> its not going anywhere. I'll just be as genuine as I can and explain the
> issue.
>
> What we are testing is a critical failure scenario and actually more of a
> real world scenario. Basically just what happens when it is 1AM and the shit
> hits the fan, half of your servers are down and 1 of the 3 MDS boxes are
> still alive.
> There is one very important fact that happens with CephFS and when the
> single Active MDS server fails. It is guaranteed 100% all IO is blocked. No
> split-brain, no corrupted data, 100% guaranteed ever since we started using
> CephFS
>
> Now with multi_mds, I understand this changes the logic and I understand how
> difficult and how hard this problem is, trust me I would not be able to
> tackle this. Basically I need to answer the question; what happens when 1 of
> 2 multi_mds fails with no standbys ready to come save them?
> What I have tested is not the same of a single active MDS; this absolutely
> changes the logic of what happens and how we troubleshoot. The CephFS is
> still alive and it does allow operations and does allow resources to go
> through. How, why and what is affected are very relevant questions if this
> is what the failure looks like since it is not 100% blocking.
>
> This is the problem, I have programs writing a massive amount of data and I
> don't want it corrupted or lost. I need to know what happens and I need to
> have guarantees.
>
> Best
>
>
> On Thu, Apr 26, 2018 at 5:03 PM Patrick Donnelly 
> wrote:
>>
>> On Thu, Apr 26, 2018 at 4:40 PM, Scottix  wrote:
>> >> Of course -- the mons can't tell the difference!
>> > That is really unfortunate, it would be nice to know if the filesystem
>> > has
>> > been degraded and to what degree.
>>
>> If a rank is laggy/crashed, the file system as a whole is generally
>> unavailable. The span between partial outage and full is small and not
>> worth quantifying.
>>
>> >> You must have standbys for high availability. This is the docs.
>> > Ok but what if you have your standby go down and a master go down. This
>> > could happen in the real world and is a valid error scenario.
>> >Also there is
>> > a period between when the standby becomes active what happens in-between
>> > that time?
>>
>> The standby MDS goes through a series of states where it recovers the
>> lost state and connections with clients. Finally, it goes active.
>>
>> >> It depends(tm) on how the metadata is distributed and what locks are
>> > held by each MDS.
>> > Your saying depending on which mds had a lock on a resource it will
>> > block
>> > that particular POSIX operation? Can you clarify a little bit?
>> >
>> >> Standbys are not optional in any production cluster.
>> > Of course in production I would hope people have standbys but in theory
>> > there is no enforcement in Ceph for this other than a warning. So when
>> > you
>> > say not optional that is not exactly true it will still run.
>>
>> It's self-defeating to expect CephFS to enforce having standbys --
>> presumably by throwing an error or becoming unavailable -- when the
>> standbys exist to make the system available.
>>
>> There's nothing to enforce. A warning is sufficient for the operator
>> that (a) they didn't configure any standbys or (b) MDS daemon
>> processes/boxes are going away and not coming back as standbys (i.e.
>> the pool of MDS daemons is decreasing with each failover)
>>
>> --
>> Patrick Donnelly
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com

[ceph-users] ceph-mgr not able to modify max_misplaced in 12.2.4

2018-04-27 Thread nokia ceph
Hi Team,

I was trying to modify the max_misplaced parameter in 12.2.4 as per the
documentation, however I am not able to modify it and get the following error:

#ceph config set mgr mgr/balancer/max_misplaced .06
Invalid command:  unused arguments: [u'.06']
config set   :  Set a configuration option at runtime (not
persistent)
Error EINVAL: invalid command

Also, where can I find the balancer module configuration file? It is not
available in /var/lib/ceph/mgr:

cn6.chn6m1c1ru1c1.cdn ~# cd /var/lib/ceph/mgr/
cn6.chn6m1c1ru1c1.cdn /var/lib/ceph/mgr# ls
cn6.chn6m1c1ru1c1.cdn /var/lib/ceph/mgr#
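
(I wonder whether on 12.2.x the balancer settings live in the cluster's
config-key store rather than in a file, i.e. something like the following,
which I have not verified against the Luminous docs:)

ceph config-key set mgr/balancer/max_misplaced .06
ceph config-key dump | grep balancer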

Thanks,
Muthu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com