Re: [ceph-users] Upgrade Luminous to mimic on Ubuntu 18.04

2019-02-18 Thread Ketil Froyn
I think there may be something wrong with the apt repository for
bionic, actually. Compare the packages available for Xenial:

https://download.ceph.com/debian-luminous/dists/xenial/main/binary-amd64/Packages

to the ones available for Bionic:

https://download.ceph.com/debian-luminous/dists/bionic/main/binary-amd64/Packages

The only package listed in the repository for bionic is ceph-deploy,
while there's lots for xenial. A quick summary:

$ curl -s https://download.ceph.com/debian-luminous/dists/bionic/main/binary-amd64/Packages | grep ^Package | wc -l
1
$ curl -s https://download.ceph.com/debian-luminous/dists/xenial/main/binary-amd64/Packages | grep ^Package | wc -l
63
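
If it helps, a quick way to cross-check what apt itself resolves from that
repository (this assumes the download.ceph.com luminous repo is the one
configured in sources.list):

$ grep -r ceph /etc/apt/sources.list /etc/apt/sources.list.d/
$ apt-get update
$ apt-cache policy ceph      # candidate version and which repo it comes from
$ apt-cache madison ceph     # every available version per origin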

Ketil

On Tue, 19 Feb 2019 at 02:10, David Turner  wrote:
>
> Everybody is just confused that you don't have a newer version of Ceph 
> available. Are you running `apt-get dist-upgrade` to upgrade ceph? Do you 
> have any packages being held back? There is no reason that Ubuntu 18.04 
> shouldn't be able to upgrade to 12.2.11.
>
> On Mon, Feb 18, 2019, 4:38 PM >
>> Hello people,
>>
>> Am 11. Februar 2019 12:47:36 MEZ schrieb c...@elchaka.de:
>> >Hello Ashley,
>> >
>> >Am 9. Februar 2019 17:30:31 MEZ schrieb Ashley Merrick
>> >:
>> >>What does the output of apt-get update look like on one of the nodes?
>> >>
>> >>You can just list the lines that mention CEPH
>> >>
>> >
>> >... .. .
>> >Get:6 Https://Download.ceph.com/debian-luminous bionic InRelease [8393
>> >B]
>> >... .. .
>> >
>> >The Last available is 12.2.8.
>>
>> Any advice or recommends on how to proceed to be able to Update to 
>> mimic/(nautilus)?
>>
>> - Mehmet
>> >
>> >- Mehmet
>> >
>> >>Thanks
>> >>
>> >>On Sun, 10 Feb 2019 at 12:28 AM,  wrote:
>> >>
>> >>> Hello Ashley,
>> >>>
>> >>> Thank you for this fast response.
>> >>>
>> >>> I cannt prove this jet but i am using already cephs own repo for
>> >>Ubuntu
>> >>> 18.04 and this 12.2.7/8 is the latest available there...
>> >>>
>> >>> - Mehmet
>> >>>
>> >>> Am 9. Februar 2019 17:21:32 MEZ schrieb Ashley Merrick <
>> >>> singap...@amerrick.co.uk>:
>> >>> >Around available versions, are you using the Ubuntu repo’s or the
>> >>CEPH
>> >>> >18.04 repo.
>> >>> >
>> >>> >The updates will always be slower to reach you if your waiting for
>> >>it
>> >>> >to
>> >>> >hit the Ubuntu repo vs adding CEPH’s own.
>> >>> >
>> >>> >
>> >>> >On Sun, 10 Feb 2019 at 12:19 AM,  wrote:
>> >>> >
>> >>> >> Hello m8s,
>> >>> >>
>> >>> >> Im curious how we should do an Upgrade of our ceph Cluster on
>> >>Ubuntu
>> >>> >> 16/18.04. As (At least on our 18.04 nodes) we only have 12.2.7
>> >(or
>> >>> >.8?)
>> >>> >>
>> >>> >> For an Upgrade to mimic we should First Update to Last version,
>> >>> >actualy
>> >>> >> 12.2.11 (iirc).
>> >>> >> Which is not possible on 18.04.
>> >>> >>
>> >>> >> Is there a Update path from 12.2.7/8 to actual mimic release or
>> >>> >better the
>> >>> >> upcoming nautilus?
>> >>> >>
>> >>> >> Any advice?
>> >>> >>
>> >>> >> - Mehmet___
>> >>> >> ceph-users mailing list
>> >>> >> ceph-users@lists.ceph.com
>> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>> >>
>> >>> ___
>> >>> ceph-users mailing list
>> >>> ceph-users@lists.ceph.com
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>>
>> >___
>> >ceph-users mailing list
>> >ceph-users@lists.ceph.com
>> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
-Ketil
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placing replaced disks to correct buckets.

2019-02-18 Thread Eugen Block

Hi,


We skipped stage 1 and replaced the UUIDs of old disks with the new
ones in the policy.cfg
We ran salt '*' pillar.items and confirmed that the output was correct.
It showed the new UUIDs in the correct places.
Next we ran  salt-run state.orch ceph.stage.3
PS: All of the above ran successfully.


You should lead with the information that you're using SES, otherwise
misunderstandings are likely to come up.
Anyway, if you change the policy.cfg you should run stage.2 to make
sure the changes are applied. Although you state that pillar.items
shows the correct values, I would recommend running that (short) stage,
too.



The output of ceph osd tree showed that these new disks are currently
in a ghost bucket, not even under root=default and without a weight.


Where is the respective host listed in the tree? Can you show a ceph
osd tree, please?
Did you remove all OSDs of one host, so that the complete host is in the
"ghost bucket", or just single OSDs? Are other OSDs on that host listed
correctly?
Since DeepSea is not aware of the CRUSH map, it can't figure out which
bucket it should put the new OSDs in. So this part is not automated
(yet?); you have to do it yourself. But if the host containing the
replaced OSDs is already placed correctly, then there's definitely
something wrong.



The first step I then tried was to reweight them but found errors
below:
Error ENOENT: device osd. does not appear in the crush map
Error ENOENT: unable to set item id 39 name 'osd.39' weight 5.45599 at
location
{host=veeam-mk2-rack1-osd3,rack=veeam-mk2-rack1,room=veeam-mk2,root=veeam}:
does not exist


You can't reweight the OSD if it's not in a bucket yet; try to move it
to its dedicated bucket first, if possible.
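
A minimal sketch of that, reusing the id, weight and bucket names from the
error message above purely as examples:

ceph osd crush add osd.39 5.45599 host=veeam-mk2-rack1-osd3
# or, if the item already exists somewhere in the CRUSH map:
ceph osd crush create-or-move osd.39 5.45599 host=veeam-mk2-rack1-osd3
ceph osd tree    # verify the OSD now sits under the expected host/rack/root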


As already requested by Konstantin, please paste your osd tree.

Regards,
Eugen


Zitat von John Molefe :


Hi David

Removal process/commands ran as follows:

#ceph osd crush reweight osd. 0
#ceph osd out 
#systemctl stop ceph-osd@
#umount /var/lib/ceph/osd/ceph-

#ceph osd crush remove osd.
#ceph auth del osd.
#ceph osd rm 
#ceph-disk zap /dev/sd??

Adding them back on:

We skipped stage 1 and replaced the UUIDs of old disks with the new
ones in the policy.cfg
We ran salt '*' pillar.items and confirmed that the output was correct.
It showed the new UUIDs in the correct places.
Next we ran  salt-run state.orch ceph.stage.3
PS: All of the above ran successfully.

The output of ceph osd tree showed that these new disks are currently
in a ghost bucket, not even under root=default and without a weight.

The first step I then tried was to reweight them but found errors
below:
Error ENOENT: device osd. does not appear in the crush map
Error ENOENT: unable to set item id 39 name 'osd.39' weight 5.45599 at
location
{host=veeam-mk2-rack1-osd3,rack=veeam-mk2-rack1,room=veeam-mk2,root=veeam}:
does not exist

But when I run the command: ceph osd find 
v-cph-admin:/testing # ceph osd find 39
{
"osd": 39,
"ip": "143.160.78.97:6870\/24436",
"crush_location": {}
}

Please let me know if there's any other info that you may need to
assist

Regards
J.

David Turner  2019/02/18 17:08 >>>

Also what commands did you run to remove the failed HDDs and the
commands you have so far run to add their replacements back in?

On Sat, Feb 16, 2019 at 9:55 PM Konstantin Shalygin 
wrote:




I recently replaced failed HDDs and removed them from their respective
buckets as per procedure. But I’m now facing an issue when trying to
place new ones back into the buckets. I’m getting an error of ‘osd nr
not found’ OR ‘file or directory not found’ OR command sintax error. I
have been using the commands below: ceph osd crush set  
ceph osd crush set. I do
however find the OSD number when i run command: ceph osd find  Your
assistance/response to this will be highly appreciated. Regards John.
Please, paste your `ceph osd tree`, your version and what exactly error
you get include osd number.
Less obfuscation is better in this, perhaps, simple case.


k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placing replaced disks to correct buckets.

2019-02-18 Thread John Molefe
Hi David

Removal process/commands ran as follows:

#ceph osd crush reweight osd. 0
#ceph osd out 
#systemctl stop ceph-osd@
#umount /var/lib/ceph/osd/ceph-

#ceph osd crush remove osd.
#ceph auth del osd.
#ceph osd rm 
#ceph-disk zap /dev/sd??

Adding them back on:

We skipped stage 1 and replaced the UUIDs of old disks with the new
ones in the policy.cfg
We ran salt '*' pillar.items and confirmed that the output was correct.
It showed the new UUIDs in the correct places.
Next we ran  salt-run state.orch ceph.stage.3
PS: All of the above ran successfully.

The output of ceph osd tree showed that these new disks are currently
in a ghost bucket, not even under root=default and without a weight.

The first step I then tried was to reweight them but found errors
below:
Error ENOENT: device osd. does not appear in the crush map
Error ENOENT: unable to set item id 39 name 'osd.39' weight 5.45599 at
location
{host=veeam-mk2-rack1-osd3,rack=veeam-mk2-rack1,room=veeam-mk2,root=veeam}:
does not exist

But when I run the command: ceph osd find 
v-cph-admin:/testing # ceph osd find 39
{
"osd": 39,
"ip": "143.160.78.97:6870\/24436",
"crush_location": {}
}

Please let me know if there's any other info that you may need to
assist

Regards
J.
>>> David Turner  2019/02/18 17:08 >>>
Also what commands did you run to remove the failed HDDs and the
commands you have so far run to add their replacements back in?

On Sat, Feb 16, 2019 at 9:55 PM Konstantin Shalygin 
wrote:




I recently replaced failed HDDs and removed them from their respective
buckets as per procedure. But I’m now facing an issue when trying to
place new ones back into the buckets. I’m getting an error of ‘osd nr
not found’ OR ‘file or directory not found’ OR command sintax error. I
have been using the commands below: ceph osd crush set  
ceph osd crush set. I do
however find the OSD number when i run command: ceph osd find  Your
assistance/response to this will be highly appreciated. Regards John.  
Please, paste your `ceph osd tree`, your version and what exactly error
you get include osd number.
Less obfuscation is better in this, perhaps, simple case.

 
k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-18 Thread Konstantin Shalygin

On 2/18/19 9:43 PM, David Turner wrote:
Do you have historical data from these OSDs to see when/if the DB used 
on osd.73 ever filled up?  To account for this OSD using the slow 
storage for DB, all we need to do is show that it filled up the fast 
DB at least once.  If that happened, then something spilled over to 
the slow storage and has been there ever since.


Yes, I have. I also checked my JIRA records for what I was doing at those
times and marked it on the timeline: [1]


Another graph compares osd.(33|73) over the last year: [2]


[1] https://ibb.co/F7smCxW

[2] https://ibb.co/dKWWDzW
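
For reference, the BlueFS usage (and any spillover to the slow device) can
also be read directly from the OSD's admin socket; the counter names below
are as found in luminous/mimic builds, with osd.73 as in this thread:

ceph daemon osd.73 perf dump | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes|slow_total_bytes'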

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade Luminous to mimic on Ubuntu 18.04

2019-02-18 Thread David Turner
Everybody is just confused that you don't have a newer version of Ceph
available. Are you running `apt-get dist-upgrade` to upgrade ceph? Do you
have any packages being held back? There is no reason that Ubuntu 18.04
shouldn't be able to upgrade to 12.2.11.
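
A hedged set of checks for exactly those questions (package name and repo
are assumed to be the standard ceph package from download.ceph.com):

$ apt-get update
$ apt-cache policy ceph     # which version apt would install, and from which repo
$ apt-mark showhold         # any held-back packages?
$ apt-get dist-upgrade      # unlike plain upgrade, this will change dependencies if needed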

On Mon, Feb 18, 2019, 4:38 PM  Hello people,
>
> Am 11. Februar 2019 12:47:36 MEZ schrieb c...@elchaka.de:
> >Hello Ashley,
> >
> >Am 9. Februar 2019 17:30:31 MEZ schrieb Ashley Merrick
> >:
> >>What does the output of apt-get update look like on one of the nodes?
> >>
> >>You can just list the lines that mention CEPH
> >>
> >
> >... .. .
> >Get:6 Https://Download.ceph.com/debian-luminous bionic InRelease [8393
> >B]
> >... .. .
> >
> >The Last available is 12.2.8.
>
> Any advice or recommends on how to proceed to be able to Update to
> mimic/(nautilus)?
>
> - Mehmet
> >
> >- Mehmet
> >
> >>Thanks
> >>
> >>On Sun, 10 Feb 2019 at 12:28 AM,  wrote:
> >>
> >>> Hello Ashley,
> >>>
> >>> Thank you for this fast response.
> >>>
> >>> I cannt prove this jet but i am using already cephs own repo for
> >>Ubuntu
> >>> 18.04 and this 12.2.7/8 is the latest available there...
> >>>
> >>> - Mehmet
> >>>
> >>> Am 9. Februar 2019 17:21:32 MEZ schrieb Ashley Merrick <
> >>> singap...@amerrick.co.uk>:
> >>> >Around available versions, are you using the Ubuntu repo’s or the
> >>CEPH
> >>> >18.04 repo.
> >>> >
> >>> >The updates will always be slower to reach you if your waiting for
> >>it
> >>> >to
> >>> >hit the Ubuntu repo vs adding CEPH’s own.
> >>> >
> >>> >
> >>> >On Sun, 10 Feb 2019 at 12:19 AM,  wrote:
> >>> >
> >>> >> Hello m8s,
> >>> >>
> >>> >> Im curious how we should do an Upgrade of our ceph Cluster on
> >>Ubuntu
> >>> >> 16/18.04. As (At least on our 18.04 nodes) we only have 12.2.7
> >(or
> >>> >.8?)
> >>> >>
> >>> >> For an Upgrade to mimic we should First Update to Last version,
> >>> >actualy
> >>> >> 12.2.11 (iirc).
> >>> >> Which is not possible on 18.04.
> >>> >>
> >>> >> Is there a Update path from 12.2.7/8 to actual mimic release or
> >>> >better the
> >>> >> upcoming nautilus?
> >>> >>
> >>> >> Any advice?
> >>> >>
> >>> >> - Mehmet___
> >>> >> ceph-users mailing list
> >>> >> ceph-users@lists.ceph.com
> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >___
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade Luminous to mimic on Ubuntu 18.04

2019-02-18 Thread ceph
Hello people,

Am 11. Februar 2019 12:47:36 MEZ schrieb c...@elchaka.de:
>Hello Ashley,
>
>Am 9. Februar 2019 17:30:31 MEZ schrieb Ashley Merrick
>:
>>What does the output of apt-get update look like on one of the nodes?
>>
>>You can just list the lines that mention CEPH
>>
>
>... .. .
>Get:6 Https://Download.ceph.com/debian-luminous bionic InRelease [8393
>B]
>... .. .
>
>The Last available is 12.2.8.

Any advice or recommendations on how to proceed to be able to update to
mimic/(nautilus)?

- Mehmet
>
>- Mehmet
>
>>Thanks
>>
>>On Sun, 10 Feb 2019 at 12:28 AM,  wrote:
>>
>>> Hello Ashley,
>>>
>>> Thank you for this fast response.
>>>
>>> I cannt prove this jet but i am using already cephs own repo for
>>Ubuntu
>>> 18.04 and this 12.2.7/8 is the latest available there...
>>>
>>> - Mehmet
>>>
>>> Am 9. Februar 2019 17:21:32 MEZ schrieb Ashley Merrick <
>>> singap...@amerrick.co.uk>:
>>> >Around available versions, are you using the Ubuntu repo’s or the
>>CEPH
>>> >18.04 repo.
>>> >
>>> >The updates will always be slower to reach you if your waiting for
>>it
>>> >to
>>> >hit the Ubuntu repo vs adding CEPH’s own.
>>> >
>>> >
>>> >On Sun, 10 Feb 2019 at 12:19 AM,  wrote:
>>> >
>>> >> Hello m8s,
>>> >>
>>> >> Im curious how we should do an Upgrade of our ceph Cluster on
>>Ubuntu
>>> >> 16/18.04. As (At least on our 18.04 nodes) we only have 12.2.7
>(or
>>> >.8?)
>>> >>
>>> >> For an Upgrade to mimic we should First Update to Last version,
>>> >actualy
>>> >> 12.2.11 (iirc).
>>> >> Which is not possible on 18.04.
>>> >>
>>> >> Is there a Update path from 12.2.7/8 to actual mimic release or
>>> >better the
>>> >> upcoming nautilus?
>>> >>
>>> >> Any advice?
>>> >>
>>> >> - Mehmet___
>>> >> ceph-users mailing list
>>> >> ceph-users@lists.ceph.com
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread Anthony D'Atri
On older releases, at least, inflated DBs correlated with miserable recovery 
performance and lots of slow requests.  The  DB and OSDs were also on HDD FWIW. 
  A single drive failure would result in substantial RBD impact.  

> On Feb 18, 2019, at 3:28 AM, Dan van der Ster  wrote:
> 
> Not really.
> 
> You should just restart your mons though -- if done one at a time it
> has zero impact on your clients.
> 
> -- dan
> 
> 
> On Mon, Feb 18, 2019 at 12:11 PM M Ranga Swami Reddy
>  wrote:
>> 
>> Hi Sage - If the mon data increases, is this impacts the ceph cluster
>> performance (ie on ceph osd bench, etc)?
>> 
>> On Fri, Feb 15, 2019 at 3:13 PM M Ranga Swami Reddy
>>  wrote:
>>> 
>>> today I again hit the warn with 30G also...
>>> 
 On Thu, Feb 14, 2019 at 7:39 PM Sage Weil  wrote:
 
> On Thu, 7 Feb 2019, Dan van der Ster wrote:
> On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
>  wrote:
>> 
>> Hi Dan,
>>> During backfilling scenarios, the mons keep old maps and grow quite
>>> quickly. So if you have balancing, pg splitting, etc. ongoing for
>>> awhile, the mon stores will eventually trigger that 15GB alarm.
>>> But the intended behavior is that once the PGs are all active+clean,
>>> the old maps should be trimmed and the disk space freed.
>> 
>> old maps not trimmed after cluster reached to "all+clean" state for all 
>> PGs.
>> Is there (known) bug here?
>> As the size of dB showing > 15G, do I need to run the compact commands
>> to do the trimming?
> 
> Compaction isn't necessary -- you should only need to restart all
> peon's then the leader. A few minutes later the db's should start
> trimming.
 
 The next time someone sees this behavior, can you please
 
 - enable debug_mon = 20 on all mons (*before* restarting)
   ceph tell mon.* injectargs '--debug-mon 20'
 - wait for 10 minutes or so to generate some logs
 - add 'debug mon = 20' to ceph.conf (on mons only)
 - restart the monitors
 - wait for them to start trimming
 - remove 'debug mon = 20' from ceph.conf (on mons only)
 - tar up the log files, ceph-post-file them, and share them with ticket
 http://tracker.ceph.com/issues/38322
 
 Thanks!
 sage
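
A minimal shell sketch of that capture sequence (mon ids, log paths and the
tarball name are assumptions; adjust to your deployment):

ceph tell mon.* injectargs '--debug-mon 20'
sleep 600                                    # let the mons generate some logs
# add 'debug mon = 20' to ceph.conf on the mon hosts, then restart them one at a time:
systemctl restart ceph-mon@$(hostname -s)
# after trimming is observed, remove 'debug mon = 20' from ceph.conf again, then:
tar czf mon-logs.tar.gz /var/log/ceph/ceph-mon.*.log
ceph-post-file mon-logs.tar.gz               # note the returned id in the tracker ticket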
 
 
 
 
> -- dan
> 
> 
>> 
>> Thanks
>> Swami
>> 
>>> On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> With HEALTH_OK a mon data dir should be under 2GB for even such a large 
>>> cluster.
>>> 
>>> During backfilling scenarios, the mons keep old maps and grow quite
>>> quickly. So if you have balancing, pg splitting, etc. ongoing for
>>> awhile, the mon stores will eventually trigger that 15GB alarm.
>>> But the intended behavior is that once the PGs are all active+clean,
>>> the old maps should be trimmed and the disk space freed.
>>> 
>>> However, several people have noted that (at least in luminous
>>> releases) the old maps are not trimmed until after HEALTH_OK *and* all
>>> mons are restarted. This ticket seems related:
>>> http://tracker.ceph.com/issues/37875
>>> 
>>> (Over here we're restarting mons every ~2-3 weeks, resulting in the
>>> mon stores dropping from >15GB to ~700MB each time).
>>> 
>>> -- Dan
>>> 
>>> 
 On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
 
 Hi Swami
 
 The limit is somewhat arbitrary, based on cluster sizes we had seen 
 when
 we picked it.  In your case it should be perfectly safe to increase it.
 
 sage
 
 
> On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> 
> Hello -  Are the any limits for mon_data_size for cluster with 2PB
> (with 2000+ OSDs)?
> 
> Currently it set as 15G. What is logic behind this? Can we increase
> when we get the mon_data_size_warn messages?
> 
> I am getting the mon_data_size_warn message even though there a ample
> of free space on the disk (around 300G free disk)
> 
> Earlier thread on the same discusion:
> https://www.spinics.net/lists/ceph-users/msg42456.html
> 
> Thanks
> Swami
> 
> 
> 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding EC properties for CephFS / small files.

2019-02-18 Thread Patrick Donnelly
Hello Jesper,

On Sat, Feb 16, 2019 at 11:11 PM  wrote:
>
> Hi List.
>
> I'm trying to understand the nuts and bolts of EC / CephFS
> We're running an EC4+2 pool on top of 72 x 7.2K rpm 10TB drives. Pretty
> slow bulk / archive storage.
>
> # getfattr -n ceph.dir.layout /mnt/home/cluster/mysqlbackup
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/home/cluster/mysqlbackup
> ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304
> pool=cephfs_data_ec42"
>
> This configuration is taken directly out of the online documentation:
> (Which may have been where it went all wrong from our perspective):

Correction: this is from the Ceph default for the file layout. The
default is that no file striping is performed and 4MB chunks are used
for file blocks. You may find this document instructive on how files
are striped (especially the ASCII art):

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst

> http://docs.ceph.com/docs/master/cephfs/file-layouts/
>
> Ok, this means that a 16MB file will be split at 4 chuncks of 4MB each
> with 2 erasure coding chuncks? I dont really understand the stripe_count
> element?

A 16 MB file would be split into 4 RADOS objects. Then those objects
would be distributed across OSDs according to the EC profile.

> And since erasure-coding works at the object level, striping individual
> objects across - here 4 replicas - it'll end up filling 16MB ? Or
> is there an internal optimization causing this not to be the case?
>
> Additionally, when reading the file, all 4 chunck need to be read to
> assemble the object. Causing (at a minumum) 4 IOPS per file.
>
> Now, my common file size is < 8MB and commonly 512KB files are on
> this pool.
>
> Will that cause a 512KB file to be padded to 4MB with 3 empty chuncks
> to fill the erasure coded profile and then 2 coding chuncks on top?
> In total 24MB for storing 512KB ?

No. Files do not always use the full 4MB chunk. The final chunk of the
file will be minimally sized. For example:

pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ cp /bin/grep .
pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ stat grep
  File: 'grep'
  Size: 211224    Blocks: 413        IO Block: 4194304  regular file
Device: 2ch/44d Inode: 1099511627836  Links: 1
Access: (0750/-rwxr-x---)  Uid: ( 1163/pdonnell)   Gid: ( 1163/pdonnell)
Access: 2019-02-18 14:02:11.503875296 -0500
Modify: 2019-02-18 14:02:11.523375657 -0500
Change: 2019-02-18 14:02:11.523375657 -0500
 Birth: -
pdonnell@senta02 ~/mnt/tmp.ZS9VCMhBWg$ printf %x 1099511627836
1000000003c
$ bin/rados -p cephfs.a.data stat 1000000003c.00000000
cephfs.a.data/1000000003c.00000000 mtime 2019-02-18 14:02:11.00, size 211224

So the object holding "grep" still only uses ~200KB and not 4MB.
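
For the erasure-coding overhead question above, a back-of-the-envelope
estimate for the 512KB case on an EC 4+2 pool (this ignores bluestore
min_alloc_size rounding, so treat it as an approximation):

k=4; m=2; file_kb=512
echo "$(( file_kb * (k + m) / k )) KB raw"   # -> 768 KB of raw space, not 24 MB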


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] IRC channels now require registered and identified users

2019-02-18 Thread David Turner
Is this still broken in the one-way direction where Slack users' comments do
not show up in IRC?  That would explain why nothing I ever type (whether
helping someone or asking a question) ever gets a response.

On Tue, Dec 18, 2018 at 6:50 AM Joao Eduardo Luis  wrote:

> On 12/18/2018 11:22 AM, Joao Eduardo Luis wrote:
> > On 12/18/2018 11:18 AM, Dan van der Ster wrote:
> >> Hi Joao,
> >>
> >> Has that broken the Slack connection? I can't tell if its broken or
> >> just quiet... last message on #ceph-devel was today at 1:13am.
> >
> > Just quiet, it seems. Just tested it and the bridge is still working.
>
> Okay, turns out the ceph-ircslackbot user is not identified, and that
> makes it unable to send messages to the channel. This means the bridge
> is working in one direction only (irc to slack), and will likely break
> when/if the user leaves the channel (as it won't be able to get back in).
>
> I will figure out just how this works today. In the mean time, I've
> relaxed the requirement for registered/identified users so that the bot
> works again. It will be reactivated once this is addressed.
>
>   -Joao
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Doubts about parameter "osd sleep recovery"

2019-02-18 Thread Fabio Abreu
Hi Jean-Charles ,

I will validate this config in my laboratory and production, and share the
results here.

Thanks.

Regards ,
Fabio Abreu

On Mon, Feb 18, 2019 at 3:18 PM Jean-Charles Lopez 
wrote:

> Hi Fabio,
>
> have a look here:
> https://github.com/ceph/ceph/blob/luminous/src/common/options.cc#L2355
>
> It’s designed to relieve the pressure generated by the recovery and
> backfill on both the drives and the network as it slows down these
> activities by introducing a sleep after these respectives ops.
>
> Regards
> JC
>
> On Feb 18, 2019, at 09:28, Fabio Abreu  wrote:
>
> Hi Everybody !
>
> I finding configure my cluster to receives news disks and pgs and after
> configure the main standard configuration too, I look the parameter "osd
> sleep recovery" to implement in production environment but I find just
> sample doc about this config.
>
> Someone have experience with this parameter ?
>
> Only discussion in the internet about this :
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025574.html
>
> My main configuration to receive new osds in Jewel 10.2.7 cluster :
>
> Before include new nodes :
> $ ceph tell osd.* injectargs '--osd-max-backfills 2'
> $ ceph tell osd.* injectargs '--osd-recovery-threads 1'
> $ ceph tell osd.* injectargs '--osd-recovery-op-priority 2'
> $ ceph tell osd.* injectargs '--osd-client-op-priority 63'
> $ ceph tell osd.* injectargs '--osd-recovery-max-active 2'
>
> After include new nodes
> $ ceph tell osd.* injectargs '--osd-max-backfills 1'
> $ ceph tell osd.* injectargs '--osd-recovery-threads 1'
> $ ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
> $ ceph tell osd.* injectargs '--osd-client-op-priority 63'
> $ ceph tell osd.* injectargs '--osd-recovery-max-active 1'
>
>
> Regards,
>
> Fabio Abreu Reis
> http://fajlinux.com.br
> *Tel : *+55 21 98244-0161
> *Skype : *fabioabreureis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>

-- 
Atenciosamente,
Fabio Abreu Reis
http://fajlinux.com.br
*Tel : *+55 21 98244-0161
*Skype : *fabioabreureis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - read latency.

2019-02-18 Thread Patrick Donnelly
On Sun, Feb 17, 2019 at 9:51 PM  wrote:
>
> > Probably not related to CephFS. Try to compare the latency you are
> > seeing to the op_r_latency reported by the OSDs.
> >
> > The fast_read option on the pool can also help a lot for this IO pattern.
>
> Magic, that actually cut the read-latency in half - making it more
> aligned with what to expect from the HW+network side:
>
> N          Min        Max        Median     Avg          Stddev
> x 100      0.015687   0.221538   0.025253   0.03259606   0.028827849
>
> 25ms as a median, 32ms average is still on the high side,
> but way, way better.
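
For anyone wanting to try the fast_read option mentioned above, it is a
per-pool setting on erasure-coded pools; the pool name here is just an
example:

ceph osd pool set cephfs_data_ec42 fast_read 1
ceph osd pool get cephfs_data_ec42 fast_read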

I'll use this opportunity to point out that serial archive programs
like tar are terrible for distributed file systems. It would be
awesome if someone multithreaded tar or extended it for asynchronous
I/O. If only I had more time (TM)...

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Doubts about parameter "osd sleep recovery"

2019-02-18 Thread Jean-Charles Lopez
Hi Fabio,

have a look here: 
https://github.com/ceph/ceph/blob/luminous/src/common/options.cc#L2355 


It’s designed to relieve the pressure generated by recovery and backfill on
both the drives and the network, as it slows down these activities by
introducing a sleep after the respective ops.
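
A hedged way to experiment with it on a running cluster (option name as in
the Luminous options list linked above; the value is in seconds, default 0):

$ ceph tell osd.* injectargs '--osd_recovery_sleep 0.1'
$ ceph daemon osd.0 config get osd_recovery_sleep   # confirm what one OSD is actually using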

Regards
JC

> On Feb 18, 2019, at 09:28, Fabio Abreu  wrote:
> 
> Hi Everybody !
> 
> I finding configure my cluster to receives news disks and pgs and after 
> configure the main standard configuration too, I look the parameter "osd 
> sleep recovery" to implement in production environment but I find just sample 
> doc about this config. 
> 
> Someone have experience with this parameter ? 
> 
> Only discussion in the internet about this :
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025574.html 
> 
> 
> My main configuration to receive new osds in Jewel 10.2.7 cluster : 
> 
> Before include new nodes : 
> $ ceph tell osd.* injectargs '--osd-max-backfills 2'
> $ ceph tell osd.* injectargs '--osd-recovery-threads 1'
> $ ceph tell osd.* injectargs '--osd-recovery-op-priority 2'
> $ ceph tell osd.* injectargs '--osd-client-op-priority 63'
> $ ceph tell osd.* injectargs '--osd-recovery-max-active 2'
> 
> After include new nodes 
> $ ceph tell osd.* injectargs '--osd-max-backfills 1'
> $ ceph tell osd.* injectargs '--osd-recovery-threads 1'
> $ ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
> $ ceph tell osd.* injectargs '--osd-client-op-priority 63'
> $ ceph tell osd.* injectargs '--osd-recovery-max-active 1'
> 
> 
> Regards, 
> 
> Fabio Abreu Reis
> http://fajlinux.com.br 
> Tel : +55 21 98244-0161
> Skype : fabioabreureis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph auth caps 'create rbd image' permission

2019-02-18 Thread Jason Dillaman
You could try something similar to what was described here [1]:

mon 'profile rbd' osd 'allow class-read object_prefix rbd_children,
allow r class-read object_prefix rbd_directory, allow r class-read
object_prefix rbd_id.', allow rwx object_prefix rbd_header., allow rwx
object_prefix rbd_data., allow rwx object_prefix rbd_object_map.'

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008320.html
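
A hedged example of applying caps like these to an existing key
(client.rbdreader is a placeholder name):

ceph auth caps client.rbdreader \
    mon 'profile rbd' \
    osd 'allow class-read object_prefix rbd_children, allow r class-read object_prefix rbd_directory, allow r class-read object_prefix rbd_id., allow rwx object_prefix rbd_header., allow rwx object_prefix rbd_data., allow rwx object_prefix rbd_object_map.'
ceph auth get client.rbdreader   # verify the resulting caps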

On Sat, Feb 16, 2019 at 5:43 AM Marc Roos  wrote:
>
>
> Currently I am using 'profile rbd' on mon and osd. Is it possible with
> the caps to allow a user to
>
> - List rbd images
> - get state of images
> - write/read to images
> Etc
>
> But do not allow to have it create new images?
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some ceph config parameters default values

2019-02-18 Thread Neha Ojha
On Sat, Feb 16, 2019 at 12:44 PM Oliver Freyermuth
 wrote:
>
> Dear Cephalopodians,
>
> in some recent threads on this list, I have read about the "knobs":
>
>   pglog_hardlimit (false by default, available at least with 12.2.11 and 
> 13.2.5)
>   bdev_enable_discard (false by default, advanced option, no description)
>   bdev_async_discard  (false by default, advanced option, no description)
>
> I am wondering about the defaults for these settings, and why these settings 
> seem mostly undocumented.
>
> It seems to me that on SSD / NVMe devices, you would always want to enable 
> discard for significantly increased lifetime,
> or run fstrim regularly (which you can't with bluestore since it's a 
> filesystem of its own). From personal experience,
> I have already lost two eMMC devices in Android phones early due to trimming 
> not working fine.
> Of course, on first generation SSD devices, "discard" may lead to data loss 
> (which for most devices has been fixed with firmware updates, though).
>
> I would presume that async-discard is also advantageous, since it seems to 
> queue the discards and work on these in bulk later
> instead of issuing them immediately (that's what I grasp from the code).
>
> Additionally, it's unclear to me whether the bdev-discard settings also 
> affect WAL/DB devices, which are very commonly SSD/NVMe devices
> in the Bluestore age.
>
> Concerning the pglog_hardlimit, I read on that list that it's safe and limits 
> maximum memory consumption especially for backfills / during recovery.
> So it "sounds" like this is also something that could be on by default. But 
> maybe that is not the case yet to allow downgrades after failed upgrades?

This flag will be on by default in nautilus; it is not on by default in
luminous and mimic in order to handle upgrades.
>
>
>
> So in the end, my question is:
> Is there a reason why these values are not on by default, and are also not 
> really mentioned in the documentation?
> Are they just "not ready yet" / unsafe to be on by default, or are the 
> defaults just like that because they have always been at this value,
> and defaults will change with the next major release (nautilus)?

We can certainly make this more explicit in our documentation.

Thanks,
Neha

>
> Cheers,
> Oliver
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Doubts about parameter "osd sleep recovery"

2019-02-18 Thread Fabio Abreu
Hi Everybody !

I am configuring my cluster to receive new disks and PGs, and after setting
the main standard configuration I looked at the parameter "osd
sleep recovery" to implement in our production environment, but I have only
found sparse documentation about this config.

Does anyone have experience with this parameter?

The only discussion on the internet about this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025574.html

My main configuration for receiving new OSDs in a Jewel 10.2.7 cluster:

Before including new nodes:
$ ceph tell osd.* injectargs '--osd-max-backfills 2'
$ ceph tell osd.* injectargs '--osd-recovery-threads 1'
$ ceph tell osd.* injectargs '--osd-recovery-op-priority 2'
$ ceph tell osd.* injectargs '--osd-client-op-priority 63'
$ ceph tell osd.* injectargs '--osd-recovery-max-active 2'

After including new nodes:
$ ceph tell osd.* injectargs '--osd-max-backfills 1'
$ ceph tell osd.* injectargs '--osd-recovery-threads 1'
$ ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
$ ceph tell osd.* injectargs '--osd-client-op-priority 63'
$ ceph tell osd.* injectargs '--osd-recovery-max-active 1'


Regards,

Fabio Abreu Reis
http://fajlinux.com.br
*Tel : *+55 21 98244-0161
*Skype : *fabioabreureis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-18 Thread Marc Roos
 
Why not just keep it bare metal? Especially with future Ceph 
upgrading/testing. I am running CentOS 7 with Luminous and libvirt on the 
nodes as well. If you configure them with a TLS/SSL connection, you can even 
nicely migrate a VM from one host/Ceph node to the other.
The next thing I am testing is Mesos, to use the Ceph nodes to run 
containers. I am still testing this on some VMs, but it looks like you only 
have to install a few RPMs (maybe around 300 MB) and two extra services on 
the nodes to get this up and running as well. (But keep in mind that the 
help on their mailing list is not as good as here ;))



-Original Message-
From: David Turner [mailto:drakonst...@gmail.com] 
Sent: 18 February 2019 17:31
To: ceph-users
Subject: [ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

I'm getting some "new" (to me) hardware that I'm going to upgrade my 
home Ceph cluster with.  Currently it's running a Proxmox cluster 
(Debian) which precludes me from upgrading to Mimic.  I am thinking 
about taking the opportunity to convert most of my VMs into containers 
and migrate my cluster into a K8s + Rook configuration now that Ceph is 
[1] stable on Rook.

I haven't ever configured a K8s cluster and am planning to test this out 
on VMs before moving to it with my live data.  Has anyone done a 
migration from a baremetal Ceph cluster into K8s + Rook?  Additionally 
what is a good way for a K8s beginner to get into managing a K8s 
cluster.  I see various places recommend either CoreOS or kubeadm for 
starting up a new K8s cluster but I don't know the pros/cons for either.

As far as migrating the Ceph services into Rook, I would assume that the 
process would be pretty simple to add/create new mons, mds, etc into 
Rook with the baremetal cluster details.  Once those are active and 
working just start decommissioning the services on baremetal.  For me, 
the OSD migration should be similar since I don't have any multi-device 
OSDs so I only need to worry about migrating individual disks between 
nodes.


[1] 
https://blog.rook.io/rook-v0-9-new-storage-backends-in-town-ab952523ec53



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-18 Thread Jeff Layton
On Mon, 2019-02-18 at 17:02 +0100, Paul Emmerich wrote:
> > > I've benchmarked a ~15% performance difference in IOPS between cache
> > > expiration time of 0 and 10 when running fio on a single file from a
> > > single client.
> > > 
> > > 
> > 
> > NFS iops? I'd guess more READ ops in particular? Is that with a
> > FSAL_CEPH backend?
> 
> Yes. But that take that with a grain of salt, that was just a quick
> and dirty test of a very specific scenario that may or may not be
> relevant.
> 
> 

Sure.

If the NFS iops go up when you remove a layer of caching, then that
suggests that you had a situation where the cache likely should have
been invalidated, but wasn't. Basically, you may be sacrificing cache
coherency for performance.

The bigger question I have is whether the ganesha mdcache provides any
performance gain when the attributes are already cached in the libcephfs
layer.

If we did want to start using the mdcache, then we'd almost certainly
want to invalidate that cache when libcephfs gives up caps. I just don't
see how the extra layer of caching provides much value in that
situation.
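
For comparison, a minimal sketch of an export with the attribute cache
effectively disabled so that the libcephfs caps do the work (FSAL_CEPH
assumed; paths, ids and the file name are placeholders, not a recommended
production config):

cat > /etc/ganesha/ganesha.conf.example <<'EOF'
EXPORT {
    Export_ID = 100;
    Path = /;
    Pseudo = /cephfs;
    Access_Type = RW;
    Attr_Expiration_Time = 0;   # rely on libcephfs/caps instead of the mdcache
    FSAL { Name = CEPH; }
}
EOF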


> > 
> > > > > On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton  
> > > > > wrote:
> > > > > > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> > > > > > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> > > > > > > Will Client query 'change' attribute every time before reading to 
> > > > > > > know
> > > > > > > if the data has been changed?
> > > > > > > 
> > > > > > >   
> > > > > > > +-+++-+---+
> > > > > > >   | Name| ID | Data Type  | Acc | Defined in  
> > > > > > >   |
> > > > > > >   
> > > > > > > +-+++-+---+
> > > > > > >   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1 
> > > > > > >   |
> > > > > > >   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2 
> > > > > > >   |
> > > > > > >   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3 
> > > > > > >   |
> > > > > > >   | change  | 3  | changeid4  | R   | Section 5.8.1.4 
> > > > > > >   |
> > > > > > >   | size| 4  | uint64_t   | R W | Section 5.8.1.5 
> > > > > > >   |
> > > > > > >   | link_support| 5  | bool   | R   | Section 5.8.1.6 
> > > > > > >   |
> > > > > > >   | symlink_support | 6  | bool   | R   | Section 5.8.1.7 
> > > > > > >   |
> > > > > > >   | named_attr  | 7  | bool   | R   | Section 5.8.1.8 
> > > > > > >   |
> > > > > > >   | fsid| 8  | fsid4  | R   | Section 5.8.1.9 
> > > > > > >   |
> > > > > > >   | unique_handles  | 9  | bool   | R   | Section 
> > > > > > > 5.8.1.10  |
> > > > > > >   | lease_time  | 10 | nfs_lease4 | R   | Section 
> > > > > > > 5.8.1.11  |
> > > > > > >   | rdattr_error| 11 | nfsstat4   | R   | Section 
> > > > > > > 5.8.1.12  |
> > > > > > >   | filehandle  | 19 | nfs_fh4| R   | Section 
> > > > > > > 5.8.1.13  |
> > > > > > >   
> > > > > > > +-+++-+---+
> > > > > > > 
> > > > > > 
> > > > > > Not every time -- only when the cache needs revalidation.
> > > > > > 
> > > > > > In the absence of a delegation, that happens on a timeout (see the
> > > > > > acregmin/acregmax settings in nfs(5)), though things like opens and 
> > > > > > file
> > > > > > locking events also affect when the client revalidates.
> > > > > > 
> > > > > > When the v4 client does revalidate the cache, it relies heavily on 
> > > > > > NFSv4
> > > > > > change attribute. Cephfs's change attribute is cluster-coherent 
> > > > > > too, so
> > > > > > if the client does revalidate it should see changes made on other
> > > > > > servers.
> > > > > > 
> > > > > > > On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton 
> > > > > > >  wrote:
> > > > > > > > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> > > > > > > > > Hi Jeff,
> > > > > > > > > Another question is about Client Caching when disabling 
> > > > > > > > > delegation.
> > > > > > > > > I set breakpoint on nfs4_op_read, which is OP_READ process 
> > > > > > > > > function in
> > > > > > > > > nfs-ganesha. Then I read a file, I found that it will hit 
> > > > > > > > > only once on
> > > > > > > > > the first time, which means latter reading operation on this 
> > > > > > > > > file will
> > > > > > > > > not trigger OP_READ. It will read the data from client side 
> > > > > > > > > cache. Is
> > > > > > > > > it right?
> > > > > > > > 
> > > > > > > > Yes. In the absence of a delegation, the client will 
> > > > > > > > periodically query
> > > > > > > > for the inode attributes, and will serve reads from the cache 
> > > > > > > > if it
> > > > > > > > looks like the file hasn't changed.
> > > > > > > > 
> > > > > > > > > I also checked the nfs client code in linux kernel. Only
> > > > > > > > > cache_validity is 

[ceph-users] Migrating a baremetal Ceph cluster into K8s + Rook

2019-02-18 Thread David Turner
I'm getting some "new" (to me) hardware that I'm going to upgrade my home
Ceph cluster with.  Currently it's running a Proxmox cluster (Debian) which
precludes me from upgrading to Mimic.  I am thinking about taking the
opportunity to convert most of my VMs into containers and migrate my
cluster into a K8s + Rook configuration now that Ceph is [1] stable on Rook.

I haven't ever configured a K8s cluster and am planning to test this out on
VMs before moving to it with my live data.  Has anyone done a migration
from a baremetal Ceph cluster into K8s + Rook?  Additionally what is a good
way for a K8s beginner to get into managing a K8s cluster.  I see various
places recommend either CoreOS or kubeadm for starting up a new K8s cluster
but I don't know the pros/cons for either.

As far as migrating the Ceph services into Rook, I would assume that the
process would be pretty simple to add/create new mons, mds, etc into Rook
with the baremetal cluster details.  Once those are active and working just
start decommissioning the services on baremetal.  For me, the OSD migration
should be similar since I don't have any multi-device OSDs so I only need
to worry about migrating individual disks between nodes.


[1] https://blog.rook.io/rook-v0-9-new-storage-backends-in-town-ab952523ec53
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-18 Thread Paul Emmerich
> >
> > I've benchmarked a ~15% performance difference in IOPS between cache
> > expiration time of 0 and 10 when running fio on a single file from a
> > single client.
> >
> >
>
> NFS iops? I'd guess more READ ops in particular? Is that with a
> FSAL_CEPH backend?

Yes. But take that with a grain of salt; it was just a quick
and dirty test of a very specific scenario that may or may not be
relevant.
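
Roughly the kind of test that was, for anyone who wants to reproduce it
(all parameters and the mount path are assumptions):

fio --name=nfs-single-file --filename=/mnt/nfs/testfile --size=1G \
    --rw=randread --bs=4k --iodepth=16 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based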


Paul

>
>
> >
> > >
> > > > On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton  
> > > > wrote:
> > > > > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> > > > > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> > > > > > Will Client query 'change' attribute every time before reading to 
> > > > > > know
> > > > > > if the data has been changed?
> > > > > >
> > > > > >   
> > > > > > +-+++-+---+
> > > > > >   | Name| ID | Data Type  | Acc | Defined in
> > > > > > |
> > > > > >   
> > > > > > +-+++-+---+
> > > > > >   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1   
> > > > > > |
> > > > > >   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2   
> > > > > > |
> > > > > >   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3   
> > > > > > |
> > > > > >   | change  | 3  | changeid4  | R   | Section 5.8.1.4   
> > > > > > |
> > > > > >   | size| 4  | uint64_t   | R W | Section 5.8.1.5   
> > > > > > |
> > > > > >   | link_support| 5  | bool   | R   | Section 5.8.1.6   
> > > > > > |
> > > > > >   | symlink_support | 6  | bool   | R   | Section 5.8.1.7   
> > > > > > |
> > > > > >   | named_attr  | 7  | bool   | R   | Section 5.8.1.8   
> > > > > > |
> > > > > >   | fsid| 8  | fsid4  | R   | Section 5.8.1.9   
> > > > > > |
> > > > > >   | unique_handles  | 9  | bool   | R   | Section 5.8.1.10  
> > > > > > |
> > > > > >   | lease_time  | 10 | nfs_lease4 | R   | Section 5.8.1.11  
> > > > > > |
> > > > > >   | rdattr_error| 11 | nfsstat4   | R   | Section 5.8.1.12  
> > > > > > |
> > > > > >   | filehandle  | 19 | nfs_fh4| R   | Section 5.8.1.13  
> > > > > > |
> > > > > >   
> > > > > > +-+++-+---+
> > > > > >
> > > > >
> > > > > Not every time -- only when the cache needs revalidation.
> > > > >
> > > > > In the absence of a delegation, that happens on a timeout (see the
> > > > > acregmin/acregmax settings in nfs(5)), though things like opens and 
> > > > > file
> > > > > locking events also affect when the client revalidates.
> > > > >
> > > > > When the v4 client does revalidate the cache, it relies heavily on 
> > > > > NFSv4
> > > > > change attribute. Cephfs's change attribute is cluster-coherent too, 
> > > > > so
> > > > > if the client does revalidate it should see changes made on other
> > > > > servers.
> > > > >
> > > > > > On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton 
> > > > > >  wrote:
> > > > > > > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> > > > > > > > Hi Jeff,
> > > > > > > > Another question is about Client Caching when disabling 
> > > > > > > > delegation.
> > > > > > > > I set breakpoint on nfs4_op_read, which is OP_READ process 
> > > > > > > > function in
> > > > > > > > nfs-ganesha. Then I read a file, I found that it will hit only 
> > > > > > > > once on
> > > > > > > > the first time, which means latter reading operation on this 
> > > > > > > > file will
> > > > > > > > not trigger OP_READ. It will read the data from client side 
> > > > > > > > cache. Is
> > > > > > > > it right?
> > > > > > >
> > > > > > > Yes. In the absence of a delegation, the client will periodically 
> > > > > > > query
> > > > > > > for the inode attributes, and will serve reads from the cache if 
> > > > > > > it
> > > > > > > looks like the file hasn't changed.
> > > > > > >
> > > > > > > > I also checked the nfs client code in linux kernel. Only
> > > > > > > > cache_validity is NFS_INO_INVALID_DATA, it will send OP_READ 
> > > > > > > > again,
> > > > > > > > like this:
> > > > > > > > if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> > > > > > > > ret = nfs_invalidate_mapping(inode, mapping);
> > > > > > > > }
> > > > > > > > This about this senario, client1 connect ganesha1 and client2 
> > > > > > > > connect
> > > > > > > > ganesha2. I read /1.txt on client1 and client1 will cache the 
> > > > > > > > data.
> > > > > > > > Then I modify this file on client2. At that time, how client1 
> > > > > > > > know the
> > > > > > > > file is modifed and how it will add NFS_INO_INVALID_DATA into
> > > > > > > > cache_validity?
> > > > > > >
> > > > > > > Once you modify the code on client2, ganesha2 will request the 
> > > > > > > necessary
> > > > > > > caps from the ceph MDS, and client1 will have 

Re: [ceph-users] CephFS: client hangs

2019-02-18 Thread Ashley Merrick
Correct, yes; in my experience the OSDs as well.
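
A hedged way to push the value to the running daemons without waiting for a
full restart cycle (injectargs may still report that a restart is required
for some daemons, and the ceph.conf change is what makes it permanent):

ceph tell mon.* injectargs '--mon_max_pg_per_osd 400'
ceph tell osd.* injectargs '--mon_max_pg_per_osd 400'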

On Mon, 18 Feb 2019 at 11:51 PM, Hennen, Christian <
christian.hen...@uni-trier.de> wrote:

> Hi!
>
> >mon_max_pg_per_osd = 400
> >
> >In the ceph.conf and then restart all the services / or inject the config
> >into the running admin
>
> I restarted all MONs, but I assume the OSDs need to be restarted as well?
>
> > MDS show a client got evicted. Nothing else looks abnormal.  Do new
> cephfs
> > clients also get evicted quickly?
>
> Yeah, it seems so. But strangely there is no indication of it in 'ceph -s'
> or
> 'ceph health detail'. And they don't seem to be evicted permanently? Right
> now, only 1 client is connected. The others are shut down since last week.
> 'ceph osd blacklist ls' shows 0 entries.
>
>
> Kind regards
> Christian Hennen
>
> Project Manager Infrastructural Services ZIMK University of Trier
> Germany
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: client hangs

2019-02-18 Thread Hennen, Christian
Hi!

>mon_max_pg_per_osd = 400
>
>In the ceph.conf and then restart all the services / or inject the config 
>into the running admin

I restarted all MONs, but I assume the OSDs need to be restarted as well?

> MDS show a client got evicted. Nothing else looks abnormal.  Do new cephfs 
> clients also get evicted quickly?

Yeah, it seems so. But strangely there is no indication of it in 'ceph -s' or 
'ceph health detail'. And they don't seem to be evicted permanently? Right 
now, only one client is connected. The others have been shut down since last week. 
'ceph osd blacklist ls' shows 0 entries.


Kind regards
Christian Hennen

Project Manager Infrastructural Services ZIMK University of Trier
Germany



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-18 Thread Jeff Layton
On Mon, 2019-02-18 at 16:40 +0100, Paul Emmerich wrote:
> > A call into libcephfs from ganesha to retrieve cached attributes is
> > mostly just in-memory copies within the same process, so any performance
> > overhead there is pretty minimal. If we need to go to the network to get
> > the attributes, then that was a case where the cache should have been
> > invalidated anyway, and we avoid having to check the validity of the
> > cache.
> 
> I've benchmarked a ~15% performance difference in IOPS between cache
> expiration time of 0 and 10 when running fio on a single file from a
> single client.
> 
> 

NFS iops? I'd guess more READ ops in particular? Is that with a
FSAL_CEPH backend?


> 
> > 
> > > On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton  
> > > wrote:
> > > > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> > > > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> > > > > Will Client query 'change' attribute every time before reading to know
> > > > > if the data has been changed?
> > > > > 
> > > > >   +-+++-+---+
> > > > >   | Name| ID | Data Type  | Acc | Defined in|
> > > > >   +-+++-+---+
> > > > >   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1   |
> > > > >   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
> > > > >   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3   |
> > > > >   | change  | 3  | changeid4  | R   | Section 5.8.1.4   |
> > > > >   | size| 4  | uint64_t   | R W | Section 5.8.1.5   |
> > > > >   | link_support| 5  | bool   | R   | Section 5.8.1.6   |
> > > > >   | symlink_support | 6  | bool   | R   | Section 5.8.1.7   |
> > > > >   | named_attr  | 7  | bool   | R   | Section 5.8.1.8   |
> > > > >   | fsid| 8  | fsid4  | R   | Section 5.8.1.9   |
> > > > >   | unique_handles  | 9  | bool   | R   | Section 5.8.1.10  |
> > > > >   | lease_time  | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
> > > > >   | rdattr_error| 11 | nfsstat4   | R   | Section 5.8.1.12  |
> > > > >   | filehandle  | 19 | nfs_fh4| R   | Section 5.8.1.13  |
> > > > >   +-+++-+---+
> > > > > 
> > > > 
> > > > Not every time -- only when the cache needs revalidation.
> > > > 
> > > > In the absence of a delegation, that happens on a timeout (see the
> > > > acregmin/acregmax settings in nfs(5)), though things like opens and file
> > > > locking events also affect when the client revalidates.
> > > > 
> > > > When the v4 client does revalidate the cache, it relies heavily on NFSv4
> > > > change attribute. Cephfs's change attribute is cluster-coherent too, so
> > > > if the client does revalidate it should see changes made on other
> > > > servers.
> > > > 
> > > > > On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton  
> > > > > wrote:
> > > > > > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> > > > > > > Hi Jeff,
> > > > > > > Another question is about Client Caching when disabling 
> > > > > > > delegation.
> > > > > > > I set breakpoint on nfs4_op_read, which is OP_READ process 
> > > > > > > function in
> > > > > > > nfs-ganesha. Then I read a file, I found that it will hit only 
> > > > > > > once on
> > > > > > > the first time, which means latter reading operation on this file 
> > > > > > > will
> > > > > > > not trigger OP_READ. It will read the data from client side 
> > > > > > > cache. Is
> > > > > > > it right?
> > > > > > 
> > > > > > Yes. In the absence of a delegation, the client will periodically 
> > > > > > query
> > > > > > for the inode attributes, and will serve reads from the cache if it
> > > > > > looks like the file hasn't changed.
> > > > > > 
> > > > > > > I also checked the nfs client code in linux kernel. Only
> > > > > > > cache_validity is NFS_INO_INVALID_DATA, it will send OP_READ 
> > > > > > > again,
> > > > > > > like this:
> > > > > > > if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> > > > > > > ret = nfs_invalidate_mapping(inode, mapping);
> > > > > > > }
> > > > > > > This about this senario, client1 connect ganesha1 and client2 
> > > > > > > connect
> > > > > > > ganesha2. I read /1.txt on client1 and client1 will cache the 
> > > > > > > data.
> > > > > > > Then I modify this file on client2. At that time, how client1 
> > > > > > > know the
> > > > > > > file is modifed and how it will add NFS_INO_INVALID_DATA into
> > > > > > > cache_validity?
> > > > > > 
> > > > > > Once you modify the code on client2, ganesha2 will request the 
> > > > > > necessary
> > > > > > caps from the ceph MDS, and client1 will have its caps revoked. 
> > > > > > It'll
> > > > > > then make the change.
> > > > > > 
> > > > > > When client1 reads again it will issue a GETATTR against the 

[ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-18 Thread David Turner
We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
(partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are
12.2.4 running CephFS and RBDs.  So in total we have 15 NVMe's per cluster
and 30 NVMe's in total.  They were all built at the same time and were
running firmware version QDV10130.  On this firmware version we early on
had 2 disks failures, a few months later we had 1 more, and then a month
after that (just a few weeks ago) we had 7 disk failures in 1 week.

The failures are such that the disk is no longer visible to the OS.  This
holds true beyond server reboots as well as placing the failed disks into a
new server.  With a firmware upgrade tool we got an error that pretty much
said there's no way to get data back and to RMA the disk.  We upgraded all
of our remaining disks' firmware to QDV101D1 and haven't had any problems
since then.  Most of our failures happened while rebalancing the cluster
after replacing dead disks and we tested rigorously around that use case
after upgrading the firmware.  This firmware version seems to have resolved
whatever the problem was.

We have about 100 more of these scattered among database servers and other
servers that have never had this problem while running the
QDV10130 firmware as well as firmwares between this one and the one we
upgraded to.  Bluestore on Ceph is the only use case we've had so far with
this sort of failure.

Has anyone else come across this issue before?  Our current theory is that
Bluestore is accessing the disk in a way that is triggering a bug in the
older firmware version that isn't triggered by more traditional
filesystems.  We have a scheduled call with Intel to discuss this, but
their preliminary searches into the bugfixes and known problems between
firmware versions didn't indicate the bug that we triggered.  It would be
good to have some more information about what those differences for disk
accessing might be to hopefully get a better answer from them as to what
the problem is.


[1]
https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
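
For anyone wanting to audit firmware levels across a fleet before/after an
update, a quick sketch (assumes nvme-cli and smartmontools are installed;
device names are examples):

  nvme list                                  # model, serial and "FW Rev" per controller
  smartctl -a /dev/nvme0 | grep -i firmware  # firmware version as reported by smartmontools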
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-18 Thread Paul Emmerich
>
> A call into libcephfs from ganesha to retrieve cached attributes is
> mostly just in-memory copies within the same process, so any performance
> overhead there is pretty minimal. If we need to go to the network to get
> the attributes, then that was a case where the cache should have been
> invalidated anyway, and we avoid having to check the validity of the
> cache.

I've benchmarked a ~15% performance difference in IOPS between cache
expiration time of 0 and 10 when running fio on a single file from a
single client.
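
For reference, a single-file fio run of the kind described above might look
like this (mount point, file name and job parameters are examples, not the
exact job used):

  fio --name=nfs-iops --filename=/mnt/cephfs-nfs/testfile --size=1G \
      --rw=randread --bs=4k --ioengine=libaio --iodepth=16 --direct=1 \
      --runtime=60 --time_based --group_reporting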


Paul

>
>
> > On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton  wrote:
> > > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> > > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> > > > Will Client query 'change' attribute every time before reading to know
> > > > if the data has been changed?
> > > >
> > > >   +-+++-+---+
> > > >   | Name| ID | Data Type  | Acc | Defined in|
> > > >   +-+++-+---+
> > > >   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1   |
> > > >   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
> > > >   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3   |
> > > >   | change  | 3  | changeid4  | R   | Section 5.8.1.4   |
> > > >   | size| 4  | uint64_t   | R W | Section 5.8.1.5   |
> > > >   | link_support| 5  | bool   | R   | Section 5.8.1.6   |
> > > >   | symlink_support | 6  | bool   | R   | Section 5.8.1.7   |
> > > >   | named_attr  | 7  | bool   | R   | Section 5.8.1.8   |
> > > >   | fsid| 8  | fsid4  | R   | Section 5.8.1.9   |
> > > >   | unique_handles  | 9  | bool   | R   | Section 5.8.1.10  |
> > > >   | lease_time  | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
> > > >   | rdattr_error| 11 | nfsstat4   | R   | Section 5.8.1.12  |
> > > >   | filehandle  | 19 | nfs_fh4| R   | Section 5.8.1.13  |
> > > >   +-+++-+---+
> > > >
> > >
> > > Not every time -- only when the cache needs revalidation.
> > >
> > > In the absence of a delegation, that happens on a timeout (see the
> > > acregmin/acregmax settings in nfs(5)), though things like opens and file
> > > locking events also affect when the client revalidates.
> > >
> > > When the v4 client does revalidate the cache, it relies heavily on NFSv4
> > > change attribute. Cephfs's change attribute is cluster-coherent too, so
> > > if the client does revalidate it should see changes made on other
> > > servers.
> > >
> > > > On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton  
> > > > wrote:
> > > > > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> > > > > > Hi Jeff,
> > > > > > Another question is about Client Caching when disabling delegation.
> > > > > > I set breakpoint on nfs4_op_read, which is OP_READ process function 
> > > > > > in
> > > > > > nfs-ganesha. Then I read a file, I found that it will hit only once 
> > > > > > on
> > > > > > the first time, which means later read operations on this file 
> > > > > > will
> > > > > > not trigger OP_READ. It will read the data from client side cache. 
> > > > > > Is
> > > > > > it right?
> > > > >
> > > > > Yes. In the absence of a delegation, the client will periodically 
> > > > > query
> > > > > for the inode attributes, and will serve reads from the cache if it
> > > > > looks like the file hasn't changed.
> > > > >
> > > > > > I also checked the nfs client code in linux kernel. Only
> > > > > > cache_validity is NFS_INO_INVALID_DATA, it will send OP_READ again,
> > > > > > like this:
> > > > > > if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> > > > > > ret = nfs_invalidate_mapping(inode, mapping);
> > > > > > }
> > > > > > Think about this scenario: client1 connects to ganesha1 and client2 
> > > > > > connects to
> > > > > > ganesha2. I read /1.txt on client1 and client1 will cache the data.
> > > > > > Then I modify this file on client2. At that time, how client1 know 
> > > > > > the
> > > > > > file is modified and how it will add NFS_INO_INVALID_DATA into
> > > > > > cache_validity?
> > > > >
> > > > > Once you modify the file on client2, ganesha2 will request the 
> > > > > necessary
> > > > > caps from the ceph MDS, and client1 will have its caps revoked. It'll
> > > > > then make the change.
> > > > >
> > > > > When client1 reads again it will issue a GETATTR against the file [1].
> > > > > ganesha1 will then request caps to do the getattr, which will end up
> > > > > revoking ganesha2's caps. client1 will then see the change in 
> > > > > attributes
> > > > > (the change attribute and mtime, most likely) and will invalidate the
> > > > > mapping, causing it to reissue a READ on the wire.
> > > > >
> > > > > [1]: There may be a window of time after you 

Re: [ceph-users] CephFS: client hangs

2019-02-18 Thread Yan, Zheng
On Mon, Feb 18, 2019 at 10:55 PM Hennen, Christian
 wrote:
>
> Dear Community,
>
>
>
> we are running a Ceph Luminous Cluster with CephFS (Bluestore OSDs). During 
> setup, we made the mistake of configuring the OSDs on RAID Volumes. Initially 
> our cluster consisted of 3 nodes, each housing 1 OSD. Currently, we are in 
> the process of remediating this. After a loss of metadata 
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025612.html) 
> due to resetting the journal (journal entries were not being flushed fast 
> enough), we managed to bring the cluster back up and started adding 2 
> additional nodes 
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027563.html) .
>
>
>
> After adding the two additional nodes, we increased the number of placement 
> groups to not only accommodate the new nodes, but also to prepare for 
> reinstallation of the misconfigured nodes. Since then, the number of 
> placement groups per OSD is too high of course. Despite this fact, cluster 
> health remained fine over the last few months.
>
>
>
> However, we are currently observing massive problems: Whenever we try to 
> access any folder via CephFS, e.g. by listing its contents, there is no 
> response. Clients are getting blacklisted, but there is no warning. ceph -s 
> shows everything is ok, except for the number of PGs being too high. If I 
> grep for „assert“ or „error“ in any of the logs, nothing comes up. Also, it 
> is not possible to reduce the number of active MDS to 1. After issuing ‚ceph 
> fs set fs_data max_mds 1‘ nothing happens.
>
>
>
> Cluster details are available here: https://gitlab.uni-trier.de/snippets/77
>
>
>
> The MDS log  
> (https://gitlab.uni-trier.de/snippets/79?expanded=true=simple) 
> contains no „nicely exporting to“ messages as usual, but instead these:
>
> 2019-02-15 08:44:52.464926 7fdb13474700  7 mds.0.server 
> try_open_auth_dirfrag: not auth for [dir 0x100011ce7c6 /home/r-admin/ 
> [2,head] rep@1.1 dir_auth=1 state=0 f(v4 m2019-02-14 13:19:41.300993 
> 80=48+32) n(v11339 rc2019-02-14 13:19:41.300993 b10116465260 10869=10202+667) 
> hs=7+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1 replicated=0 dirty=0 
> waiter=0 authpin=0 tempexporting=0 0x564343eed100], fw to mds.1
>
>

The MDS log shows a client got evicted. Nothing else looks abnormal.  Do new
cephfs clients also get evicted quickly?
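
A quick way to check for evictions/blacklisting (assuming the MDS id is the
short hostname; adjust to your naming):

  ceph osd blacklist ls                        # currently blacklisted client addresses
  ceph daemon mds.$(hostname -s) session ls    # run on the MDS host: client sessions and state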

>
> Updates from 12.2.8 to 12.2.11 I ran last week didn’t help.
>
>
>
> Anybody got an idea or a hint where I could look into next? Any help would be 
> greatly appreciated!
>
>
>
> Kind regards
>
> Christian Hennen
>
>
>
> Project Manager Infrastructural Services
> ZIMK University of Trier
>
> Germany
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placing replaced disks to correct buckets.

2019-02-18 Thread David Turner
Also what commands did you run to remove the failed HDDs and the commands
you have so far run to add their replacements back in?
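
For comparison, a typical removal and re-add sequence on recent releases looks
roughly like this (osd id, weight and host name are examples):

  # removing the failed disk
  ceph osd out 12
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12

  # placing the replacement back into its bucket
  ceph osd crush add osd.12 1.81940 host=node-a
  # or, for an item that already exists in the map:
  ceph osd crush create-or-move osd.12 1.81940 root=default host=node-a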

On Sat, Feb 16, 2019 at 9:55 PM Konstantin Shalygin  wrote:

> I recently replaced failed HDDs and removed them from their respective
> buckets as per procedure.
>
> But I’m now facing an issue when trying to place new ones back into the
> buckets. I’m getting an error of ‘osd nr not found’ OR ‘file or
> directory not found’ OR command sintax error.
>
> I have been using the commands below:
>
> ceph osd crush set   
> ceph osd crush  set   
>
> I do however find the OSD number when i run command:
>
> ceph osd find 
>
> Your assistance/response to this will be highly appreciated.
>
> Regards
> John.
>
>
> Please, paste your `ceph osd tree`, your version and what exactly error
> you get include osd number.
>
> Less obfuscation is better in this, perhaps, simple case.
>
>
> k
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: client hangs

2019-02-18 Thread Ashley Merrick
I know this may sound simple.

Have you tried raising the PG-per-OSD limit? I'm sure I have seen people in
the past with the same kind of issue as you, and it was just I/O being
blocked due to a limit but not actively logged.

mon_max_pg_per_osd = 400

Set it in ceph.conf and then restart all the services, or inject the config
into the running daemons.
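
Something like this, for example (400 is just the value mentioned above; note
that injected values may not take full effect for every daemon without a
restart):

  # ceph.conf on mons and osds
  [global]
  mon_max_pg_per_osd = 400

  # or try injecting into the running daemons
  ceph tell mon.* injectargs '--mon_max_pg_per_osd 400'
  ceph tell osd.* injectargs '--mon_max_pg_per_osd 400'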

On Mon, Feb 18, 2019 at 10:55 PM Hennen, Christian <
christian.hen...@uni-trier.de> wrote:

> Dear Community,
>
>
>
> we are running a Ceph Luminous Cluster with CephFS (Bluestore OSDs).
> During setup, we made the mistake of configuring the OSDs on RAID Volumes.
> Initially our cluster consisted of 3 nodes, each housing 1 OSD. Currently,
> we are in the process of remediating this. After a loss of metadata (
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025612.html)
> due to resetting the journal (journal entries were not being flushed fast
> enough), we managed to bring the cluster back up and started adding 2
> additional nodes (
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027563.html)
> .
>
>
>
> After adding the two additional nodes, we increased the number of
> placement groups to not only accommodate the new nodes, but also to prepare
> for reinstallation of the misconfigured nodes. Since then, the number of
> placement groups per OSD is too high of course. Despite this fact, cluster
> health remained fine over the last few months.
>
>
>
> However, we are currently observing massive problems: Whenever we try to
> access any folder via CephFS, e.g. by listing its contents, there is no
> response. Clients are getting blacklisted, but there is no warning. ceph -s
> shows everything is ok, except for the number of PGs being too high. If I
> grep for „assert“ or „error“ in any of the logs, nothing comes up. Also, it
> is not possible to reduce the number of active MDS to 1. After issuing
> ‚ceph fs set fs_data max_mds 1‘ nothing happens.
>
>
>
> Cluster details are available here:
> https://gitlab.uni-trier.de/snippets/77
>
>
>
> The MDS log  (
> https://gitlab.uni-trier.de/snippets/79?expanded=true=simple)
> contains no „nicely exporting to“ messages as usual, but instead these:
>
> 2019-02-15 08:44:52.464926 7fdb13474700  7 mds.0.server
> try_open_auth_dirfrag: not auth for [dir 0x100011ce7c6 /home/r-admin/
> [2,head] rep@1.1 dir_auth=1 state=0 f(v4 m2019-02-14 13:19:41.300993
> 80=48+32) n(v11339 rc2019-02-14 13:19:41.300993 b10116465260
> 10869=10202+667) hs=7+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1
> replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x564343eed100], fw
> to mds.1
>
>
>
> Updates from 12.2.8 to 12.2.11 I ran last week didn’t help.
>
>
>
> Anybody got an idea or a hint where I could look into next? Any help would
> be greatly appreciated!
>
>
>
> Kind regards
>
> Christian Hennen
>
>
>
> Project Manager Infrastructural Services
> ZIMK University of Trier
>
> Germany
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: client hangs

2019-02-18 Thread Hennen, Christian
Dear Community,

 

we are running a Ceph Luminous Cluster with CephFS (Bluestore OSDs). During
setup, we made the mistake of configuring the OSDs on RAID Volumes.
Initially our cluster consisted of 3 nodes, each housing 1 OSD. Currently,
we are in the process of remediating this. After a loss of metadata
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025612.html)
due to resetting the journal (journal entries were not being flushed fast
enough), we managed to bring the cluster back up and started adding 2
additional nodes
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027563.html)
.

 

After adding the two additional nodes, we increased the number of placement
groups to not only accommodate the new nodes, but also to prepare for
reinstallation of the misconfigured nodes. Since then, the number of
placement groups per OSD is too high of course. Despite this fact, cluster
health remained fine over the last few months.

 

However, we are currently observing massive problems: Whenever we try to
access any folder via CephFS, e.g. by listing its contents, there is no
response. Clients are getting blacklisted, but there is no warning. ceph -s
shows everything is ok, except for the number of PGs being too high. If I
grep for "assert" or "error" in any of the logs, nothing comes up. Also, it
is not possible to reduce the number of active MDS to 1. After issuing 'ceph
fs set fs_data max_mds 1' nothing happens.
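
A side note on the max_mds step: on Luminous, lowering max_mds does not by
itself stop an already-active extra rank; the surplus rank also has to be
deactivated explicitly. Assuming rank 1 is the one to drop, that would look
something like:

  ceph fs set fs_data max_mds 1
  ceph mds deactivate fs_data:1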

 

Cluster details are available here: https://gitlab.uni-trier.de/snippets/77 

 

The MDS log (https://gitlab.uni-trier.de/snippets/79?expanded=true=simple)
contains no "nicely exporting to" messages as usual, but
instead these:

2019-02-15 08:44:52.464926 7fdb13474700  7 mds.0.server
try_open_auth_dirfrag: not auth for [dir 0x100011ce7c6 /home/r-admin/
[2,head] rep@1.1 dir_auth=1 state=0 f(v4 m2019-02-14 13:19:41.300993
80=48+32) n(v11339 rc2019-02-14 13:19:41.300993 b10116465260
10869=10202+667) hs=7+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1
replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x564343eed100], fw
to mds.1

 

Updates from 12.2.8 to 12.2.11 I ran last week didn't help.

 

Anybody got an idea or a hint where I could look into next? Any help would
be greatly appreciated!

 

Kind regards

Christian Hennen

 

Project Manager Infrastructural Services
ZIMK University of Trier

Germany



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-18 Thread David Turner
Do you have historical data from these OSDs to see when/if the DB used on
osd.73 ever filled up?  To account for this OSD using the slow storage for
DB, all we need to do is show that it filled up the fast DB at least once.
If that happened, then something spilled over to the slow storage and has
been there ever since.
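
The current split can at least be read from the BlueFS perf counters on the
OSD host (osd.73 as in the example above):

  ceph daemon osd.73 perf dump | grep -E '"(db|slow)_(total|used)_bytes"'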

On Sat, Feb 16, 2019 at 1:50 AM Konstantin Shalygin  wrote:

> On 2/16/19 12:33 AM, David Turner wrote:
> > The answer is probably going to be in how big your DB partition is vs
> > how big your HDD disk is.  From your output it looks like you have a
> > 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used size
> > isn't currently full, I would guess that at some point since this OSD
> > was created that it did fill up and what you're seeing is the part of
> > the DB that spilled over to the data disk. This is why the official
> > recommendation (that is quite cautious, but cautious because some use
> > cases will use this up) for a blocks.db partition is 4% of the data
> > drive.  For your 6TB disks that's a recommendation of 240GB per DB
> > partition.  Of course the actual size of the DB needed is dependent on
> > your use case.  But pretty much every use case for a 6TB disk needs a
> > bigger partition than 28GB.
>
>
> My current db size of osd.33 is 7910457344 bytes, and osd.73 is
> 2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
> 6388Mbyte (6.69% of db_total_bytes).
>
> Why is osd.33 not using slow storage in this case?
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems with osd creation in Ubuntu 18.04, ceph 13.2.4-1bionic

2019-02-18 Thread Alfredo Deza
On Mon, Feb 18, 2019 at 2:46 AM Rainer Krienke  wrote:
>
> Hello,
>
> thanks for your answer, but zapping the disk did not make any
> difference. I still get the same error.  Looking at the debug output I
> found this error message that is probably the root of all trouble:
>
> # ceph-volume lvm prepare --bluestore --data /dev/sdg
> 
> stderr: 2019-02-18 08:29:25.544 7fdaa50ed240 -1
> bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid

This "unparsable uuid" line is (unfortunately) expected from
bluestore, and will show up when the OSD is being created for the
first time.

The error messaging was improved a bit (see
https://tracker.ceph.com/issues/22285 and PR
https://github.com/ceph/ceph/pull/20090 )

>
> I found the bugreport below that seems to be exactly that problem I have:
> http://tracker.ceph.com/issues/15386

This doesn't look like the same thing, you are hitting an assert:

 stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: In
function 'virtual int KernelDevice::read(uint64_t, uint64_t,
ceph::bufferlist*, IOContext*, bool)' thread 7f3fcecb3240 time
2019-02-14 13:45:54.841130
 stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: 821:
FAILED assert((uint64_t)r == len)

Which looks like a valid issue to me; you might want to go and create a
new ticket in

https://tracker.ceph.com/projects/bluestore/issues/new
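
As a side note, each failed prepare run leaves behind the VG/LV it created, so
before retrying it is worth cleaning that up, e.g. (assuming /dev/sdg as in the
output above):

  vgs | grep ceph                          # any leftover ceph-... volume groups
  ceph-volume lvm zap --destroy /dev/sdg   # removes the LV/VG and wipes the device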


>
> However there seems to be no solution  up to now.
>
> Does anyone have more information how to get around this problem?
>
> Thanks
> Rainer
>
> Am 15.02.19 um 18:12 schrieb David Turner:
> > I have found that running a zap before all prepare/create commands with
> > ceph-volume helps things run smoother.  Zap is specifically there to
> > clear everything on a disk away to make the disk ready to be used as an
> > OSD.  Your wipefs command is still fine, but then I would lvm zap the
> > disk before continuing.  I would run the commands like [1] this.  I also
> > prefer the single command lvm create as opposed to lvm prepare and lvm
> > activate.  Try that out and see if you still run into the problems
> > creating the BlueStore filesystem.
> >
> > [1] ceph-volume lvm zap /dev/sdg
> > ceph-volume lvm prepare --bluestore --data /dev/sdg
> >
> > On Thu, Feb 14, 2019 at 10:25 AM Rainer Krienke  > > wrote:
> >
> > Hi,
> >
> > I am quite new to ceph and just try to set up a ceph cluster. Initially
> > I used ceph-deploy for this but when I tried to create a BlueStore osd
> > ceph-deploy fails. Next I tried the direct way on one of the OSD-nodes
> > using ceph-volume to create the osd, but this also fails. Below you can
> > see what  ceph-volume says.
> >
> > I ensured that there was no left over lvm VG and LV on the disk sdg
> > before I started the osd creation for this disk. The very same error
> > happens also on other disks not just for /dev/sdg. All the disk have 4TB
> > in size and the linux system is Ubuntu 18.04 and finally ceph is
> > installed in version 13.2.4-1bionic from this repo:
> > https://download.ceph.com/debian-mimic.
> >
> > There is a VG and two LV's  on the system for the ubuntu system itself
> > that is installed on two separate disks configured as software raid1 and
> > lvm on top of the raid. But I cannot imagine that this might do any harm
> > to cephs osd creation.
> >
> > Does anyone have an idea what might be wrong?
> >
> > Thanks for hints
> > Rainer
> >
> > root@ceph1:~# wipefs -fa /dev/sdg
> > root@ceph1:~# ceph-volume lvm prepare --bluestore --data /dev/sdg
> > Running command: /usr/bin/ceph-authtool --gen-print-key
> > Running command: /usr/bin/ceph --cluster ceph --name
> > client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> > -i - osd new 14d041d6-0beb-4056-8df2-3920e2febce0
> > Running command: /sbin/vgcreate --force --yes
> > ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b /dev/sdg
> >  stdout: Physical volume "/dev/sdg" successfully created.
> >  stdout: Volume group "ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b"
> > successfully created
> > Running command: /sbin/lvcreate --yes -l 100%FREE -n
> > osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> > ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b
> >  stdout: Logical volume "osd-block-14d041d6-0beb-4056-8df2-3920e2febce0"
> > created.
> > Running command: /usr/bin/ceph-authtool --gen-print-key
> > Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
> > --> Absolute path not found for executable: restorecon
> > --> Ensure $PATH environment variable contains common executable
> > locations
> > Running command: /bin/chown -h ceph:ceph
> > 
> > /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> > Running command: /bin/chown -R ceph:ceph /dev/dm-8
> > Running command: /bin/ln -s
> > 
> > 

[ceph-users] Setting rados_osd_op_timeout with RGW

2019-02-18 Thread Wido den Hollander
Hi,

Has anybody ever tried, or does anybody know, how safe it is to set
'rados_osd_op_timeout' in an RGW-only situation?

Right now, if one PG becomes inactive or OSDs are super slow the RGW
will start to block at some point since the RADOS operations will never
time out.

Using rados_osd_op_timeout you can say that after X seconds the
operation needs to be aborted from the client's side.

How safe or dangerous would it be to set this on RGW? This way you would
still serve data from PGs which are available instead of having your
complete RGW go down due to one PG blocking.
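
For illustration, the setting would live in the client section used by the RGW
instance, e.g. (section name and the 30-second value are just examples):

  [client.rgw.gateway-1]
  rados_osd_op_timeout = 30
  rados_mon_op_timeout = 30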

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread M Ranga Swami Reddy
OK, sure, I will restart the ceph-mons (starting with the non-leader mons
first, and the leader node last).
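
Roughly like this, one mon host at a time (assuming the mon id is the short
hostname):

  systemctl restart ceph-mon@$(hostname -s)
  ceph quorum_status | grep quorum_leader_name   # confirm quorum and who is leading
  ceph -s                                        # wait for things to settle before the next one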


On Mon, Feb 18, 2019 at 4:59 PM Dan van der Ster  wrote:
>
> Not really.
>
> You should just restart your mons though -- if done one at a time it
> has zero impact on your clients.
>
> -- dan
>
>
> On Mon, Feb 18, 2019 at 12:11 PM M Ranga Swami Reddy
>  wrote:
> >
> > Hi Sage - If the mon data increases, does this impact the ceph cluster
> > performance (ie on ceph osd bench, etc)?
> >
> > On Fri, Feb 15, 2019 at 3:13 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > today I again hit the warn with 30G also...
> > >
> > > On Thu, Feb 14, 2019 at 7:39 PM Sage Weil  wrote:
> > > >
> > > > On Thu, 7 Feb 2019, Dan van der Ster wrote:
> > > > > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
> > > > >  wrote:
> > > > > >
> > > > > > Hi Dan,
> > > > > > >During backfilling scenarios, the mons keep old maps and grow quite
> > > > > > >quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > > > >awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > > > >But the intended behavior is that once the PGs are all 
> > > > > > >active+clean,
> > > > > > >the old maps should be trimmed and the disk space freed.
> > > > > >
> > > > > > old maps not trimmed after cluster reached to "all+clean" state for 
> > > > > > all PGs.
> > > > > > Is there (known) bug here?
> > > > > > As the size of dB showing > 15G, do I need to run the compact 
> > > > > > commands
> > > > > > to do the trimming?
> > > > >
> > > > > Compaction isn't necessary -- you should only need to restart all
> > > > > peon's then the leader. A few minutes later the db's should start
> > > > > trimming.
> > > >
> > > > The next time someone sees this behavior, can you please
> > > >
> > > > - enable debug_mon = 20 on all mons (*before* restarting)
> > > >ceph tell mon.* injectargs '--debug-mon 20'
> > > > - wait for 10 minutes or so to generate some logs
> > > > - add 'debug mon = 20' to ceph.conf (on mons only)
> > > > - restart the monitors
> > > > - wait for them to start trimming
> > > > - remove 'debug mon = 20' from ceph.conf (on mons only)
> > > > - tar up the log files, ceph-post-file them, and share them with ticket
> > > > http://tracker.ceph.com/issues/38322
> > > >
> > > > Thanks!
> > > > sage
> > > >
> > > >
> > > >
> > > >
> > > > > -- dan
> > > > >
> > > > >
> > > > > >
> > > > > > Thanks
> > > > > > Swami
> > > > > >
> > > > > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster 
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > With HEALTH_OK a mon data dir should be under 2GB for even such a 
> > > > > > > large cluster.
> > > > > > >
> > > > > > > During backfilling scenarios, the mons keep old maps and grow 
> > > > > > > quite
> > > > > > > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > > > > awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > > > > But the intended behavior is that once the PGs are all 
> > > > > > > active+clean,
> > > > > > > the old maps should be trimmed and the disk space freed.
> > > > > > >
> > > > > > > However, several people have noted that (at least in luminous
> > > > > > > releases) the old maps are not trimmed until after HEALTH_OK 
> > > > > > > *and* all
> > > > > > > mons are restarted. This ticket seems related:
> > > > > > > http://tracker.ceph.com/issues/37875
> > > > > > >
> > > > > > > (Over here we're restarting mons every ~2-3 weeks, resulting in 
> > > > > > > the
> > > > > > > mon stores dropping from >15GB to ~700MB each time).
> > > > > > >
> > > > > > > -- Dan
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Swami
> > > > > > > >
> > > > > > > > The limit is somewhat arbitrary, based on cluster sizes we had 
> > > > > > > > seen when
> > > > > > > > we picked it.  In your case it should be perfectly safe to 
> > > > > > > > increase it.
> > > > > > > >
> > > > > > > > sage
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > > > > > > >
> > > > > > > > > Hello -  Are the any limits for mon_data_size for cluster 
> > > > > > > > > with 2PB
> > > > > > > > > (with 2000+ OSDs)?
> > > > > > > > >
> > > > > > > > > Currently it set as 15G. What is logic behind this? Can we 
> > > > > > > > > increase
> > > > > > > > > when we get the mon_data_size_warn messages?
> > > > > > > > >
> > > > > > > > > I am getting the mon_data_size_warn message even though there 
> > > > > > > > > a ample
> > > > > > > > > of free space on the disk (around 300G free disk)
> > > > > > > > >
> > > > > > > > > Earlier thread on the same discusion:
> > > > > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Swami
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > 

Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread Dan van der Ster
Not really.

You should just restart your mons though -- if done one at a time it
has zero impact on your clients.
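
To watch the effect, check the store size on each mon host before and after
(the path assumes the default cluster name and the mon id being the short
hostname):

  du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db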

-- dan


On Mon, Feb 18, 2019 at 12:11 PM M Ranga Swami Reddy
 wrote:
>
> Hi Sage - If the mon data increases, does this impact the ceph cluster
> performance (ie on ceph osd bench, etc)?
>
> On Fri, Feb 15, 2019 at 3:13 PM M Ranga Swami Reddy
>  wrote:
> >
> > today I again hit the warn with 30G also...
> >
> > On Thu, Feb 14, 2019 at 7:39 PM Sage Weil  wrote:
> > >
> > > On Thu, 7 Feb 2019, Dan van der Ster wrote:
> > > > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
> > > >  wrote:
> > > > >
> > > > > Hi Dan,
> > > > > >During backfilling scenarios, the mons keep old maps and grow quite
> > > > > >quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > > >awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > > >But the intended behavior is that once the PGs are all active+clean,
> > > > > >the old maps should be trimmed and the disk space freed.
> > > > >
> > > > > old maps not trimmed after cluster reached to "all+clean" state for 
> > > > > all PGs.
> > > > > Is there (known) bug here?
> > > > > As the size of dB showing > 15G, do I need to run the compact commands
> > > > > to do the trimming?
> > > >
> > > > Compaction isn't necessary -- you should only need to restart all
> > > > peon's then the leader. A few minutes later the db's should start
> > > > trimming.
> > >
> > > The next time someone sees this behavior, can you please
> > >
> > > - enable debug_mon = 20 on all mons (*before* restarting)
> > >ceph tell mon.* injectargs '--debug-mon 20'
> > > - wait for 10 minutes or so to generate some logs
> > > - add 'debug mon = 20' to ceph.conf (on mons only)
> > > - restart the monitors
> > > - wait for them to start trimming
> > > - remove 'debug mon = 20' from ceph.conf (on mons only)
> > > - tar up the log files, ceph-post-file them, and share them with ticket
> > > http://tracker.ceph.com/issues/38322
> > >
> > > Thanks!
> > > sage
> > >
> > >
> > >
> > >
> > > > -- dan
> > > >
> > > >
> > > > >
> > > > > Thanks
> > > > > Swami
> > > > >
> > > > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
> > > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > With HEALTH_OK a mon data dir should be under 2GB for even such a 
> > > > > > large cluster.
> > > > > >
> > > > > > During backfilling scenarios, the mons keep old maps and grow quite
> > > > > > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > > > awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > > > But the intended behavior is that once the PGs are all active+clean,
> > > > > > the old maps should be trimmed and the disk space freed.
> > > > > >
> > > > > > However, several people have noted that (at least in luminous
> > > > > > releases) the old maps are not trimmed until after HEALTH_OK *and* 
> > > > > > all
> > > > > > mons are restarted. This ticket seems related:
> > > > > > http://tracker.ceph.com/issues/37875
> > > > > >
> > > > > > (Over here we're restarting mons every ~2-3 weeks, resulting in the
> > > > > > mon stores dropping from >15GB to ~700MB each time).
> > > > > >
> > > > > > -- Dan
> > > > > >
> > > > > >
> > > > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
> > > > > > >
> > > > > > > Hi Swami
> > > > > > >
> > > > > > > The limit is somewhat arbitrary, based on cluster sizes we had 
> > > > > > > seen when
> > > > > > > we picked it.  In your case it should be perfectly safe to 
> > > > > > > increase it.
> > > > > > >
> > > > > > > sage
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > > > > > >
> > > > > > > > Hello -  Are the any limits for mon_data_size for cluster with 
> > > > > > > > 2PB
> > > > > > > > (with 2000+ OSDs)?
> > > > > > > >
> > > > > > > > Currently it set as 15G. What is logic behind this? Can we 
> > > > > > > > increase
> > > > > > > > when we get the mon_data_size_warn messages?
> > > > > > > >
> > > > > > > > I am getting the mon_data_size_warn message even though there a 
> > > > > > > > ample
> > > > > > > > of free space on the disk (around 300G free disk)
> > > > > > > >
> > > > > > > > Earlier thread on the same discusion:
> > > > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Swami
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > ___
> > > > > > > ceph-users mailing list
> > > > > > > ceph-users@lists.ceph.com
> > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > >
> > > >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread Dan van der Ster
On Thu, Feb 14, 2019 at 2:31 PM Sage Weil  wrote:
>
> On Thu, 7 Feb 2019, Dan van der Ster wrote:
> > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > Hi Dan,
> > > >During backfilling scenarios, the mons keep old maps and grow quite
> > > >quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > >awhile, the mon stores will eventually trigger that 15GB alarm.
> > > >But the intended behavior is that once the PGs are all active+clean,
> > > >the old maps should be trimmed and the disk space freed.
> > >
> > > old maps not trimmed after cluster reached to "all+clean" state for all 
> > > PGs.
> > > Is there (known) bug here?
> > > As the size of dB showing > 15G, do I need to run the compact commands
> > > to do the trimming?
> >
> > Compaction isn't necessary -- you should only need to restart all
> > peon's then the leader. A few minutes later the db's should start
> > trimming.
>
> The next time someone sees this behavior, can you please
>
> - enable debug_mon = 20 on all mons (*before* restarting)
>ceph tell mon.* injectargs '--debug-mon 20'
> - wait for 10 minutes or so to generate some logs
> - add 'debug mon = 20' to ceph.conf (on mons only)
> - restart the monitors
> - wait for them to start trimming
> - remove 'debug mon = 20' from ceph.conf (on mons only)
> - tar up the log files, ceph-post-file them, and share them with ticket
> http://tracker.ceph.com/issues/38322
>

Not sure if you noticed, but we sent some logs Friday.

-- dan

> Thanks!
> sage
>
>
>
>
> > -- dan
> >
> >
> > >
> > > Thanks
> > > Swami
> > >
> > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > With HEALTH_OK a mon data dir should be under 2GB for even such a large 
> > > > cluster.
> > > >
> > > > During backfilling scenarios, the mons keep old maps and grow quite
> > > > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > But the intended behavior is that once the PGs are all active+clean,
> > > > the old maps should be trimmed and the disk space freed.
> > > >
> > > > However, several people have noted that (at least in luminous
> > > > releases) the old maps are not trimmed until after HEALTH_OK *and* all
> > > > mons are restarted. This ticket seems related:
> > > > http://tracker.ceph.com/issues/37875
> > > >
> > > > (Over here we're restarting mons every ~2-3 weeks, resulting in the
> > > > mon stores dropping from >15GB to ~700MB each time).
> > > >
> > > > -- Dan
> > > >
> > > >
> > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
> > > > >
> > > > > Hi Swami
> > > > >
> > > > > The limit is somewhat arbitrary, based on cluster sizes we had seen 
> > > > > when
> > > > > we picked it.  In your case it should be perfectly safe to increase 
> > > > > it.
> > > > >
> > > > > sage
> > > > >
> > > > >
> > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > > > >
> > > > > > Hello -  Are the any limits for mon_data_size for cluster with 2PB
> > > > > > (with 2000+ OSDs)?
> > > > > >
> > > > > > Currently it set as 15G. What is logic behind this? Can we increase
> > > > > > when we get the mon_data_size_warn messages?
> > > > > >
> > > > > > I am getting the mon_data_size_warn message even though there a 
> > > > > > ample
> > > > > > of free space on the disk (around 300G free disk)
> > > > > >
> > > > > > Earlier thread on the same discusion:
> > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > > > >
> > > > > > Thanks
> > > > > > Swami
> > > > > >
> > > > > >
> > > > > >
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-18 Thread M Ranga Swami Reddy
Hi Sage - If the mon data increases, does this impact the ceph cluster
performance (ie on ceph osd bench, etc)?

On Fri, Feb 15, 2019 at 3:13 PM M Ranga Swami Reddy
 wrote:
>
> today I again hit the warn with 30G also...
>
> On Thu, Feb 14, 2019 at 7:39 PM Sage Weil  wrote:
> >
> > On Thu, 7 Feb 2019, Dan van der Ster wrote:
> > > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
> > >  wrote:
> > > >
> > > > Hi Dan,
> > > > >During backfilling scenarios, the mons keep old maps and grow quite
> > > > >quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > >awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > >But the intended behavior is that once the PGs are all active+clean,
> > > > >the old maps should be trimmed and the disk space freed.
> > > >
> > > > old maps not trimmed after cluster reached to "all+clean" state for all 
> > > > PGs.
> > > > Is there (known) bug here?
> > > > As the size of dB showing > 15G, do I need to run the compact commands
> > > > to do the trimming?
> > >
> > > Compaction isn't necessary -- you should only need to restart all
> > > peon's then the leader. A few minutes later the db's should start
> > > trimming.
> >
> > The next time someone sees this behavior, can you please
> >
> > - enable debug_mon = 20 on all mons (*before* restarting)
> >ceph tell mon.* injectargs '--debug-mon 20'
> > - wait for 10 minutes or so to generate some logs
> > - add 'debug mon = 20' to ceph.conf (on mons only)
> > - restart the monitors
> > - wait for them to start trimming
> > - remove 'debug mon = 20' from ceph.conf (on mons only)
> > - tar up the log files, ceph-post-file them, and share them with ticket
> > http://tracker.ceph.com/issues/38322
> >
> > Thanks!
> > sage
> >
> >
> >
> >
> > > -- dan
> > >
> > >
> > > >
> > > > Thanks
> > > > Swami
> > > >
> > > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > With HEALTH_OK a mon data dir should be under 2GB for even such a 
> > > > > large cluster.
> > > > >
> > > > > During backfilling scenarios, the mons keep old maps and grow quite
> > > > > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > > awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > > But the intended behavior is that once the PGs are all active+clean,
> > > > > the old maps should be trimmed and the disk space freed.
> > > > >
> > > > > However, several people have noted that (at least in luminous
> > > > > releases) the old maps are not trimmed until after HEALTH_OK *and* all
> > > > > mons are restarted. This ticket seems related:
> > > > > http://tracker.ceph.com/issues/37875
> > > > >
> > > > > (Over here we're restarting mons every ~2-3 weeks, resulting in the
> > > > > mon stores dropping from >15GB to ~700MB each time).
> > > > >
> > > > > -- Dan
> > > > >
> > > > >
> > > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
> > > > > >
> > > > > > Hi Swami
> > > > > >
> > > > > > The limit is somewhat arbitrary, based on cluster sizes we had seen 
> > > > > > when
> > > > > > we picked it.  In your case it should be perfectly safe to increase 
> > > > > > it.
> > > > > >
> > > > > > sage
> > > > > >
> > > > > >
> > > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > > > > >
> > > > > > > Hello -  Are the any limits for mon_data_size for cluster with 2PB
> > > > > > > (with 2000+ OSDs)?
> > > > > > >
> > > > > > > Currently it set as 15G. What is logic behind this? Can we 
> > > > > > > increase
> > > > > > > when we get the mon_data_size_warn messages?
> > > > > > >
> > > > > > > I am getting the mon_data_size_warn message even though there a 
> > > > > > > ample
> > > > > > > of free space on the disk (around 300G free disk)
> > > > > > >
> > > > > > > Earlier thread on the same discusion:
> > > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > > > > >
> > > > > > > Thanks
> > > > > > > Swami
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > ___
> > > > > > ceph-users mailing list
> > > > > > ceph-users@lists.ceph.com
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore increased disk usage

2019-02-18 Thread Jan Kasprzak
Jakub Jaszewski wrote:
: Hi Yenya,
: 
: I guess Ceph adds the size of all  your data.db devices to the cluster
: total used space.

Jakub,

thanks for the hint. The disk usage increase almost corresponds
to that - I have added about 7.5 TB of data.db devices with the last
batch of OSDs.
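
For anyone comparing the numbers themselves, the pool-level vs raw view is
visible with:

  ceph df           # per-pool USED vs cluster-wide RAW USED
  ceph osd df tree  # per-OSD size/use breakdown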

Sincerely,

-Yenya

: pt., 8 lut 2019, 10:11: Jan Kasprzak  napisał(a):
: 
: > Hello, ceph users,
: >
: > I moved my cluster to bluestore (Ceph Mimic), and now I see the increased
: > disk usage. From ceph -s:
: >
: > pools:   8 pools, 3328 pgs
: > objects: 1.23 M objects, 4.6 TiB
: > usage:   23 TiB used, 444 TiB / 467 TiB avail
: >
: > I use 3-way replication of my data, so I would expect the disk usage
: > to be around 14 TiB. Which was true when I used filestore-based Luminous
: > OSDs
: > before. Why the disk usage now is 23 TiB?
: >
: > If I remember it correctly (a big if!), the disk usage was about the same
: > when I originally moved the data to empty bluestore OSDs by changing the
: > crush rule, but went up after I have added more bluestore OSDs and the
: > cluster
: > rebalanced itself.
: >
: > Could it be some miscalculation of free space in bluestore? Also, could it
: > be
: > related to the HEALTH_ERR backfill_toofull problem discused here in the
: > other
: > thread?
: >
: > Thanks,
: >
: > -Yenya
: >
: > --
: > | Jan "Yenya" Kasprzak 
: > |
: > | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5
: > |
: >  This is the world we live in: the way to deal with computers is to google
: >  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
: > ___
: > ceph-users mailing list
: > ceph-users@lists.ceph.com
: > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
: >

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding EC properties for CephFS / small files.

2019-02-18 Thread Paul Emmerich
Inline data is officially an experimental feature. I know of a
production cluster that's running with inline data enabled, no
problems so far (but it was only enabled two months ago or so).

You can reduce the bluestore min alloc size; it's only 16kb for SSDs
by default. But the main overhead will then be the metadata required
for every object.
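
For example (values are illustrative; these only apply to OSDs created after
the change, existing OSDs keep the allocation size they were built with):

  [osd]
  bluestore_min_alloc_size_ssd = 4096
  bluestore_min_alloc_size_hdd = 4096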


Paul



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Feb 18, 2019 at 7:06 AM  wrote:
>
> Hi Paul.
>
> Thanks for you comments.
>
> > For your examples:
> >
> > 16 MB file -> 4x 4 MB objects -> 4x 4x 1 MB data chunks, 4x 2x 1 MB
> > coding chunks
> >
> > 512 kB file -> 1x 512 kB object -> 4x 128 kB data chunks, 2x 128 kb
> > coding chunks
> >
> >
> > You'll run into different problems once the erasure coded chunks end
> > up being smaller than 64kb each due to bluestore min allocation sizes
> > and general metadata overhead making erasure coding a bad fit for very
> > small files.
>
> Thanks for the clarification, which makes this a "very bad fit" for CephFS:
>
> # find . -type f -print0 | xargs -0 stat | grep Size | perl -ane '/Size:
> (\d+)/; print $1 . "\n";' | ministat -n
> x 
>            N    Min            Max    Median         Avg      Stddev
> x   12651568      0  1.0840049e+11      9036   2217611.6    32397960
>
> Gives me 6.3M files < 9036 bytes in size, that'll be stored as 6 x 64KB at
> the bluestore
> level if I understand it correctly.
>
> We come from an xfs world where the default blocksize is 4K, so the above
> situation worked quite nicely. Guess I would probably be way better off with
> an RBD with xfs on top to solve this case using Ceph.
>
> Is it fair to summarize your input as:
>
> In an EC4+2 configuration, minimal used space is 256KB + 128KB (coding)
> regardless of file size.
> In an EC8+3 configuration, minimal used space is 512KB + 192KB (coding)
> regardless of file size.
>
> And for the access side:
> All access to files in an EC pool requires, as a minimum, IO requests to
> k shards before the first
> bytes can be returned; with fast_read it becomes k+n, but it returns once k
> have responded.
>
> Any experience with inlining data on the MDS - that would obviously help
> here I guess.
>
> Thanks.
>
> --
> Jesper
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com