Re: [ceph-users] Ceph nautilus upgrade problem

2019-04-02 Thread Stefan Kooman
Quoting Paul Emmerich (paul.emmer...@croit.io):
> This also happened sometimes during a Luminous -> Mimic upgrade due to
> a bug in Luminous; however I thought it was fixed on the ceph-mgr
> side.
> Maybe the fix was (also) required in the OSDs and you are seeing this
> because the running OSDs have that bug?
> 
> Anyways, it's harmless and you can ignore it.

Ah, so it's merely "cosmetic" rather than those PGs really being inactive.
Because that would *freak me out* if I were doing an upgrade.

Thanks for the clarification.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] op_w_latency

2019-04-02 Thread Glen Baars
Thanks for the updated command – much cleaner!

The OSD nodes have a single 6-core X5650 @ 2.67GHz, 72 GB of RAM, and around 8 x 10TB 
HDD OSDs / 4 x 2TB SSD OSDs. CPU usage is around 20% and 22 GB of RAM is still 
available.
The 3 MON nodes are the same but with no OSDs.
The cluster has around 150 drives and is only doing 500-1000 ops overall.
The network is dual 10Gbit using LACP, with a VLAN for private Ceph traffic and 
untagged for public.

Glen
From: Konstantin Shalygin 
Sent: Wednesday, 3 April 2019 11:39 AM
To: Glen Baars 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] op_w_latency


Hello Ceph Users,



I am finding that the write latency across my ceph clusters isn't great and I 
wanted to see what other people are getting for op_w_latency. Generally I am 
getting 70-110ms latency.



I am using: ceph --admin-daemon /var/run/ceph/ceph-osd.102.asok perf dump | 
grep -A3 '\"op_w_latency' | grep 'avgtime'

Better like this:

ceph daemon osd.102 perf dump | jq '.osd.op_w_latency.avgtime'



Ram, CPU and network don't seem to be the bottleneck. The drives are behind a 
dell H810p raid card with a 1GB writeback cache and battery. I have tried with 
LSI JBOD cards and haven't found it faster ( as you would expect with write 
cache ). The disks through iostat -xyz 1 show 10-30% usage with general service 
+ write latency around 3-4ms. Queue depth is normally less than one. RocksDB 
write latency is around 0.6ms, read 1-2ms. Usage is RBD backend for Cloudstack.


What is your hardware? Your CPU, RAM, Eth?





k



Re: [ceph-users] op_w_latency

2019-04-02 Thread Konstantin Shalygin

Hello Ceph Users,

I am finding that the write latency across my ceph clusters isn't great and I 
wanted to see what other people are getting for op_w_latency. Generally I am 
getting 70-110ms latency.

I am using: ceph --admin-daemon /var/run/ceph/ceph-osd.102.asok perf dump | grep -A3 
'\"op_w_latency' | grep 'avgtime'


Better like this:

ceph daemon osd.102 perf dump | jq '.osd.op_w_latency.avgtime'
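
For comparing several OSDs at once, a minimal sketch of mine (it assumes the default
admin socket paths under /var/run/ceph and that jq is installed):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=$(basename "$sock" .asok)                 # e.g. ceph-osd.102
    lat=$(ceph --admin-daemon "$sock" perf dump | jq '.osd.op_w_latency.avgtime')
    printf '%s op_w_latency avgtime: %s s\n' "$id" "$lat"
done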


Ram, CPU and network don't seem to be the bottleneck. The drives are behind a 
dell H810p raid card with a 1GB writeback cache and battery. I have tried with 
LSI JBOD cards and haven't found it faster ( as you would expect with write 
cache ). The disks through iostat -xyz 1 show 10-30% usage with general service 
+ write latency around 3-4ms. Queue depth is normally less than one. RocksDB 
write latency is around 0.6ms, read 1-2ms. Usage is RBD backend for Cloudstack.


What is your hardware? Your CPU, RAM, Eth?



k



Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat

2019-04-02 Thread Yan, Zheng
Looks like http://tracker.ceph.com/issues/37399. which version of
ceph-mds do you use?

On Tue, Apr 2, 2019 at 7:47 AM Sergey Malinin  wrote:
>
> These steps pretty well correspond to 
> http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> Were you able to replay journal manually with no issues? IIRC, 
> "cephfs-journal-tool recover_dentries" would lead to OOM in case of MDS doing 
> so, and it has already been discussed on this list.
>
>
> April 2, 2019 1:37 AM, "Pickett, Neale T"  wrote:
>
> Here is what I wound up doing to fix this:
>
> Bring down all MDSes so they stop flapping
> Back up journal (as seen in previous message)
> Apply journal manually
> Reset journal manually
> Clear session table
> Clear other tables (not sure I needed to do this)
> Mark FS down
> Mark the rank 0 MDS as failed
> Reset the FS (yes, I really mean it)
> Restart MDSes
> Finally get some sleep
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding failure domain (again)

2019-04-02 Thread Christian Balzer
On Tue, 2 Apr 2019 19:04:28 +0900 Hector Martin wrote:

> On 02/04/2019 18.27, Christian Balzer wrote:
> > I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
> > pool with 1024 PGs.  
> 
> (20 choose 2) is 190, so you're never going to have more than that many 
> unique sets of OSDs.
> 
And this is why one shouldn't send mails when in a rush, w/o fully grokking
the math one was just given. 
Thanks for setting me straight. 

> I just looked at the OSD distribution for a replica 3 pool across 48 
> OSDs with 4096 PGs that I have and the result is reasonable. There are 
> 3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this 
> is a random process, due to the birthday paradox, some duplicates are 
> expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs 
> having 3782 unique choices seems to pass the gut feeling test. Too lazy 
> to do the math closed form, but here's a quick simulation:
> 
>  >>> len(set(random.randrange(17296) for i in range(4096)))  
> 3671
> 
> So I'm actually slightly ahead.
> 
> At the numbers in my previous example (1500 OSDs, 50k pool PGs), 
> statistically you should get something like ~3 collisions on average, so 
> negligible.
> 
Sounds promising. 

> > Another thing to look at here is of course critical period and disk
> > failure probabilities, these guys explain the logic behind their
> > calculator, would be delighted if you could have a peek and comment.
> > 
> > https://www.memset.com/support/resources/raid-calculator/  
> 
> I'll take a look tonight :)
> 
Thanks, a look at the Backblaze disk failure rates (picking the worst
ones) gives a good insight into real life probabilities, too.
https://www.backblaze.com/blog/hard-drive-stats-for-2018/
If we go with 2%/year, that's an average of one failure every 12 days.
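(Spelling that arithmetic out, under my assumption that it refers to the 1500-OSD
example used earlier in this thread:
\frac{365\ \text{days/yr}}{1500 \times 0.02\ \text{failures/yr}} \approx 12\ \text{days between failures}.)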

Aside from how likely the actual failure rate is, another concern of
course is extended periods of the cluster being unhealthy: with certain
versions there was that "mon map will grow indefinitely" issue, and other,
more subtle ones might still lurk.

Christian
> -- 
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] Ceph nautilus upgrade problem

2019-04-02 Thread Paul Emmerich
This also happened sometimes during a Luminous -> Mimic upgrade due to
a bug in Luminous; however I thought it was fixed on the ceph-mgr
side.
Maybe the fix was (also) required in the OSDs and you are seeing this
because the running OSDs have that bug?

Anyways, it's harmless and you can ignore it.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Apr 2, 2019 at 7:49 PM Jan-Willem Michels  wrote:
>
> Op 2-4-2019 om 12:16 schreef Stefan Kooman:
> > Quoting Stadsnet (jwil...@stads.net):
> >> On 26-3-2019 16:39, Ashley Merrick wrote:
> >>> Have you upgraded any OSD's?
> >>
> >> No didn't go through with the osd's
> > Just checking here: are your sure all PGs have been scrubbed while
> > running Luminous? As the release notes [1] mention this:
> >
> > "If you are unsure whether or not your Luminous cluster has completed a
> > full scrub of all PGs, you can check your clusters state by running:
> >
> > # ceph osd dump | grep ^flags
> >
> > In order to be able to proceed to Nautilus, your OSD map must include
> > the recovery_deletes and purged_snapdirs flags."
>
> Yes I did check that.
>
> No everything went fine, exactly as Ashley predicted
>
> "On a test cluster I saw the same and as I upgraded / restarted the
> OSD's the PG's started to show online till it was 100%."
>
> So I upgraded the first osd, and exactly that amount of percentage of
> OSD's became active.
> And every server the same percentage was added.
> And then finaly, with the last one I got 100% active.
>
> So went without problems.
> But it looked a bit uggly that's why I asked.
>
> And the new Nautilus versions is really a big plus in almost every way.
>
> Sorry for not getting back how it went. I was not sure if I should
> bother the mailing list.
>
> Thanks for your time.
>
>
> >
> > Gr. Stefan
> >
> > [1]:
> > http://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous
> >
> > P.s. I expect most users upgrade to Mimic first, then go to Nautilus.
> > It might be a better tested upgrade path ...
> >
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph nautilus upgrade problem

2019-04-02 Thread Jan-Willem Michels

Op 2-4-2019 om 12:16 schreef Stefan Kooman:

Quoting Stadsnet (jwil...@stads.net):

On 26-3-2019 16:39, Ashley Merrick wrote:

Have you upgraded any OSD's?


No didn't go through with the osd's

Just checking here: are you sure all PGs have been scrubbed while
running Luminous? As the release notes [1] mention this:

"If you are unsure whether or not your Luminous cluster has completed a
full scrub of all PGs, you can check your clusters state by running:

# ceph osd dump | grep ^flags

In order to be able to proceed to Nautilus, your OSD map must include
the recovery_deletes and purged_snapdirs flags."


Yes I did check that.

No, everything went fine, exactly as Ashley predicted:

"On a test cluster I saw the same and as I upgraded / restarted the 
OSD's the PG's started to show online till it was 100%."


So I upgraded the first OSD, and exactly that percentage of PGs became active.

And with every server the same percentage was added.
And then finally, with the last one I got 100% active.

So it went without problems.
But it looked a bit ugly, that's why I asked.

And the new Nautilus version is really a big plus in almost every way.

Sorry for not getting back about how it went. I was not sure if I should 
bother the mailing list.


Thanks for your time.




Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous

P.s. I expect most users upgrade to Mimic first, then go to Nautilus.
It might be a better tested upgrade path ...






Re: [ceph-users] inline_data (was: CephFS and many small files)

2019-04-02 Thread Yan, Zheng
On Tue, Apr 2, 2019 at 9:10 PM Paul Emmerich  wrote:
>
> On Tue, Apr 2, 2019 at 3:05 PM Yan, Zheng  wrote:
> >
> > On Tue, Apr 2, 2019 at 8:23 PM Clausen, Jörn  wrote:
> > >
> > > Hi!
> > >
> > > Am 29.03.2019 um 23:56 schrieb Paul Emmerich:
> > > > There's also some metadata overhead etc. You might want to consider
> > > > enabling inline data in cephfs to handle small files in a
> > > > store-efficient way (note that this feature is officially marked as
> > > > experimental, though).
> > > > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data
> > >
> > > Is there something missing from the documentation? I have turned on this
> > > feature:
> > >
> >
> > I don't use this feature.  We don't have plan to mark this feature
> > stable. (probably we will remove this feature in the furthure).
>
> We also don't use this feature in any of our production clusters
> (because it's marked experimental).
>
> But it seems like a really useful feature and I know of at least one
> real-world production cluster using this with great success...
> So why remove it?
>

With inline data the MDS needs to serve both data and metadata requests. It
only suits small amounts of data.

>
> Paul
>
> >
> > Yan, Zheng
> >
> >
> >
> > > $ ceph fs dump | grep inline_data
> > > dumped fsmap epoch 1224
> > > inline_data enabled
> > >
> > > I have reduced the size of the bonnie-generated files to 1 byte. But
> > > this is the situation halfway into the test: (output slightly shortened)
> > >
> > > $ rados df
> > > POOL_NAME  USED OBJECTS CLONES   COPIES
> > > fs-data 3.2 MiB 3390041  0 10170123
> > > fs-metadata 772 MiB2249  0 6747
> > >
> > > total_objects3392290
> > > total_used   643 GiB
> > > total_avail  957 GiB
> > > total_space  1.6 TiB
> > >
> > > i.e. bonnie has created a little over 3 million files, for which the
> > > same number of objects was created in the data pool. So the raw usage is
> > > again at more than 500 GB.
> > >
> > > If the data was inlined, I would expect far less objects in the data
> > > pool - actually none at all - and maybe some more usage in the metadata
> > > pool.
> > >
> > > Do I have to restart any daemons after turning on inline_data? Am I
> > > missing anything else here?
> > >
> > > For the record:
> > >
> > > $ ceph versions
> > > {
> > >  "mon": {
> > >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > > nautilus (stable)": 3
> > >  },
> > >  "mgr": {
> > >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > > nautilus (stable)": 3
> > >  },
> > >  "osd": {
> > >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > > nautilus (stable)": 16
> > >  },
> > >  "mds": {
> > >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > > nautilus (stable)": 2
> > >  },
> > >  "overall": {
> > >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > > nautilus (stable)": 24
> > >  }
> > > }
> > >
> > > --
> > > Jörn Clausen
> > > Daten- und Rechenzentrum
> > > GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
> > > Düsternbrookerweg 20
> > > 24105 Kiel
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inline_data (was: CephFS and many small files)

2019-04-02 Thread Jonas Jelten
On 02/04/2019 15.05, Yan, Zheng wrote:
> I don't use this feature.  We don't have plan to mark this feature
> stable. (probably we will remove this feature in the furthure).

Oh no! We have activated inline_data since our cluster does have lots of small 
files (but also big ones), and
performance increased significantly, especially when deleting those small files 
on CephFS.

I hope inline_data is not removed, but improved (or at least marked stable) 
instead!

Of course, if CephFS can handle small files with the same performance after 
inline_data was removed, that would also be
okay. But stepping back would hurt :)


-- Jonas



Re: [ceph-users] CephFS and many small files

2019-04-02 Thread Frédéric Nass

Hello,

I haven't had any issues either with a 4k allocation size, in a cluster 
holding 358M objects for 116TB (237TB raw) and 2.264B chunks/replicas.

This is an average of 324k per object and 12.6M chunks/replicas per OSD, 
with RocksDB sizes going from 12.1GB to 21.14GB depending on how many PGs 
the OSDs have.
RocksDB sizes will come down as we add more OSDs to the cluster by the end 
of this year.

We've seen a huge latency improvement by moving OSDs to BlueStore. 
Filestore (XFS) wouldn't operate well anymore with over 10M files, even 
with a negligible fragmentation factor and 8/40 split/merge thresholds.


Frédéric.

Le 01/04/2019 à 14:47, Sergey Malinin a écrit :

I haven't had any issues with 4k allocation size in cluster holding 189M files.

April 1, 2019 2:04 PM, "Paul Emmerich"  wrote:


I'm not sure about the real-world impacts of a lower min alloc size or
the rationale behind the default values for HDDs (64kb) and SSDs (16kb).

Paul
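
For anyone who wants to experiment with a lower value, a hedged sketch (my
understanding is that min_alloc_size is baked in when an OSD is created, so it has to
be in ceph.conf before the OSD is built and does not change existing OSDs):

# ceph.conf on the OSD node, before creating the OSD
[osd]
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096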








Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Eugen Block

Sorry -- you need the "<image-spec>" as part of that command.


My bad, I only read this from the help page, ignoring the <image-spec>  
(and forgot the pool name):

  -a [ --all ] list snapshots from all namespaces

I figured this would list all existing snapshots, similar to the "rbd  
-p <pool> ls --long" command. Thanks for the clarification.


Eugen


Zitat von Jason Dillaman :


On Tue, Apr 2, 2019 at 8:42 AM Eugen Block  wrote:


Hi,

> If you run "rbd snap ls --all", you should see a snapshot in
> the "trash" namespace.

I just tried the command "rbd snap ls --all" on a lab cluster
(nautilus) and get this error:

ceph-2:~ # rbd snap ls --all
rbd: image name was not specified


Sorry -- you need the "<image-spec>" as part of that command.


Are there any requirements I haven't noticed? This lab cluster was
upgraded from Mimic a couple of weeks ago.

ceph-2:~ # ceph version
ceph version 14.1.0-559-gf1a72cff25
(f1a72cff2522833d16ff057ed43eeaddfc17ea8a) nautilus (dev)

Regards,
Eugen


Zitat von Jason Dillaman :

> On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich
>  wrote:
>>
>> Hi,
>>
>> on one of my clusters, I'm getting error message which is getting
>> me a bit nervous.. while listing contents of a pool I'm getting
>> error for one of images:
>>
>> [root@node1 ~]# rbd ls -l nvme > /dev/null
>> rbd: error processing image  xxx: (2) No such file or directory
>>
>> [root@node1 ~]# rbd info nvme/xxx
>> rbd image 'xxx':
>> size 60 GiB in 15360 objects
>> order 22 (4 MiB objects)
>> id: 132773d6deb56
>> block_name_prefix: rbd_data.132773d6deb56
>> format: 2
>> features: layering, operations
>> op_features: snap-trash
>> flags:
>> create_timestamp: Wed Aug 29 12:25:13 2018
>>
>> volume contains production data and seems to be working  
correctly (it's used

>> by VM)
>>
>> is this something to worry about? What is snap-trash feature?
>> wasn't able to google
>> much about it..
>
> This implies that you are (or were) using transparent image clones and
> that you deleted a snapshot that had one or more child images attached
> to it. If you run "rbd snap ls --all", you should see a snapshot in
> the "trash" namespace. You can also list its child images by running
> "rbd children --snap-id  ".
>
> There definitely is an issue w/ the "rbd ls --long" command in that
> when it attempts to list all snapshots in the image, it is incorrectly
> using the snapshot's name instead of it's ID. I've opened a tracker
> ticket to get the bug fixed [1]. It was fixed in Nautilus but it
> wasn't flagged for backport to Mimic.
>
>> I'm running ceph 13.2.4 on centos 7.
>>
>> I'd be gratefull any help
>>
>> BR
>>
>> nik
>>
>>
>> --
>> -
>> Ing. Nikola CIPRICH
>> LinuxBox.cz, s.r.o.
>> 28.rijna 168, 709 00 Ostrava
>>
>> tel.:   +420 591 166 214
>> fax:+420 596 621 273
>> mobil:  +420 777 093 799
>> www.linuxbox.cz
>>
>> mobil servis: +420 737 238 656
>> email servis: ser...@linuxbox.cz
>> -
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> [1] http://tracker.ceph.com/issues/39081
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







--
Jason






Re: [ceph-users] inline_data (was: CephFS and many small files)

2019-04-02 Thread Paul Emmerich
On Tue, Apr 2, 2019 at 3:05 PM Yan, Zheng  wrote:
>
> On Tue, Apr 2, 2019 at 8:23 PM Clausen, Jörn  wrote:
> >
> > Hi!
> >
> > Am 29.03.2019 um 23:56 schrieb Paul Emmerich:
> > > There's also some metadata overhead etc. You might want to consider
> > > enabling inline data in cephfs to handle small files in a
> > > store-efficient way (note that this feature is officially marked as
> > > experimental, though).
> > > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data
> >
> > Is there something missing from the documentation? I have turned on this
> > feature:
> >
>
> I don't use this feature.  We don't have plan to mark this feature
> stable. (probably we will remove this feature in the furthure).

We also don't use this feature in any of our production clusters
(because it's marked experimental).

But it seems like a really useful feature and I know of at least one
real-world production cluster using this with great success...
So why remove it?


Paul

>
> Yan, Zheng
>
>
>
> > $ ceph fs dump | grep inline_data
> > dumped fsmap epoch 1224
> > inline_data enabled
> >
> > I have reduced the size of the bonnie-generated files to 1 byte. But
> > this is the situation halfway into the test: (output slightly shortened)
> >
> > $ rados df
> > POOL_NAME  USED OBJECTS CLONES   COPIES
> > fs-data 3.2 MiB 3390041  0 10170123
> > fs-metadata 772 MiB2249  0 6747
> >
> > total_objects3392290
> > total_used   643 GiB
> > total_avail  957 GiB
> > total_space  1.6 TiB
> >
> > i.e. bonnie has created a little over 3 million files, for which the
> > same number of objects was created in the data pool. So the raw usage is
> > again at more than 500 GB.
> >
> > If the data was inlined, I would expect far less objects in the data
> > pool - actually none at all - and maybe some more usage in the metadata
> > pool.
> >
> > Do I have to restart any daemons after turning on inline_data? Am I
> > missing anything else here?
> >
> > For the record:
> >
> > $ ceph versions
> > {
> >  "mon": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 3
> >  },
> >  "mgr": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 3
> >  },
> >  "osd": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 16
> >  },
> >  "mds": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 2
> >  },
> >  "overall": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 24
> >  }
> > }
> >
> > --
> > Jörn Clausen
> > Daten- und Rechenzentrum
> > GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
> > Düsternbrookerweg 20
> > 24105 Kiel
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inline_data (was: CephFS and many small files)

2019-04-02 Thread Yan, Zheng
On Tue, Apr 2, 2019 at 9:05 PM Yan, Zheng  wrote:
>
> On Tue, Apr 2, 2019 at 8:23 PM Clausen, Jörn  wrote:
> >
> > Hi!
> >
> > Am 29.03.2019 um 23:56 schrieb Paul Emmerich:
> > > There's also some metadata overhead etc. You might want to consider
> > > enabling inline data in cephfs to handle small files in a
> > > store-efficient way (note that this feature is officially marked as
> > > experimental, though).
> > > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data
> >
> > Is there something missing from the documentation? I have turned on this
> > feature:
> >
>
> I don't use this feature.  We don't have plan to mark this feature
> stable. (probably we will remove this feature in the furthure).
>
I mean "don't use this feature"


> Yan, Zheng
>
>
>
> > $ ceph fs dump | grep inline_data
> > dumped fsmap epoch 1224
> > inline_data enabled
> >
> > I have reduced the size of the bonnie-generated files to 1 byte. But
> > this is the situation halfway into the test: (output slightly shortened)
> >
> > $ rados df
> > POOL_NAME  USED OBJECTS CLONES   COPIES
> > fs-data 3.2 MiB 3390041  0 10170123
> > fs-metadata 772 MiB2249  0 6747
> >
> > total_objects3392290
> > total_used   643 GiB
> > total_avail  957 GiB
> > total_space  1.6 TiB
> >
> > i.e. bonnie has created a little over 3 million files, for which the
> > same number of objects was created in the data pool. So the raw usage is
> > again at more than 500 GB.
> >
> > If the data was inlined, I would expect far less objects in the data
> > pool - actually none at all - and maybe some more usage in the metadata
> > pool.
> >
> > Do I have to restart any daemons after turning on inline_data? Am I
> > missing anything else here?
> >
> > For the record:
> >
> > $ ceph versions
> > {
> >  "mon": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 3
> >  },
> >  "mgr": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 3
> >  },
> >  "osd": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 16
> >  },
> >  "mds": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 2
> >  },
> >  "overall": {
> >  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> > nautilus (stable)": 24
> >  }
> > }
> >
> > --
> > Jörn Clausen
> > Daten- und Rechenzentrum
> > GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
> > Düsternbrookerweg 20
> > 24105 Kiel
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inline_data (was: CephFS and many small files)

2019-04-02 Thread Yan, Zheng
On Tue, Apr 2, 2019 at 8:23 PM Clausen, Jörn  wrote:
>
> Hi!
>
> Am 29.03.2019 um 23:56 schrieb Paul Emmerich:
> > There's also some metadata overhead etc. You might want to consider
> > enabling inline data in cephfs to handle small files in a
> > store-efficient way (note that this feature is officially marked as
> > experimental, though).
> > http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data
>
> Is there something missing from the documentation? I have turned on this
> feature:
>

I don't use this feature.  We have no plan to mark this feature
stable. (Probably we will remove this feature in the future.)

Yan, Zheng



> $ ceph fs dump | grep inline_data
> dumped fsmap epoch 1224
> inline_data enabled
>
> I have reduced the size of the bonnie-generated files to 1 byte. But
> this is the situation halfway into the test: (output slightly shortened)
>
> $ rados df
> POOL_NAME  USED OBJECTS CLONES   COPIES
> fs-data 3.2 MiB 3390041  0 10170123
> fs-metadata 772 MiB2249  0 6747
>
> total_objects3392290
> total_used   643 GiB
> total_avail  957 GiB
> total_space  1.6 TiB
>
> i.e. bonnie has created a little over 3 million files, for which the
> same number of objects was created in the data pool. So the raw usage is
> again at more than 500 GB.
>
> If the data was inlined, I would expect far less objects in the data
> pool - actually none at all - and maybe some more usage in the metadata
> pool.
>
> Do I have to restart any daemons after turning on inline_data? Am I
> missing anything else here?
>
> For the record:
>
> $ ceph versions
> {
>  "mon": {
>  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> nautilus (stable)": 3
>  },
>  "mgr": {
>  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> nautilus (stable)": 3
>  },
>  "osd": {
>  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> nautilus (stable)": 16
>  },
>  "mds": {
>  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> nautilus (stable)": 2
>  },
>  "overall": {
>  "ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc)
> nautilus (stable)": 24
>  }
> }
>
> --
> Jörn Clausen
> Daten- und Rechenzentrum
> GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
> Düsternbrookerweg 20
> 24105 Kiel
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Jason Dillaman
On Tue, Apr 2, 2019 at 8:42 AM Eugen Block  wrote:
>
> Hi,
>
> > If you run "rbd snap ls --all", you should see a snapshot in
> > the "trash" namespace.
>
> I just tried the command "rbd snap ls --all" on a lab cluster
> (nautilus) and get this error:
>
> ceph-2:~ # rbd snap ls --all
> rbd: image name was not specified

Sorry -- you need the "<image-spec>" as part of that command.

> Are there any requirements I haven't noticed? This lab cluster was
> upgraded from Mimic a couple of weeks ago.
>
> ceph-2:~ # ceph version
> ceph version 14.1.0-559-gf1a72cff25
> (f1a72cff2522833d16ff057ed43eeaddfc17ea8a) nautilus (dev)
>
> Regards,
> Eugen
>
>
> Zitat von Jason Dillaman :
>
> > On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich
> >  wrote:
> >>
> >> Hi,
> >>
> >> on one of my clusters, I'm getting error message which is getting
> >> me a bit nervous.. while listing contents of a pool I'm getting
> >> error for one of images:
> >>
> >> [root@node1 ~]# rbd ls -l nvme > /dev/null
> >> rbd: error processing image  xxx: (2) No such file or directory
> >>
> >> [root@node1 ~]# rbd info nvme/xxx
> >> rbd image 'xxx':
> >> size 60 GiB in 15360 objects
> >> order 22 (4 MiB objects)
> >> id: 132773d6deb56
> >> block_name_prefix: rbd_data.132773d6deb56
> >> format: 2
> >> features: layering, operations
> >> op_features: snap-trash
> >> flags:
> >> create_timestamp: Wed Aug 29 12:25:13 2018
> >>
> >> volume contains production data and seems to be working correctly (it's 
> >> used
> >> by VM)
> >>
> >> is this something to worry about? What is snap-trash feature?
> >> wasn't able to google
> >> much about it..
> >
> > This implies that you are (or were) using transparent image clones and
> > that you deleted a snapshot that had one or more child images attached
> > to it. If you run "rbd snap ls --all", you should see a snapshot in
> > the "trash" namespace. You can also list its child images by running
> > "rbd children --snap-id  ".
> >
> > There definitely is an issue w/ the "rbd ls --long" command in that
> > when it attempts to list all snapshots in the image, it is incorrectly
> > using the snapshot's name instead of it's ID. I've opened a tracker
> > ticket to get the bug fixed [1]. It was fixed in Nautilus but it
> > wasn't flagged for backport to Mimic.
> >
> >> I'm running ceph 13.2.4 on centos 7.
> >>
> >> I'd be gratefull any help
> >>
> >> BR
> >>
> >> nik
> >>
> >>
> >> --
> >> -
> >> Ing. Nikola CIPRICH
> >> LinuxBox.cz, s.r.o.
> >> 28.rijna 168, 709 00 Ostrava
> >>
> >> tel.:   +420 591 166 214
> >> fax:+420 596 621 273
> >> mobil:  +420 777 093 799
> >> www.linuxbox.cz
> >>
> >> mobil servis: +420 737 238 656
> >> email servis: ser...@linuxbox.cz
> >> -
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > [1] http://tracker.ceph.com/issues/39081
> >
> > --
> > Jason
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason


Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Eugen Block

Hi,


If you run "rbd snap ls --all", you should see a snapshot in
the "trash" namespace.


I just tried the command "rbd snap ls --all" on a lab cluster  
(nautilus) and get this error:


ceph-2:~ # rbd snap ls --all
rbd: image name was not specified

Are there any requirements I haven't noticed? This lab cluster was  
upgraded from Mimic a couple of weeks ago.


ceph-2:~ # ceph version
ceph version 14.1.0-559-gf1a72cff25  
(f1a72cff2522833d16ff057ed43eeaddfc17ea8a) nautilus (dev)


Regards,
Eugen


Zitat von Jason Dillaman :


On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich
 wrote:


Hi,

on one of my clusters, I'm getting error message which is getting
me a bit nervous.. while listing contents of a pool I'm getting
error for one of images:

[root@node1 ~]# rbd ls -l nvme > /dev/null
rbd: error processing image  xxx: (2) No such file or directory

[root@node1 ~]# rbd info nvme/xxx
rbd image 'xxx':
size 60 GiB in 15360 objects
order 22 (4 MiB objects)
id: 132773d6deb56
block_name_prefix: rbd_data.132773d6deb56
format: 2
features: layering, operations
op_features: snap-trash
flags:
create_timestamp: Wed Aug 29 12:25:13 2018

volume contains production data and seems to be working correctly (it's used
by VM)

is this something to worry about? What is snap-trash feature?  
wasn't able to google

much about it..


This implies that you are (or were) using transparent image clones and
that you deleted a snapshot that had one or more child images attached
to it. If you run "rbd snap ls --all", you should see a snapshot in
the "trash" namespace. You can also list its child images by running
"rbd children --snap-id  ".

There definitely is an issue w/ the "rbd ls --long" command in that
when it attempts to list all snapshots in the image, it is incorrectly
using the snapshot's name instead of it's ID. I've opened a tracker
ticket to get the bug fixed [1]. It was fixed in Nautilus but it
wasn't flagged for backport to Mimic.


I'm running ceph 13.2.4 on centos 7.

I'd be gratefull any help

BR

nik


--
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-



[1] http://tracker.ceph.com/issues/39081

--
Jason






Re: [ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Jason Dillaman
On Tue, Apr 2, 2019 at 4:19 AM Nikola Ciprich
 wrote:
>
> Hi,
>
> on one of my clusters, I'm getting error message which is getting
> me a bit nervous.. while listing contents of a pool I'm getting
> error for one of images:
>
> [root@node1 ~]# rbd ls -l nvme > /dev/null
> rbd: error processing image  xxx: (2) No such file or directory
>
> [root@node1 ~]# rbd info nvme/xxx
> rbd image 'xxx':
> size 60 GiB in 15360 objects
> order 22 (4 MiB objects)
> id: 132773d6deb56
> block_name_prefix: rbd_data.132773d6deb56
> format: 2
> features: layering, operations
> op_features: snap-trash
> flags:
> create_timestamp: Wed Aug 29 12:25:13 2018
>
> volume contains production data and seems to be working correctly (it's used
> by VM)
>
> is this something to worry about? What is snap-trash feature? wasn't able to 
> google
> much about it..

This implies that you are (or were) using transparent image clones and
that you deleted a snapshot that had one or more child images attached
to it. If you run "rbd snap ls --all", you should see a snapshot in
the "trash" namespace. You can also list its child images by running
"rbd children --snap-id  ".

There definitely is an issue w/ the "rbd ls --long" command in that
when it attempts to list all snapshots in the image, it is incorrectly
using the snapshot's name instead of it's ID. I've opened a tracker
ticket to get the bug fixed [1]. It was fixed in Nautilus but it
wasn't flagged for backport to Mimic.

> I'm running ceph 13.2.4 on centos 7.
>
> I'd be gratefull any help
>
> BR
>
> nik
>
>
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[1] http://tracker.ceph.com/issues/39081

-- 
Jason


Re: [ceph-users] inline_data (was: CephFS and many small files)

2019-04-02 Thread Clausen , Jörn

Hi!

Am 29.03.2019 um 23:56 schrieb Paul Emmerich:

There's also some metadata overhead etc. You might want to consider
enabling inline data in cephfs to handle small files in a
store-efficient way (note that this feature is officially marked as
experimental, though).
http://docs.ceph.com/docs/master/cephfs/experimental-features/#inline-data


Is there something missing from the documentation? I have turned on this 
feature:


$ ceph fs dump | grep inline_data
dumped fsmap epoch 1224
inline_data enabled

I have reduced the size of the bonnie-generated files to 1 byte. But 
this is the situation halfway into the test: (output slightly shortened)


$ rados df
POOL_NAME  USED OBJECTS CLONES   COPIES
fs-data 3.2 MiB 3390041  0 10170123
fs-metadata 772 MiB2249  0 6747

total_objects3392290
total_used   643 GiB
total_avail  957 GiB
total_space  1.6 TiB

i.e. bonnie has created a little over 3 million files, for which the 
same number of objects was created in the data pool. So the raw usage is 
again at more than 500 GB.


If the data was inlined, I would expect far less objects in the data 
pool - actually none at all - and maybe some more usage in the metadata 
pool.


Do I have to restart any daemons after turning on inline_data? Am I 
missing anything else here?
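
One crude way to check whether a particular small file got inlined (my own sketch,
assuming the data pool is called fs-data, the filesystem is mounted at /mnt/cephfs,
and that CephFS names a file's first object <inode-in-hex>.00000000):

ino=$(stat -c %i /mnt/cephfs/path/to/small-file)
rados -p fs-data stat "$(printf '%x.00000000' "$ino")"
# "No such file or directory" here would suggest the data really is stored
# inline with the inode rather than as a data-pool object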


For the record:

$ ceph versions
{
"mon": {
"ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) 
nautilus (stable)": 3

},
"mgr": {
"ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) 
nautilus (stable)": 3

},
"osd": {
"ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) 
nautilus (stable)": 16

},
"mds": {
"ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) 
nautilus (stable)": 2

},
"overall": {
"ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) 
nautilus (stable)": 24

}
}

--
Jörn Clausen
Daten- und Rechenzentrum
GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
Düsternbrookerweg 20
24105 Kiel





Re: [ceph-users] Ceph nautilus upgrade problem

2019-04-02 Thread Stefan Kooman
Quoting Stadsnet (jwil...@stads.net):
> On 26-3-2019 16:39, Ashley Merrick wrote:
> >Have you upgraded any OSD's?
> 
> 
> No didn't go through with the osd's

Just checking here: are you sure all PGs have been scrubbed while
running Luminous? As the release notes [1] mention this:

"If you are unsure whether or not your Luminous cluster has completed a
full scrub of all PGs, you can check your clusters state by running:

# ceph osd dump | grep ^flags

In order to be able to proceed to Nautilus, your OSD map must include
the recovery_deletes and purged_snapdirs flags."
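
A quick pre-flight check along those lines (just a sketch; the flag names are the
ones from the release notes quoted above):

flags=$(ceph osd dump | grep ^flags)
for f in recovery_deletes purged_snapdirs; do
    echo "$flags" | grep -q "$f" || echo "OSD map is missing flag: $f -- finish scrubbing on Luminous first"
done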

Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous

P.s. I expect most users upgrade to Mimic first, then go to Nautilus.
It might be a better tested upgrade path ... 


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Erasure Coding failure domain (again)

2019-04-02 Thread Hector Martin

On 02/04/2019 18.27, Christian Balzer wrote:

I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
pool with 1024 PGs.


(20 choose 2) is 190, so you're never going to have more than that many 
unique sets of OSDs.


I just looked at the OSD distribution for a replica 3 pool across 48 
OSDs with 4096 PGs that I have and the result is reasonable. There are 
3782 unique OSD tuples, out of (48 choose 3) = 17296 options. Since this 
is a random process, due to the birthday paradox, some duplicates are 
expected after only the order of 17296^0.5 = ~131 PGs; at 4096 PGs 
having 3782 unique choices seems to pass the gut feeling test. Too lazy 
to do the math closed form, but here's a quick simulation:


>>> len(set(random.randrange(17296) for i in range(4096)))
3671

So I'm actually slightly ahead.
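
The closed form the simulation approximates (my note, assuming each PG picks its
OSD tuple uniformly and independently, which CRUSH only roughly does):

  E[\text{unique tuples}] = N\left(1 - (1 - \tfrac{1}{N})^{k}\right)
                          = 17296\left(1 - (1 - \tfrac{1}{17296})^{4096}\right) \approx 3647

which is in the same ballpark as the simulated 3671 and the observed 3782.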

At the numbers in my previous example (1500 OSDs, 50k pool PGs), 
statistically you should get something like ~3 collisions on average, so 
negligible.



Another thing to look at here is of course critical period and disk
failure probabilities, these guys explain the logic behind their
calculator, would be delighted if you could have a peek and comment.

https://www.memset.com/support/resources/raid-calculator/


I'll take a look tonight :)

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] Moving pools between cluster

2019-04-02 Thread Stefan Kooman
Quoting Burkhard Linke (burkhard.li...@computational.bio.uni-giessen.de):
> Hi,
> Images:
> 
> Straight-forward attempt would be exporting all images with qemu-img from
> one cluster, and uploading them again on the second cluster. But this will
> break snapshots, protections etc.

You can use rbd-mirror [1] (RBD mirroring requires the Ceph Jewel release or
later.). You do need to be able to set the "journaling" and
"exclusive-lock" feature on the rbd images (rbd feature enable
{pool-name}/{image-name} --image-feature exclusive-lock,journaling).
This will preserve snapshots, etc. When everything is mirrored you can
shutdown the VMs (or one by one) and promote the image(s) on the new
cluster, and have the VM(s) use the new cluster for their storage.
Note: You can also mirror a whole pool instead of mirroring on image level.
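
A hedged sketch of the per-image variant (pool/image names are made up, and it
assumes an rbd-mirror daemon is already running against the destination cluster and
the pools have been peered with "rbd mirror pool peer add"):

# on the source cluster
rbd feature enable mypool/myimage exclusive-lock,journaling
rbd mirror pool enable mypool image
rbd mirror image enable mypool/myimage
# once the destination has caught up and the VM is stopped
rbd mirror image demote mypool/myimage     # source side
rbd mirror image promote mypool/myimage    # destination side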

Gr. Stefan

[1]: http://docs.ceph.com/docs/mimic/rbd/rbd-mirroring/

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Erasure Coding failure domain (again)

2019-04-02 Thread Christian Balzer


Hello Hector,

Firstly I'm so happy somebody actually replied.

On Tue, 2 Apr 2019 16:43:10 +0900 Hector Martin wrote:

> On 31/03/2019 17.56, Christian Balzer wrote:
> > Am I correct that unlike with with replication there isn't a maximum size
> > of the critical path OSDs?  
> 
> As far as I know, the math for calculating the probability of data loss 
> wrt placement groups is the same for EC and for replication. Replication 
> to n copies should be equivalent to EC with k=1 and m=(n-1).
> 
> > Meaning that with replication x3 and typical values of 100 PGs per OSD at
> > most 300 OSDs form a set out of which 3 OSDs need to fail for data loss.
> > The statistical likelihood for that based on some assumptions
> > is significant, but not nightmarishly so.
> > A cluster with 1500 OSDs in total is thus as susceptible as one with just
> > 300.
> > Meaning that 3 disk losses in the big cluster don't necessarily mean data
> > loss at all.  
> 
> Someone might correct me on this, but here's my take on the math.
> 
> If you have 100 PGs per OSD, 1500 OSDs, and replication 3, you have:
> 
> 1500 * 100 / 3 = 50000 pool PGs, and thus 50000 (hopefully) different 
> 3-sets of OSDs.
>
I think your math is essentially correct, but so seems to be the
"hopefully" part.

I did a quick peek at my test cluster (20 OSDs, 5 hosts) and a replica 2
pool with 1024 PGs.
Which should give us 1000 sets of OSDs to choose from given your formula.
Just looking at OSD 0 and the first 6 other OSDs out of that list of 1024
PGs gives us this:
---
UP_PRIMARY  ACTING
[0,1]  0  
[0,2]  0   
[0,2]  0   
[0,2]  0   
[0,3]  0
[0,3]  0  
[0,3]  0   
[0,3]  0   
[0,3]  0   
[0,5]  0   
[0,5]  0   
[0,5]  0   
[0,5]  0   
[0,5]  0   
[0,6]  0   
[0,6]  0   
[1,0]  1   
[1,0]  1   
[1,0]  1   
[2,0]  2   
[2,0]  2   
[2,0]  2   
[3,0]  3   
[3,0]  3   
[5,0]  5   
[5,0]  5   
[5,0]  5   
[6,0]  6   
[6,0]  6  
[6,0]  6  
---

So this looks significantly worse than the theoretical set of choices.
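
For anyone who wants to repeat that peek, a rough one-liner I'd try (assumptions:
the pool has id 1, and the UP set is the third column of "ceph pg dump pgs_brief"):

ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^1\./ {print $3}' | sort -u | wc -l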
 
Another thing to look at here is of course critical period and disk
failure probabilities, these guys explain the logic behind their
calculator, would be delighted if you could have a peek and comment.

https://www.memset.com/support/resources/raid-calculator/

Thanks again for the feedback!

Christian

> (1500 choose 3) = 561375500 possible sets of 3 OSDs
> 
> Therefore if you lose 3 random OSDs, your chance of (any) data loss is 
> 50000/561375500 = ~0.009%. (and if you *do* get unlucky and hit the 
> wrong set of 3 OSDs, you can expect to lose 1/50000 = ~0.002% of your data)
> 
> > However it feels that with EC all OSDs can essentially be in the same set
> > and thus having 6 out of 1500 OSDs fail in a 10+5 EC pool with 100 PGs per
> > OSD would affect every last object in that cluster, not just a subset.  
> 
> The math should work essentially the same way:
> 
> 1500 * 100 / 15 = 10000 15-sets of OSDs
> 
> (1500 choose 15) = 3.1215495e+35 possible 15-sets of OSDs
> 
> Now if 6 OSDs fail that will affect many potential 15-sets of OSDs 
> chosen with the remaining OSD in the cluster:
> 
> ((1500 - 6) choose 9) = 9.9748762e+22
> 
> Putting it together, the chance of any data loss from a simultaneous 
> loss of 6 random OSDs:
> 
> 10000 * 9.9748762e+22 / 3.1215495e+35 = ~3.2e-09, i.e. ~0.0000003%
> 
> And if you *do* get unlucky you can expect to lose 1/10000 = ~0.01% of 
> your data.
> 
> So your chance of data loss is much smaller with such a wide EC 
> encoding, but if you do lose a PG you'll lose more data because there 
> are fewer PGs.
> 
> Feedback on my math welcome.
> -- 
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


[ceph-users] Moving pools between cluster

2019-04-02 Thread Burkhard Linke

Hi,


we are about to set up a new Ceph cluster for our OpenStack cloud. Ceph 
is used for images, volumes and object storage. I'm unsure how to handle 
these cases and how to move the data correctly.



Object storage:

I consider this the easiest case, since RGW itself provides the 
necessary means to synchronize clusters. But the pools are rather small 
(~5 TB for buckets), so maybe there's an easier way? How does RGW refer 
to the various pools internally? By name? By ID? (ID would be a problem, 
since a simple pool copy won't work in this case)



Images:

Straight-forward attempt would be exporting all images with qemu-img 
from one cluster, and uploading them again on the second cluster. But 
this will break snapshots, protections etc.
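
(For completeness, the pipe I would try for that straight-forward path, as a sketch
only, with made-up conf paths and an illustrative pool/image name:

rbd -c /etc/ceph/old-cluster.conf export images/<image-name> - \
  | rbd -c /etc/ceph/new-cluster.conf import - images/<image-name>

It avoids a temporary file, but as said it loses snapshots and clone relationships.)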



Volumes:

This is the most difficult case. The pool is the largest one affected 
(~60 TB), and many volumes are boot-from-volume instances acting as COW 
copy for an image. I would prefer not to flatten these images and thus 
generate a lot more data.



There are other pools we use outside of Openstack, so adding the new 
hosts to the existing cluster, moving the data by crush rules and 
splitting the cluster afterwards is not an option. Keeping all hosts in 
a single cluster and separating the pools logically within crush is also 
undesired due to administrative reasons (but will be the last resort if 
necessary).



Any comments on this? How did you move individual pools to a new cluster 
in the past?



Regards,

Burkhard




[ceph-users] rbd: error processing image xxx (2) No such file or directory

2019-04-02 Thread Nikola Ciprich
Hi,

on one of my clusters, I'm getting an error message which is making
me a bit nervous: while listing the contents of a pool I get an
error for one of the images:

[root@node1 ~]# rbd ls -l nvme > /dev/null
rbd: error processing image  xxx: (2) No such file or directory

[root@node1 ~]# rbd info nvme/xxx
rbd image 'xxx':
size 60 GiB in 15360 objects
order 22 (4 MiB objects)
id: 132773d6deb56
block_name_prefix: rbd_data.132773d6deb56
format: 2
features: layering, operations
op_features: snap-trash
flags: 
create_timestamp: Wed Aug 29 12:25:13 2018

volume contains production data and seems to be working correctly (it's used
by VM)

is this something to worry about? What is the snap-trash feature? I wasn't able
to google much about it.

I'm running ceph 13.2.4 on centos 7.

I'd be grateful for any help.

BR

nik


-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


Re: [ceph-users] Erasure Coding failure domain (again)

2019-04-02 Thread Hector Martin

On 31/03/2019 17.56, Christian Balzer wrote:

Am I correct that unlike with with replication there isn't a maximum size
of the critical path OSDs?


As far as I know, the math for calculating the probability of data loss 
wrt placement groups is the same for EC and for replication. Replication 
to n copies should be equivalent to EC with k=1 and m=(n-1).



Meaning that with replication x3 and typical values of 100 PGs per OSD at
most 300 OSDs form a set out of which 3 OSDs need to fail for data loss.
The statistical likelihood for that based on some assumptions
is significant, but not nightmarishly so.
A cluster with 1500 OSDs in total is thus as susceptible as one with just
300.
Meaning that 3 disk losses in the big cluster don't necessarily mean data
loss at all.


Someone might correct me on this, but here's my take on the math.

If you have 100 PGs per OSD, 1500 OSDs, and replication 3, you have:

1500 * 100 / 3 = 50000 pool PGs, and thus 50000 (hopefully) different 
3-sets of OSDs.


(1500 choose 3) = 561375500 possible sets of 3 OSDs

Therefore if you lose 3 random OSDs, your chance of (any) data loss is 
50000/561375500 = ~0.009%. (and if you *do* get unlucky and hit the 
wrong set of 3 OSDs, you can expect to lose 1/50000 = ~0.002% of your data)



However it feels that with EC all OSDs can essentially be in the same set
and thus having 6 out of 1500 OSDs fail in a 10+5 EC pool with 100 PGs per
OSD would affect every last object in that cluster, not just a subset.


The math should work essentially the same way:

1500 * 100 / 15 = 10000 15-sets of OSDs

(1500 choose 15) = 3.1215495e+35 possible 15-sets of OSDs

Now if 6 OSDs fail that will affect many potential 15-sets of OSDs 
chosen with the remaining OSD in the cluster:


((1500 - 6) choose 9) = 9.9748762e+22

Putting it together, the chance of any data loss from a simultaneous 
loss of 6 random OSDs:


10000 * 9.9748762e+22 / 3.1215495e+35 = ~3.2e-09, i.e. ~0.0000003%

And if you *do* get unlucky you can expect to lose 1/10000 = ~0.01% of 
your data.
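
An equivalent way to write the 6-failure case (my own re-derivation as a sanity
check, not part of the original reasoning):

  P(\text{any PG lost}) \approx n_{\text{PG}} \cdot \frac{\binom{15}{6}}{\binom{1500}{6}}
                        = 10000 \cdot \frac{5005}{1.57 \times 10^{16}} \approx 3.2 \times 10^{-9}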


So your chance of data loss is much smaller with such a wide EC 
encoding, but if you do lose a PG you'll lose more data because there 
are fewer PGs.


Feedback on my math welcome.
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] MDS stuck at replaying status

2019-04-02 Thread Yan, Zheng
please set debug_mds=10, and try again
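
One way to do that without editing ceph.conf (a sketch; "WXS0023" is the MDS name
taken from the log below, and the first form has to be run on that MDS host):

ceph daemon mds.WXS0023 config set debug_mds 10
# or, from any node with an admin keyring:
ceph tell mds.WXS0023 injectargs '--debug_mds 10'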

On Tue, Apr 2, 2019 at 1:01 PM Albert Yue  wrote:
>
> Hi,
>
> This happens after we restart the active MDS, and somehow the standby MDS 
> daemon cannot take over successfully and is stuck at up:replaying. It is 
> showing the following log. Any idea on how to fix this?
>
> 2019-04-02 12:54:00.985079 7f6f70670700  1 mds.WXS0023 respawn
> 2019-04-02 12:54:00.985095 7f6f70670700  1 mds.WXS0023  e: '/usr/bin/ceph-mds'
> 2019-04-02 12:54:00.985097 7f6f70670700  1 mds.WXS0023  0: '/usr/bin/ceph-mds'
> 2019-04-02 12:54:00.985099 7f6f70670700  1 mds.WXS0023  1: '-f'
> 2019-04-02 12:54:00.985100 7f6f70670700  1 mds.WXS0023  2: '--cluster'
> 2019-04-02 12:54:00.985101 7f6f70670700  1 mds.WXS0023  3: 'ceph'
> 2019-04-02 12:54:00.985102 7f6f70670700  1 mds.WXS0023  4: '--id'
> 2019-04-02 12:54:00.985103 7f6f70670700  1 mds.WXS0023  5: 'WXS0023'
> 2019-04-02 12:54:00.985104 7f6f70670700  1 mds.WXS0023  6: '--setuser'
> 2019-04-02 12:54:00.985105 7f6f70670700  1 mds.WXS0023  7: 'ceph'
> 2019-04-02 12:54:00.985106 7f6f70670700  1 mds.WXS0023  8: '--setgroup'
> 2019-04-02 12:54:00.985107 7f6f70670700  1 mds.WXS0023  9: 'ceph'
> 2019-04-02 12:54:00.985142 7f6f70670700  1 mds.WXS0023 respawning with exe 
> /usr/bin/ceph-mds
> 2019-04-02 12:54:00.985145 7f6f70670700  1 mds.WXS0023  exe_path 
> /proc/self/exe
> 2019-04-02 12:54:02.139272 7ff8a739a200  0 ceph version 12.2.5 
> (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process 
> (unknown), pid 3369045
> 2019-04-02 12:54:02.141565 7ff8a739a200  0 pidfile_write: ignore empty 
> --pid-file
> 2019-04-02 12:54:06.675604 7ff8a0ecd700  1 mds.WXS0023 handle_mds_map standby
> 2019-04-02 12:54:26.114757 7ff8a0ecd700  1 mds.0.136021 handle_mds_map i am 
> now mds.0.136021
> 2019-04-02 12:54:26.114764 7ff8a0ecd700  1 mds.0.136021 handle_mds_map state 
> change up:boot --> up:replay
> 2019-04-02 12:54:26.114779 7ff8a0ecd700  1 mds.0.136021 replay_start
> 2019-04-02 12:54:26.114784 7ff8a0ecd700  1 mds.0.136021  recovery set is
> 2019-04-02 12:54:26.114789 7ff8a0ecd700  1 mds.0.136021  waiting for osdmap 
> 14333 (which blacklists prior instance)
> 2019-04-02 12:54:26.141256 7ff89a6c0700  0 mds.0.cache creating system inode 
> with ino:0x100
> 2019-04-02 12:54:26.141454 7ff89a6c0700  0 mds.0.cache creating system inode 
> with ino:0x1
> 2019-04-02 12:54:50.148022 7ff89dec7700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:54:50.148049 7ff89dec7700  1 mds.beacon.WXS0023 _send skipping 
> beacon, heartbeat map not healthy
> 2019-04-02 12:54:52.143637 7ff8a1ecf700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:54:54.148122 7ff89dec7700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:54:54.148157 7ff89dec7700  1 mds.beacon.WXS0023 _send skipping 
> beacon, heartbeat map not healthy
> 2019-04-02 12:54:57.143730 7ff8a1ecf700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:54:58.148239 7ff89dec7700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:54:58.148249 7ff89dec7700  1 mds.beacon.WXS0023 _send skipping 
> beacon, heartbeat map not healthy
> 2019-04-02 12:55:02.143819 7ff8a1ecf700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:55:02.148311 7ff89dec7700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:55:02.148330 7ff89dec7700  1 mds.beacon.WXS0023 _send skipping 
> beacon, heartbeat map not healthy
> 2019-04-02 12:55:06.148393 7ff89dec7700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:55:06.148416 7ff89dec7700  1 mds.beacon.WXS0023 _send skipping 
> beacon, heartbeat map not healthy
> 2019-04-02 12:55:07.143914 7ff8a1ecf700  1 heartbeat_map is_healthy 'MDSRank' 
> had timed out after 15
> 2019-04-02 12:55:07.615602 7ff89e6c8700  1 heartbeat_map reset_timeout 
> 'MDSRank' had timed out after 15
> 2019-04-02 12:55:07.618294 7ff8a0ecd700  1 mds.WXS0023 map removed me (mds.-1 
> gid:7441294) from cluster due to lost contact; respawning
> 2019-04-02 12:55:07.618296 7ff8a0ecd700  1 mds.WXS0023 respawn
> 2019-04-02 12:55:07.618314 7ff8a0ecd700  1 mds.WXS0023  e: '/usr/bin/ceph-mds'
> 2019-04-02 12:55:07.618318 7ff8a0ecd700  1 mds.WXS0023  0: '/usr/bin/ceph-mds'
> 2019-04-02 12:55:07.618319 7ff8a0ecd700  1 mds.WXS0023  1: '-f'
> 2019-04-02 12:55:07.618320 7ff8a0ecd700  1 mds.WXS0023  2: '--cluster'
> 2019-04-02 12:55:07.618320 7ff8a0ecd700  1 mds.WXS0023  3: 'ceph'
> 2019-04-02 12:55:07.618321 7ff8a0ecd700  1 mds.WXS0023  4: '--id'
> 2019-04-02 12:55:07.618321 7ff8a0ecd700  1 mds.WXS0023  5: 'WXS0023'
> 2019-04-02 12:55:07.618322 7ff8a0ecd700  1 mds.WXS0023  6: '--setuser'
> 2019-04-02 12:55:07.618323 7ff8a0ecd700  1 mds.WXS0023  7: 'ceph'
> 2019-04-02 12:55:07.618323 7ff8a0ecd700  1 mds.WXS0023  8: '--setgroup'
> 2019-04-02 12:55:07.618325 7ff8a0ecd700  1 mds.WXS0023  9: 'ceph'
>