Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-14 Thread Massimo Sgaravatto
As I wrote here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html

I saw the same after an update from Luminous to Nautilus 14.2.6

Cheers, Massimo

On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan  wrote:

> Hi,
>
> I am getting one inconsistent object on our cluster with an inconsistency
> error that I haven’t seen before.  This started happening during a rolling
> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s
> related.
>
> I was hoping to know what the error means before trying a repair.
>
> [root@objmon04 ~]# ceph health detail
> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg
> inconsistent
> OSDMAP_FLAGS noout flag(s) set
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>
> rados list-inconsistent-obj 9.20e --format=json-pretty
> {
> "epoch": 759019,
> "inconsistents": [
> {
> "object": {
> "name":
> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 692875
> },
> "errors": [
> "size_too_large"
> ],
> "union_shard_errors": [],
> "selected_object_info": {
> "oid": {
> "oid":
> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
> "key": "",
> "snapid": -2,
> "hash": 3321413134,
> "max": 0,
> "pool": 9,
> "namespace": ""
> },
> "version": "281183'692875",
> "prior_version": "281183'692874",
> "last_reqid": "client.34042469.0:206759091",
> "user_version": 692875,
> "size": 146097278,
> "mtime": "2017-07-03 12:43:35.569986",
> "local_mtime": "2017-07-03 12:43:35.571196",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 0,
> "truncate_size": 0,
> "data_digest": "0xf19c8035",
> "omap_digest": "0x",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> },
> "shards": [
> {
> "osd": 509,
> "primary": true,
> "errors": [],
> "size": 146097278
> },
> {
> "osd": 659,
> "primary": false,
> "errors": [],
> "size": 146097278
> },
> {
> "osd": 674,
> "primary": false,
> "errors": [],
> "size": 146097278
> }
> ]
> }
> ]
> }
>
> Thanks,
> Liam
> —
> Senior Developer
> Institute for Advanced Computer Studies
> University of Maryland
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] One lost cephfs data object

2020-01-14 Thread Andrew Denton
Hi all,

I'm on 13.2.6. My cephfs has managed to lose one single object from
its data pool. All the cephfs docs I'm finding show me how to recover
from an entire lost PG, but the rest of the PG checks out as far as I
can tell. How can I track down which file that object belongs to?
I'm missing "102e2aa.3721" in pg 16.d7. Pool 16 is an EC cephfs
data pool called cephfs_ecdata (this data pool is assigned to a
directory by ceph.dir.layout). We store backups in this data pool, so
we'll likely be fine just deleting the file.
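
My best guess so far for mapping it back to a file: the first part of a cephfs
data object name should be the file's inode number in hex, so something like
this (untested sketch, assuming the filesystem is mounted at /mnt/cephfs)
ought to find the owning file:

# hex inode prefix of the object name -> decimal inode number
printf '%d\n' 0x102e2aa
# search the mounted filesystem for that inode
find /mnt/cephfs -inum $(printf '%d' 0x102e2aa)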

# ceph health detail
HEALTH_ERR 60758/81263036 objects misplaced (0.075%); 1/16673236
objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound;
Degraded data redundancy: 1/81263036 objects degraded (0.000%), 1 pg
degraded
OBJECT_MISPLACED 60758/81263036 objects misplaced (0.075%)
OBJECT_UNFOUND 1/16673236 objects unfound (0.000%)
pg 16.d7 has 1 unfound objects
PG_DAMAGED Possible data damage: 1 pg recovery_unfound
pg 16.d7 is active+recovery_unfound+degraded+remapped, acting
[48,8,30,11,42], 1 unfound
PG_DEGRADED Degraded data redundancy: 1/81263036 objects degraded
(0.000%), 1 pg degraded
pg 16.d7 is active+recovery_unfound+degraded+remapped, acting
[48,8,30,11,42], 1 unfound


# ceph pg 16.d7 list_missing
{
"offset": {
"oid": "",
"key": "",
"snapid": 0,
"hash": 0,
"max": 0,
"pool": -9223372036854775808,
"namespace": ""
},
"num_missing": 1,
"num_unfound": 1,
"objects": [
{
"oid": {
"oid": "102e2aa.3721",
"key": "",
"snapid": -2,
"hash": 2685987031,
"max": 0,
"pool": 16,
"namespace": ""
},
"need": "41610'2203339",
"have": "0'0",
"flags": "none",
"locations": [
"42(4)"
]
}
],
"more": false
}

At one point this object showed its map as

# ceph osd map cephfs_ecdata "102e2aa.3721"
osdmap e45659 pool 'cephfs_ecdata' (16) object '102e2aa.3721'
-> pg 16.a018e8d7 (16.d7) -> up ([48,52,30,11,44], p48) acting
([48,8,30,11,NONE], p48)

but I restarted osd.44, and now it's showing 

# ceph osd map cephfs_ecdata "102e2aa.3721"
osdmap e45679 pool 'cephfs_ecdata' (16) object '102e2aa.3721'
-> pg 16.a018e8d7 (16.d7) -> up ([48,52,30,11,44], p48) acting
([48,8,30,11,42], p48)

Thanks,
Andrew
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] units of metrics

2020-01-14 Thread Stefan Kooman
Quoting Robert LeBlanc (rob...@leblancnet.us):
> 
> req_create
> req_getattr
> req_readdir
> req_lookupino
> req_open
> req_unlink
> 
> We were graphing these as ops, but using the new avgcount, we are getting
> very different values, so I'm wondering if we are choosing the wrong new
> value, or we misunderstood what the old value really was and have been
> plotting it wrong all this time.

I think it's the latter: you weren't plotting what you thought you were. We are
using the telegraf plugin from the manager and "mds.request" from
"ceph_daemon_stats" to plot the number of requests. 

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool Max Avail and Ceph Dashboard Pool Useage on Nautilus giving different percentages

2020-01-14 Thread ceph
Does anyone know if this also respects the nearfull values?

Thank you in advance
Mehmet

On 14 January 2020 15:20:39 CET, Stephan Mueller  wrote:
>Hi,
>I sent out this message on the 19th of December and somehow it didn't
>get onto the list, and I only noticed that now. Sorry for the delay.
>I tried to resend it, but it just returned the same error that the mail was
>not deliverable to the ceph mailing list. I will send the message
>below as soon as it's finally possible, but for now this should help you
>out.
>
>Stephan
>
>--
>
>Hi,
>
>if "MAX AVAIL" displays the wrong data, the bug is just made more
>visible through the dashboard, as the calculation is correct.
>
>To get the right percentage you have to divide the used space through
>the total, and the total can only consist of two states used and not
>used space, so both states will be added together to get the total.
>
>Or in short:
>
>used / (avail + used)
>
>Just looked into the C++ code - Max avail will be calculated the
>following way:
>
>avail_res = avail / raw_used_rate (
>https://github.com/ceph/ceph/blob/nautilus/src/mon/PGMap.cc#L905)
>
>raw_used_rate *= (sum.num_object_copies - sum.num_objects_degraded) /
>sum.num_object_copies
>(https://github.com/ceph/ceph/blob/nautilus/src/mon/PGMap.cc#L892)
>
>
>On Tuesday, 17.12.2019 at 07:07 +0100, c...@elchaka.de wrote:
>> I have observed this in the ceph nautilus dashboard too - and think
>> it is a display bug... but sometimes it shows the right values
>> 
>> 
>> Which nautilus version do you use?
>> 
>> 
>> On 10 December 2019 14:31:05 CET, "David Majchrzak, ODERLAND
>> Webbhotell AB"  wrote:
>> > Hi!
>> > 
>> > While browsing /#/pool in nautilus ceph dashboard I noticed it said
>> > 93%
>> > used on the single pool we have (3x replica).
>> > 
>> > ceph df detail however shows 81% used on the pool and 67% raw
>> > usage.
>> > 
>> > # ceph df detail
>> > RAW STORAGE:
>> >CLASS SIZEAVAIL   USEDRAW USED %RAW
>> > USED 
>> >ssd   478 TiB 153 TiB 324 TiB  325
>> > TiB 67.96 
>> >TOTAL 478 TiB 153 TiB 324 TiB  325
>> > TiB 67.96 
>> > 
>> > POOLS:
>> >POOLID STORED  OBJECTS USED%USED
>> >
>> >  MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY  USED
>> > COMPR UNDER COMPR 
>> >echo  3 108 TiB  29.49M 324
>> > TiB 81.6124
>> > TiB N/A   N/A 29.49M0
>> > B 0 B
>
>I manually calculated the used percentage to get "avail"; in your case
>it seems to be 73 TiB. That means the total space available for
>your pool would be 397 TiB.
>I'm not sure why that is, but it's what the math behind those
>calculations says.
>(Found a thread regarding that on the new mailing list (ceph-
>us...@ceph.io) -> 
>
>
>https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/thread/NH2LMMX5KVRWCURI3BARRUAETKE2T2QN/#JDHXOQKWF6NZLQMOGEPAQCLI44KB54A3
> )
>
>0.8161 = used (324) / total => total = 397
>
>Then I looked at the remaining calculations:
>
>raw_used_rate *= (sum.num_object_copies - sum.num_objects_degraded) /
>sum.num_object_copies
>
>and
>
>avail_res = avail / raw_used_rate 
>
>First I looked up the initial value of "raw_used_rate" for replicated
>pools. It's their size, so we can put in 3 here, and "avail_res" is
>24.
>
>From that I calculated the final "raw_used_rate", which is 3.042. That
>means that you have around 4.2% degraded PGs in your pool.
>
>> > 
>> > 
>> > I know we're looking at the most full OSD (210PGs, 79% used, 1.17
>> > VAR)
>> > and count max avail from that. But where's the 93% full from in
>> > dashboard?
>
>As said above, the calculation is right but the input data is mixed up:
>"MAX AVAIL" is the real amount of data that can still be put into the
>selected pool, while the used size counts all pool replicas.
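>
>Roughly, with the numbers above (324 TiB raw used, 24 TiB usable max avail,
>~73 TiB raw avail) that gives:
>
>dashboard: 324 / (324 + 24) = 0.931 -> ~93%   (raw used mixed with usable avail)
>ceph df:   324 / (324 + 73) = 0.816 -> ~81.6% (raw used against raw avail)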
>
>I created an issue to fix this https://tracker.ceph.com/issues/43384
>
>> > 
>> > My guess is that is comes from calculating: 
>> > 
>> > 1 - Max Avail / (Used + Max avail) = 0.93
>> > 
>> > 
>> > Kind Regards,
>> > 
>> > David Majchrzak
>> > 
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>Hope I could clarify some things and thanks for your feedback :)
>
>BTW this problem currently still exists as there wasn't any change to
>these mentioned lines after the nautilus release.
>
>Stephan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG inconsistent with error "size_too_large"

2020-01-14 Thread Liam Monahan
Hi,

I am getting one inconsistent object on our cluster with an inconsistency error 
that I haven’t seen before.  This started happening during a rolling upgrade of 
the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s related.

I was hoping to know what the error means before trying a repair.

[root@objmon04 ~]# ceph health detail
HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg 
inconsistent
OSDMAP_FLAGS noout flag(s) set
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 9.20e is active+clean+inconsistent, acting [509,674,659]

rados list-inconsistent-obj 9.20e --format=json-pretty
{
"epoch": 759019,
"inconsistents": [
{
"object": {
"name": 
"2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
"nspace": "",
"locator": "",
"snap": "head",
"version": 692875
},
"errors": [
"size_too_large"
],
"union_shard_errors": [],
"selected_object_info": {
"oid": {
"oid": 
"2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
"key": "",
"snapid": -2,
"hash": 3321413134,
"max": 0,
"pool": 9,
"namespace": ""
},
"version": "281183'692875",
"prior_version": "281183'692874",
"last_reqid": "client.34042469.0:206759091",
"user_version": 692875,
"size": 146097278,
"mtime": "2017-07-03 12:43:35.569986",
"local_mtime": "2017-07-03 12:43:35.571196",
"lost": 0,
"flags": [
"dirty",
"data_digest",
"omap_digest"
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0xf19c8035",
"omap_digest": "0x",
"expected_object_size": 0,
"expected_write_size": 0,
"alloc_hint_flags": 0,
"manifest": {
"type": 0
},
"watchers": {}
},
"shards": [
{
"osd": 509,
"primary": true,
"errors": [],
"size": 146097278
},
{
"osd": 659,
"primary": false,
"errors": [],
"size": 146097278
},
{
"osd": 674,
"primary": false,
"errors": [],
"size": 146097278
}
]
}
]
}

Thanks,
Liam
—
Senior Developer
Institute for Advanced Computer Studies
University of Maryland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] units of metrics

2020-01-14 Thread Robert LeBlanc
On Tue, Jan 14, 2020 at 12:30 AM Stefan Kooman  wrote:

> Quoting Robert LeBlanc (rob...@leblancnet.us):
> > The link that you referenced above is no longer available, do you have a
> > new link? We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all
> > changed, so I'm trying to map the old values to the new values. Might
> > just have to look in the code. :(
>
> I cannot recall that the metrics have ever changed between 12.2.8 and
> 12.2.12. Anyway, whether the right metrics are even there depends on
> which module you use to collect them. See this issue:
> https://tracker.ceph.com/issues/41881


Yes, I agree that the metrics should not change within a major version, but
here is the difference we see. We are using diamond and the CephCollector, but I
verified it with the admin socket by dumping the perf counters manually:

Metrics collected with 12.2.8:
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_client_request
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_server_request
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_request
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_session
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_slave_request
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_link 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookup 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookuphash 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupino 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupname 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupparent 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupsnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lssnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mkdir 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mknod 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mksnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_open 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_readdir 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rename 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_renamesnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmdir 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmsnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmxattr 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setattr 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setdirlayout 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setfilelock 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setlayout 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setxattr 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_symlink 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_unlink 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.cap_revoke_eviction 0
1578955878

Metrics collected with 12.2.12 (much clearer and more descriptive, which is
good):
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_client_request
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_server_request
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_request
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_session
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_slave_request
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.avgcount
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.avgtime
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.sum
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.avgcount
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.avgtime
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.sum
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency.avgcount
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency.avgtime
0 1578955878

Re: [ceph-users] where does 100% RBD utilization come from?

2020-01-14 Thread vitalif

Hi Philip,

I'm not sure if we're talking about the same thing, but I was also 
confused when I didn't see 100% OSD drive utilization during my first 
RBD write benchmark. Since then I've been collecting all my confusion here: 
https://yourcmc.ru/wiki/Ceph_performance :)


100% RBD utilization means that something waits for some I/O ops on this 
device to complete all the time.


This "something" (client software) can't produce more I/O operations 
while it's waiting for previous ones to complete, that's why it can't 
saturate your OSDs and your network.


OSDs can't send more write requests to the drives while they're not done 
with calculating object states on the CPU or while they're busy with 
network I/O. That's why OSDs can't saturate drives.


Simply put: Ceph is slow. Partly because of the network roundtrips (you 
have 3 of them: client -> iscsi -> primary osd -> secondary osds), 
partly because it's just slow.


Of course it's not TERRIBLY slow, so software that can send I/O requests 
in batches (i.e. use async I/O) feels fine. But software that sends I/Os 
one by one (because of transactional requirements or just stupidity like 
Oracle) runs very slowly.
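
You can check this yourself with fio, e.g. something like the following (a 
sketch using the rbd engine; adjust the pool/image names, and be careful, 
randwrite tests overwrite data on the image):

fio --ioengine=rbd --pool=rbd --rbdname=testimg --name=qd1 --rw=randwrite --bs=4k --iodepth=1 --runtime=60 --time_based
fio --ioengine=rbd --pool=rbd --rbdname=testimg --name=qd128 --rw=randwrite --bs=4k --iodepth=128 --runtime=60 --time_based

iodepth=1 shows the per-request latency limit, iodepth=128 shows what the 
cluster can do when the client batches its I/O.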



Also..

"It seems like your RBD can't flush it's I/O fast enough"
implies that there is some particular measure of "fast enough", that
is a tunable value somewhere.
If my network cards arent blocked, and my OSDs arent blocked...
then doesnt that mean that I can and should "turn that knob" up?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Stefan Bauer
Thank you all,



performance is indeed better now. Can now go back to sleep ;)



KR



Stefan



-----Original Message-----
From: Виталий Филиппов 
Sent: Tuesday, 14 January 2020 10:28
To: Wido den Hollander ; Stefan Bauer 
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we 
expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not 
repeat myself. But I have it in the article :-)
--
With best regards,
Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread vitalif
Yes, that's it, see the end of the article. You'll have to disable 
signature checks, too.


cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false
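
In ceph.conf that would be something like the following (a sketch; the rbd 
cache part belongs on the client side):

[global]
cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false

[client]
rbd cache = false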


Hi Vitaliy,

thank you for your time. Do you mean

cephx sign messages = false

with "diable signatures" ?

KR

Stefan


-----ORIGINAL MESSAGE-----
FROM: Виталий Филиппов 
SENT: Tuesday, 14 January 2020 10:28
TO: Wido den Hollander ; Stefan Bauer

CC: ceph-users@lists.ceph.com
SUBJECT: Re: [ceph-users] low io with enterprise SSDs ceph luminous
- can we expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the
email to not repeat myself. But I have it in the article :-)
--
With best regards,
Vitaliy Filippov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] where does 100% RBD utilization come from?

2020-01-14 Thread Philip Brown
Also..

"It seems like your RBD can't flush it's I/O fast enough"
implies that there is some particular measure of "fast enough", that is a 
tunable value somewhere.
If my network cards arent blocked, and my OSDs arent blocked...
then doesnt that mean that I can and should "turn that knob" up?


- Original Message -
From: "Wido den Hollander" 
To: "Philip Brown" , "ceph-users" 
Sent: Tuesday, January 14, 2020 12:42:48 AM
Subject: Re: [ceph-users] where does 100% RBD utilization come from?


The util is calculated based on average waits, see:
https://coderwall.com/p/utc42q/understanding-iostat

Improving performance isn't as simple as turning a knob and watching it happen.
It seems like your RBD can't flush its I/O fast enough and that causes
the iowait to go up.

This can be all kinds of things:

- Network (latency)
- CPU on the OSDs

Wido

> 
> 
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
> 5 Peters Canyon Rd Suite 250 
> Irvine CA 92606 
> Office 714.918.1310| Fax 714.918.1325 
> pbr...@medata.com| www.medata.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] where does 100% RBD utilization come from?

2020-01-14 Thread Philip Brown
The odd thing is:
the network interfaces on the gateways don't seem to be at 100% capacity
and the OSD disks don't seem to be at 100% utilization,
so I'm confused about where this could be getting held up.
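
The only other thing I've found to look at so far is per-OSD latency, e.g.

ceph osd perf

which, as I understand it, shows commit/apply latency per OSD, so a single
slow disk should stand out even when the average utilization looks fine.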




- Original Message -
From: "Wido den Hollander" 
To: "Philip Brown" , "ceph-users" 
Sent: Tuesday, January 14, 2020 12:42:48 AM
Subject: Re: [ceph-users] where does 100% RBD utilization come from?

 

The util is calculated based on average waits, see:
https://coderwall.com/p/utc42q/understanding-iostat

Improving performance isn't as simple as turning a knob and watching it happen.
It seems like your RBD can't flush its I/O fast enough and that causes
the iowait to go up.

This can be all kinds of things:

- Network (latency)
- CPU on the OSDs

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Stefan Bauer
Hi Vitaliy,



thank you for your time. Do you mean



cephx sign messages = false

with "diable signatures" ?



KR

Stefan





-----Original Message-----
From: Виталий Филиппов 
Sent: Tuesday, 14 January 2020 10:28
To: Wido den Hollander ; Stefan Bauer 
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we 
expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not 
repeat myself. But I have it in the article :-)
--
With best regards,
Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Lars Fenneberg
Hi Konstantin!

Quoting Konstantin Shalygin (k0...@k0ste.ru):

> >Is there any recommendation of how many OSDs a single flash device can
> >serve? The optane ones can do 2000MB/s write + 500,000 IOPS.
> 
> Any sizes of db, except 3/30/300 is useless.

I have this from Mattia Belluco in my notes which suggests that twice the
amount is best:

> Following some discussions we had at the past Cephalocon I beg to differ
> on this point: when RocksDB needs to compact a layer it rewrites it
> *before* deleting the old data; if you'd like to be sure your db does not
> spill over to the spindle you should allocate twice the size of the
> biggest layer to allow for compaction. I guess ~60 GB would be the sweet
> spot assuming you don't plan to mess with size and multiplier of the
> rocksDB layers and don't want to go all the way to 600 GB (300 GB x2)

Source is 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/035086.html.

And apart from the RocksDB peculiarities, the actual use case also needs to be
considered. Lots of small files on a CephFS will require more DB space than
mainly big files, as Paul states in the same thread.
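
For reference, my rough understanding of where 3/30/300 comes from: with the
default RocksDB settings (about a 256 MB level base and a 10x size multiplier)
the level sizes are roughly

L1 ~ 0.25 GB
L2 ~ 2.5 GB
L3 ~ 25 GB
L4 ~ 250 GB

and a DB partition only pays off once a whole level (plus the smaller ones and
the WAL) fits on it, hence the ~3/30/300 GB steps. The "twice the biggest
layer" advice then adds headroom because compaction rewrites a level before
the old copy is deleted.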

Cheers,
LF.
-- 
Lars Fenneberg, l...@elemental.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Stefan Bauer
Hi Stefan,



thank you for your time.



"temporary write through" does not seem to be a legit parameter.



However write through is already set:



root@proxmox61:~# echo "temporary write through" > 
/sys/block/sdb/device/scsi_disk/*/cache_type
root@proxmox61:~# cat /sys/block/sdb/device/scsi_disk/2\:0\:0\:0/cache_type
write through



Is that what you meant?



Thank you.



KR



Stefan



-----Original Message-----
From: Stefan Priebe - Profihost AG 
 
This has something to do with the firmware and how the manufacturer
handles syncs / flushes.

Intel just ignores sync / flush commands for drives which have a
capacitor. Samsung does not.

The problem is that Ceph sends a lot of flush commands, which slows down
drives without a capacitor.

You can make Linux ignore those userspace requests with the following
command:

echo "temporary write through" > /sys/block/sdX/device/scsi_disk/*/cache_type
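
or, to cover all SCSI disks at once (note that this does not survive a reboot,
so you may want a udev rule or an rc script for it):

for f in /sys/block/sd*/device/scsi_disk/*/cache_type; do
    echo "temporary write through" > "$f"
done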

Greets,
Stefan Priebe
Profihost AG


> Thank you.
>
>
> Stefan
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs inconsistents because of "size_too_large"

2020-01-14 Thread Massimo Sgaravatto
This is what I see in the OSD.54 log file:

2020-01-14 10:35:04.986 7f0c20dca700 -1 log_channel(cluster) log [ERR] :
13.4 soid
13:20fbec66:::%2fhbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM%2ffile2:head :
size 385888256 > 134217728 is too large
2020-01-14 10:35:08.534 7f0c20dca700 -1 log_channel(cluster) log [ERR] :
13.4 soid
13:25e2d1bd:::%2fhbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM%2ffile8:head :
size 385888256 > 134217728 is too large
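
134217728 bytes is 128 MiB, which looks like the default osd_max_object_size,
so I guess the new scrub check simply flags objects larger than that limit.
If the objects are legitimately that big, I suppose the limit could be raised,
something like (untested):

ceph config get osd osd_max_object_size
ceph config set osd osd_max_object_size 1073741824   # e.g. 1 GiB

but I'd like to understand the check better before changing anything.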

On Tue, Jan 14, 2020 at 11:02 AM Massimo Sgaravatto <
massimo.sgarava...@gmail.com> wrote:

> I have just finished the update of a ceph cluster from Luminous to Nautilus.
> Everything seems to be running, but I keep receiving notifications (about ~10
> so far, involving different PGs and different OSDs) of PGs in an inconsistent
> state.
>
> rados list-inconsistent-obj pg-id --format=json-pretty  (an example is
> attached) says that the problem is "size_too_large".
>
> "ceph pg repair" is able to "fix" the problem, but I am not able to
> understand what is the problem
>
> Thanks, Massimo
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs inconsistents because of "size_too_large"

2020-01-14 Thread Massimo Sgaravatto
I have just finished the update of a ceph cluster from Luminous to Nautilus.
Everything seems to be running, but I keep receiving notifications (about ~10 so
far, involving different PGs and different OSDs) of PGs in an inconsistent
state.

rados list-inconsistent-obj pg-id --format=json-pretty  (an example is
attached) says that the problem is "size_too_large".

"ceph pg repair" is able to "fix" the problem, but I am not able to
understand what is the problem

Thanks, Massimo
{
"epoch": 1966551,
"inconsistents": [
{
"object": {
"name": "/hbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM/file2",
"nspace": "",
"locator": "",
"snap": "head",
"version": 368
},
"errors": [
"size_too_large"
],
"union_shard_errors": [],
"selected_object_info": {
"oid": {
"oid": "/hbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM/file2",
"key": "",
"snapid": -2,
"hash": 1714937604,
"max": 0,
"pool": 13,
"namespace": ""
},
"version": "243582'368",
"prior_version": "243582'367",
"last_reqid": "client.13143063.0:20504",
"user_version": 368,
"size": 385888256,
"mtime": "2017-10-10 14:09:12.098334",
"local_mtime": "2017-10-10 14:10:29.321446",
"lost": 0,
"flags": [
"dirty",
"data_digest",
"omap_digest"
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0x9229f11b",
"omap_digest": "0x",
"expected_object_size": 0,
"expected_write_size": 0,
"alloc_hint_flags": 0,
"manifest": {
"type": 0
},
"watchers": {}
},
"shards": [
{
"osd": 13,
"primary": false,
"errors": [],
"size": 385888256,
"omap_digest": "0x",
"data_digest": "0x9229f11b"
},
{
"osd": 38,
"primary": false,
"errors": [],
"size": 385888256,
"omap_digest": "0x",
"data_digest": "0x9229f11b"
},
{
"osd": 54,
"primary": true,
"errors": [],
"size": 385888256,
"omap_digest": "0x",
"data_digest": "0x9229f11b"
}
]
},
{
"object": {
"name": "/hbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM/file8",
"nspace": "",
"locator": "",
"snap": "head",
"version": 417
},
"errors": [
"size_too_large"
],
"union_shard_errors": [],
"selected_object_info": {
"oid": {
"oid": "/hbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM/file8",
"key": "",
"snapid": -2,
"hash": 3180021668,
"max": 0,
"pool": 13,
"namespace": ""
},
"version": "243596'417",
"prior_version": "243596'416",
"last_reqid": "client.13143063.0:20858",
"user_version": 417,
"size": 385888256,
"mtime": "2017-10-10 14:16:32.814931",
"local_mtime": "2017-10-10 14:17:50.248174",
"lost": 0,
"flags": [
"dirty",
"data_digest",
"omap_digest"
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0x9229f11b",
"omap_digest": "0x",
"expected_object_size": 0,
"expected_write_size": 0,
"alloc_hint_flags": 0,
"manifest": {
"type": 0
},
"watchers": {}
},
"shards": [
{
"osd": 13,
"primary": false,
"errors": [],
"size": 385888256,
"omap_digest": "0x",
"data_digest": "0x9229f11b"
},
{
   

Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Konstantin Shalygin

I'm planning to split the block db onto a separate flash device which I
also would like to use as an OSD for erasure coding metadata for rbd
devices.

If I want to use 14x 14TB HDDs per node,
https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing

recommends a minimum size of 140GB per 14TB HDD.

Is there any recommendation of how many OSDs a single flash device can
serve? The optane ones can do 2000MB/s write + 500,000 IOPS.


Any DB size other than 3/30/300 GB is useless.

How many OSDs per NVMe? As many OSDs as you can afford to lose at once.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Виталий Филиппов
...disable signatures and rbd cache. I didn't mention it in the email to not 
repeat myself. But I have it in the article :-)
-- 
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Xiaoxi Chen
One tricky thing is that each layer of RocksDB is either 100% on SSD or 100% on
HDD, so either you need to tweak the rocksdb configuration or there will be a
huge waste, e.g. a 20GB DB partition makes no difference compared to a 3GB
one (under the default rocksdb configuration).

Janne Johansson  wrote on Tue, 14 Jan 2020 at 16:43:

> (sorry for empty mail just before)
>
>
>>> I'm planning to split the block db onto a separate flash device which I
>>> also would like to use as an OSD for erasure coding metadata for rbd
>>> devices.
>>>
>>> If I want to use 14x 14TB HDDs per node,
>>>
>>> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>>>
>>> recommends a minimum size of 140GB per 14TB HDD.
>>>
>>> Is there any recommendation of how many OSDs a single flash device can
>>> serve? The optane ones can do 2000MB/s write + 500,000 IOPS.
>>>
>>
>>
> I think many ceph admins are more concerned with having many drives
> co-using the same DB drive, since if the DB drive fails, it also means all
> OSDs are lost at the same time.
> Optanes and decent NVMEs are probably capable of handling tons of HDDs, so
> that the bottleneck ends up being somewhere else, but the failure scenarios
> are a bit scary if the whole host is lost just by that one DB device acting
> up.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hardware selection for ceph backup on ceph

2020-01-14 Thread Wido den Hollander


On 1/10/20 5:32 PM, Stefan Priebe - Profihost AG wrote:
> Hi,
> 
> we're currently in the process of building a new ceph cluster to back up rbd 
> images from multiple ceph clusters.
> 
> We would like to start with just a single ceph cluster to back up, which is 
> about 50 TB. The compression ratio of the data is around 30% when using zlib. We 
> need to scale the backup cluster up to 1 PB.
> 
> The workload on the original rbd images is mostly 4K writes so I expect rbd 
> export-diff to do a lot of small writes.
> 
> The current idea is to use the following hw as a start:
> 6 servers with:
> 1x AMD EPYC 7302P 3GHz, 16C/32T
> 128 GB memory
> 14x 12 TB Toshiba Enterprise MG07ACA HDD drives, 4K native
> Dual 25 Gb network
> 

That should be sufficient. The AMD Epyc is a great CPU and you have
enough memory.

> Does it fit? Does anybody have experience with the drives? Can we use EC or do
> we need to use normal replication?
> 

EC will just work, and it will be fast enough; since it's only a backup
system it should work out.

Oh, more servers is always better.
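
With 6 hosts you could for example start with k=4 m=2 and host as the failure
domain. A rough sketch (pool names and pg counts are just placeholders, and it
assumes a small replicated pool, here "rbd", for the image metadata):

ceph osd erasure-code-profile set backup-ec k=4 m=2 crush-failure-domain=host
ceph osd pool create backup-ec-data 1024 1024 erasure backup-ec
ceph osd pool set backup-ec-data allow_ec_overwrites true
rbd create --size 10T --data-pool backup-ec-data rbd/backup-image

Keep in mind that k=4 m=2 on exactly 6 hosts leaves no spare host to rebuild
onto after a host failure, which is one more reason to add servers.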

Wido

> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Janne Johansson
(sorry for empty mail just before)


>> I'm planning to split the block db onto a separate flash device which I
>> also would like to use as an OSD for erasure coding metadata for rbd
>> devices.
>>
>> If I want to use 14x 14TB HDDs per node,
>>
>> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>>
>> recommends a minimum size of 140GB per 14TB HDD.
>>
>> Is there any recommendation of how many OSDs a single flash device can
>> serve? The optane ones can do 2000MB/s write + 500,000 IOPS.
>>
>
>
I think many ceph admins are more concerned with having many drives
co-using the same DB drive, since if the DB drive fails, it also means all
OSDs are lost at the same time.
Optanes and decent NVMEs are probably capable of handling tons of HDDs, so
that the bottleneck ends up being somewhere else, but the failure scenarios
are a bit scary if the whole host is lost just by that one DB device acting
up.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] where does 100% RBD utilization come from?

2020-01-14 Thread Wido den Hollander



On 1/10/20 7:43 PM, Philip Brown wrote:
> Surprisingly, a google search didn't seem to find the answer to this, so I
> guess I should ask here:
> 
> what determines if an rbd is "100% busy"?
> 
> I have some backend OSDs, and an iSCSI gateway, serving out some RBDs.
> 
> iostat on the gateway says rbd is 100% utilized
> 
> iostat on individual OSDs only goes as high as about 60% on a per-device 
> basis.
> CPU is idle.
> Doesnt seem like network interface is capped either.
> 
> So.. how do I improve RBD throughput?
> 

The util is calculated based on average waits, see:
https://coderwall.com/p/utc42q/understanding-iostat
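
Simplified, it boils down to roughly:

%util = (time the device had at least one request in flight) / (elapsed time) * 100

so a serialized I/O stream can keep the RBD at 100% while the OSDs underneath
still have plenty of idle time.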

Improving performance isn't as simple as turning a knob and watching it happen.
It seems like your RBD can't flush its I/O fast enough and that causes
the iowait to go up.

This can be all kinds of things:

- Network (latency)
- CPU on the OSDs

Wido

> 
> 
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
> 5 Peters Canyon Rd Suite 250 
> Irvine CA 92606 
> Office 714.918.1310| Fax 714.918.1325 
> pbr...@medata.com| www.medata.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Janne Johansson
On Mon, 13 Jan 2020 at 08:09, Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> Hello,
>
> I'm planning to split the block db onto a separate flash device which I
> also would like to use as an OSD for erasure coding metadata for rbd
> devices.
>
> If I want to use 14x 14TB HDDs per node,
>
> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>
> recommends a minimum size of 140GB per 14TB HDD.
>
> Is there any recommendation of how many OSDs a single flash device can
> serve? The optane ones can do 2000MB/s write + 500,000 IOPS.
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-14 Thread Wido den Hollander



On 1/13/20 6:37 PM, vita...@yourcmc.ru wrote:
>> Hi,
>>
>> we're playing around with ceph but are not quite happy with the IOs.
>> on average 5000 iops / write
>> on average 13000 iops / read
>>
>> We're expecting more. :( any ideas or is that all we can expect?
> 
> With server SSDs you can expect up to ~1 write / ~25000 read iops for
> a single client.
> 
> https://yourcmc.ru/wiki/Ceph_performance
> 
>> money is NOT a problem for this test-bed; any ideas on how to gain more
>> IOPS are greatly appreciated.
> 
> Grab some server NVMes and best possible CPUs :)

And then:

- Disable all powersaving
- Pin the CPUs in C-State 1
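
For example (one way to do it, assuming the cpupower tool is installed; tuned
profiles or kernel command line options work as well):

cpupower frequency-set -g performance   # no frequency scaling
cpupower idle-set -D 2                  # disable C-states with more than 2us exit latency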

That might increase performance even more. But due to the
synchronous nature of Ceph, the performance and latency of a single
thread will be limited.

Wido

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Stefan Priebe - Profihost AG
Hello,

does anybody have real-life experience with an external block db?

Greets,
Stefan
On 13.01.20 at 08:09, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> I'm planning to split the block db onto a separate flash device which I
> also would like to use as an OSD for erasure coding metadata for rbd
> devices.
> 
> If I want to use 14x 14TB HDDs per node,
> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> 
> recommends a minimum size of 140GB per 14TB HDD.
> 
> Is there any recommendation of how many OSDs a single flash device can
> serve? The optane ones can do 2000MB/s write + 500,000 IOPS.
> 
> Greets,
> Stefan
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] units of metrics

2020-01-14 Thread Stefan Kooman
Quoting Robert LeBlanc (rob...@leblancnet.us):
> The link that you referenced above is no longer available, do you have a
> new link? We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all
> changed, so I'm trying to map the old values to the new values. Might just
> have to look in the code. :(

I cannot recall that the metrics have ever changed between 12.2.8 and
12.2.12. Anyway, whether the right metrics are even there depends on
which module you use to collect them. See this issue:
https://tracker.ceph.com/issues/41881

...

The "(avg)count" metric is needed to perform calculations to obtain
"avgtime" (sum/avgcount).

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com