Re: [ceph-users] RBD: How many snapshots is too many?
On Fri, Sep 8, 2017 at 5:47 PM, Mclean, Patrick wrote: > On a related note, we are very curious why the snapshot id is > incremented when a snapshot is deleted; this creates lots of > phantom entries in the deleted snapshots set. Interleaved > deletions and creations will cause massive fragmentation in > the interval set. The only reason we can come up with for this > is to track if anything changed, but I suspect a different > value that doesn't inject entries into the interval set might > be better for this purpose. Yes, it's because having a sequence number tied in with the snapshots is convenient for doing comparisons. Those aren't leaked snapids that will make holes; when we increment the snapid to delete something we also stick it in the removed_snaps set. (I suppose if you alternate deleting a snapshot with adding one, that does increase the size until you delete those snapshots; hrmmm. Another thing to avoid doing, I guess.) >> It might really just be the osdmap update processing -- that would >> make me happy as it's a much easier problem to resolve. But I'm also >> surprised it's *that* expensive, even at the scales you've described. > That would be nice, but unfortunately all the data is pointing > to PGPool::update(). Yes, that's the OSDMap update processing I referred to. This is good in terms of our ability to remove it without changing client interfaces and things. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
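Greg's parenthetical is easy to demonstrate: trimming a contiguous run of snapids collapses into a single interval, while interleaving creation and deletion leaves holes between the removed snapids and the interval set grows. A toy Python model of an interval set (illustrative only; this is not Ceph's actual interval_set<snapid_t> implementation):

```python
# Toy model of a removed-snapids interval set, to show why interleaving
# snapshot creation and deletion fragments it. Not Ceph code.

def insert(intervals, snapid):
    """Insert snapid into a sorted list of [start, end] inclusive ranges,
    merging adjacent/overlapping ranges afterwards."""
    intervals.append([snapid, snapid])
    intervals.sort()
    merged = [intervals[0]]
    for start, end in intervals[1:]:
        if start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

# Case 1: snapids 1..100 trimmed as one contiguous run -> one interval.
contiguous = []
for snapid in range(1, 101):
    contiguous = insert(contiguous, snapid)
print(len(contiguous))  # 1

# Case 2: creations interleaved with deletions, so only every other
# snapid ends up removed -> one interval per removed snapid.
fragmented = []
for snapid in range(1, 101, 2):
    fragmented = insert(fragmented, snapid)
print(len(fragmented))  # 50
```

The same number of snapshots is trimmed in both cases; only the pattern of holes differs, which is exactly the fragmentation described above.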
Re: [ceph-users] RBD: How many snapshots is too many?
On 2017-09-08 01:36 PM, Gregory Farnum wrote: > On Thu, Sep 7, 2017 at 1:46 PM, Mclean, Patrick > wrote: >> On 2017-09-05 02:41 PM, Gregory Farnum wrote: >>> On Tue, Sep 5, 2017 at 1:44 PM, Florian Haas > wrote: >>> >> Hi everyone, >> >> with the Luminous release out the door >> and the Labor Day weekend >> over, I hope I can kick off a discussion on >> another issue that has >> irked me a bit for quite a while. There >> doesn't seem to be a good >> documented answer to this: what are Ceph's >> real limits when it >> comes to RBD snapshots? >> >> For most people, >> any RBD image will have perhaps a single-digit >> number of snapshots. >> For example, in an OpenStack environment we >> typically have one >> snapshot per Glance image, a few snapshots per >> Cinder volume, and >> perhaps a few snapshots per ephemeral Nova disk >> (unless clones are >> configured to flatten immediately). Ceph >> generally performs well >> under those circumstances. >> >> However, things sometimes start getting >> problematic when RBD >> snapshots are generated frequently, and in an >> automated fashion. >> I've seen Ceph operators configure snapshots on a >> daily or even >> hourly basis, typically when using snapshots as a >> backup strategy >> (where they promise to allow for very short RTO and >> RPO). In >> combination with thousands or maybe tens of thousands of >> RBDs, >> that's a lot of snapshots. And in such scenarios (and only in those), users have been bitten by a few nasty bugs in the past — >> >> here's an example where the OSD snap trim queue went berserk in the >> >> event of lots of snapshots being deleted: >> >> >> http://tracker.ceph.com/issues/9487 >> >> https://www.spinics.net/lists/ceph-devel/msg20470.html >> >> It seems to >> me that there still isn't a good recommendation along >> the lines of >> "try not to have more than X snapshots per RBD image" >> or "try not to >> have more than Y snapshots in the cluster overall". 
>> Or is the >> "correct" recommendation actually "create as many >> snapshots as you >> might possibly want, none of that is allowed to >> create any >> instability nor performance degradation and if it does, >> that's a >> bug"? > > I think we're closer to "as many snapshots as you want", but >> there > are some known shortcomings there. > > First of all, if you haven't >> seen my talk from the last OpenStack > summit on snapshots and you want >> a bunch of details, go watch that. > :p > >> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1 >> >> There are a few dimensions where there can be failures with snapshots: >> >>> 1) right now the way we mark snapshots as deleted is suboptimal — > when >>> deleted they go into an interval_set in the OSDMap. So if you > >> have a bunch of holes in your deleted snapshots, it is possible to > >> inflate the osdmap to a size which causes trouble. But I'm not sure > if >> we've actually seen this be an issue yet — it requires both a > large >> cluster, and a large map, and probably some other failure > causing >> osdmaps to be generated very rapidly. >> In our use case, we are severely hampered by the size of removed_snaps >> (50k+) in the OSDMap, to the point where ~80% of ALL cpu time is spent in >> PGPool::update and its interval calculation code. We have a cluster of >> around 100k RBDs with each RBD having up to 25 snapshots and only a small >> portion of our RBDs mapped at a time (~500-1000). For size / performance >> reasons we try to keep the number of snapshots low (<25) and need to >> prune snapshots. Since in our use case RBDs 'age' at different rates, >> snapshot pruning creates holes to the point where the size of the >> removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph >> clusters. I think in general around 2 snapshot removal operations >> currently happen per minute just because of the volume of snapshots and >> users we have. 
>> >> We found the PGPool::update and the interval calculation code to be >> quite inefficient. Some small changes made it a lot faster, giving more >> breathing room; we shared these and most already got applied: >> https://github.com/ceph/ceph/pull/17088 >> https://github.com/ceph/ceph/pull/17121 >> https://github.com/ceph/ceph/pull/17239 >> https://github.com/ceph/ceph/pull/17265 >> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes) >> >> These patches helped our use case, but overall CPU usage in >> this area is still high (>70% or so), making the Ceph cluster slow and >> causing blocked requests and many operations (e.g. rbd map) to take a >> long time. >> >> We are trying to work around these issues by changing our >> snapshot strategy. In the short term we are manually defragmenting the >> interval set by scanning for holes and trying to delete snapids in >> between holes to coalesce more holes. This is not so nice to do. In some >> cases we employ strategies to 'recreate' old snapshots (as we need to >> keep them) at higher snapids. For our use case a 'snapid rename' feature >> would have been quite helpful. >> >> I hope this shines some light on practical Ceph clusters in which >> performance is bottlenecked not by I/O but by snapshot removal.
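The manual defragmentation Patrick describes can be sketched as follows: walk adjacent removed intervals and collect the short runs of still-live snapids sandwiched between them; trimming such a run merges the two neighbouring holes into one. This is a hedged illustration only (the interval representation and the max_gap threshold are assumptions, not Ceph code), and it only helps where the sandwiched snapshots can actually be deleted, or recreated at higher snapids first as described above:

```python
# Hedged sketch of the "scan for holes, delete snapids between holes"
# defragmentation workaround. Illustrative representation, not Ceph code.

def coalesce_candidates(removed, max_gap):
    """removed: sorted list of (start, end) inclusive removed-snapid ranges.
    Returns runs of live snapids (each no longer than max_gap) whose
    deletion would merge two adjacent removed intervals into one."""
    candidates = []
    for (_s1, e1), (s2, _e2) in zip(removed, removed[1:]):
        gap = list(range(e1 + 1, s2))
        if 0 < len(gap) <= max_gap:
            candidates.append(gap)
    return candidates

# Two small live runs separate three removed intervals; trimming them
# would collapse all three intervals into one.
print(coalesce_candidates([(1, 10), (12, 20), (25, 40)], max_gap=4))
# [[11], [21, 22, 23, 24]]
```

A real tool would also have to check which of the candidate snapids correspond to snapshots that users still need, which is where the 'recreate at higher snapid' trick comes in.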
Re: [ceph-users] RBD: How many snapshots is too many?
On 2017-09-08 01:59 PM, Gregory Farnum wrote: > On Fri, Sep 8, 2017 at 1:45 AM, Florian Haas wrote: >>> In our use case, we are severely hampered by the size of removed_snaps >>> (50k+) in the OSDMap, to the point where ~80% of ALL cpu time is spent in >>> PGPool::update and its interval calculation code. We have a cluster of >>> around 100k RBDs with each RBD having up to 25 snapshots and only a small >>> portion of our RBDs mapped at a time (~500-1000). For size / performance >>> reasons we try to keep the number of snapshots low (<25) and need to >>> prune snapshots. Since in our use case RBDs 'age' at different rates, >>> snapshot pruning creates holes to the point where the size of the >>> removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph >>> clusters. I think in general around 2 snapshot removal operations >>> currently happen per minute just because of the volume of snapshots and >>> users we have. >> Right. Greg, this is what I was getting at: 25 snapshots per RBD is >> firmly in "one snapshot per day per RBD" territory — this is something >> that a cloud operator might do, for example, offering daily snapshots >> going back one month. But it still wrecks the cluster simply by having >> lots of images (even though only a fraction of them, less than 1%, are >> ever in use). That's rather counter-intuitive; it doesn't hit you >> until you have lots of images, and once you're affected by it there's >> no practical way out — where "out" is defined as "restoring overall >> cluster performance to something acceptable". >> >>> We found the PGPool::update and the interval calculation code to be >>> quite inefficient. 
Some small changes made it a lot faster, giving more >>> breathing room; we shared these and most already got applied: >>> https://github.com/ceph/ceph/pull/17088 >>> https://github.com/ceph/ceph/pull/17121 >>> https://github.com/ceph/ceph/pull/17239 >>> https://github.com/ceph/ceph/pull/17265 >>> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes) >>> >>> These patches helped our use case, but overall CPU usage in >>> this area is still high (>70% or so), making the Ceph cluster slow and >>> causing blocked requests and many operations (e.g. rbd map) to take a >>> long time. >> I think this makes this very much a practical issue, not a >> hypothetical/theoretical one. >> >>> We are trying to work around these issues by changing our >>> snapshot strategy. In the short term we are manually defragmenting the >>> interval set by scanning for holes and trying to delete snapids in >>> between holes to coalesce more holes. This is not so nice to do. In some >>> cases we employ strategies to 'recreate' old snapshots (as we need to >>> keep them) at higher snapids. For our use case a 'snapid rename' feature >>> would have been quite helpful. >>> >>> I hope this shines some light on practical Ceph clusters in which >>> performance is bottlenecked not by I/O but by snapshot removal. >> For others following this thread or retrieving it from the list >> archive some time down the road, I'd rephrase that as "bottlenecked >> not by I/O but by CPU utilization associated with snapshot removal". >> Is that fair to say, Patrick? Please correct me if I'm >> misrepresenting. >> >> Greg (or Josh/Jason/Sage/anyone really :) ), can you provide >> additional insight as to how these issues can be worked around or >> mitigated, besides the PRs that Patrick and his colleagues have >> already sent? > Yeah. 
Like I said, we have a proposed solution for this (that we can > probably backport to Luminous stable?), but that's the sort of thing I > haven't heard about before. And the issue is indeed with the raw size > of the removed_snaps member, which will be a problem for cloud > operators of a certain scale. > > Theoretically, I'd expect you could control it if you are careful: > 1) take all snapshots on your RBD images for a single time unit > together, don't intersperse them (ie, don't create daily snapshots > on some images at the same time as hourly snapshots on others) > 2) trim all snapshots from the same time unit on the same schedule > 3) limit the number of live time units you keep around That is basically our long term strategy, but it does involve some re-architecting of our code, which does take some time. > There are obvious downsides to those steps, and it's a problem I look > forward to us resolving soonish. But if you follow those I'd expect > the removed_snaps interval_set to be proportional in size to the > number of live time units you have, rather than the number of RBD > volumes or anything else. > > > > On Wed, Sep 6, 2017 at 8:44 AM, Florian Haas wrote: >> Hi Greg, >> >> thanks for your insight! I do have a few follow-up questions. >> >> On 09/05/2017 11:39 PM, Gregory Farnum wrote: It seems to me that there still isn't a good recommendation along the lines of "try not to have more than X snapshots per RBD image" or "try not to have more than Y snapshots in the cluster overall".
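Greg's three guidelines can be sanity-checked with a small simulation (hypothetical numbers; the bookkeeping below merely stands in for the OSDMap's removed_snaps, it is not Ceph code): if whole time units are snapshotted together and trimmed together, every trimmed batch of snapids is adjacent to the previously trimmed one, so the removed set stays a single interval no matter how many images there are.

```python
# Sketch of the batching guideline: snapshot all images for one "time
# unit" together, trim whole time units together. Illustrative only.

def removed_intervals(removed_ids):
    """Collapse a set of removed snapids into disjoint inclusive ranges."""
    out = []
    for snapid in sorted(removed_ids):
        if out and snapid == out[-1][1] + 1:
            out[-1][1] = snapid
        else:
            out.append([snapid, snapid])
    return out

num_images = 1000          # hypothetical image count
snap_seq = 0               # stand-in for the cluster snapid sequence
batches = []               # snapids created for each live time unit
removed = set()

for _unit in range(30):    # e.g. 30 daily batches
    batch = list(range(snap_seq + 1, snap_seq + 1 + num_images))
    snap_seq += num_images
    batches.append(batch)
    if len(batches) > 7:   # keep 7 live time units, trim the oldest whole
        removed.update(batches.pop(0))

# All trimmed batches are adjacent, so they collapse into one interval:
print(len(removed_intervals(removed)))  # 1
```

Prune per-image on independent schedules instead, and the interval count grows with the number of images, which is the situation Patrick describes.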
Re: [ceph-users] Significant uptick in inconsistent pgs in Jewel 10.2.9
Robin, Would you generate the values and keys for the various versions of at least one of the objects? .dir.default.292886573.13181.12 is a good example because there are 3 variations for the same object. If there isn't much activity to .dir.default.64449186.344176, you could do one osd at a time. Otherwise, stop all 3 OSDs (1322, 990, 655) and execute these for all 3. I suspect you'll need to pipe to "od -cx" to get printable output. I created a simple object with an ascii omap:

$ ceph-objectstore-tool --data-path ... --pgid 5.3d40 .dir.default.64449186.344176 get-omaphdr
obj_header
$ for i in $(ceph-objectstore-tool --data-path ... --pgid 5.3d40 .dir.default.64449186.344176 list-omap); do
    echo -n "${i}: "
    ceph-objectstore-tool --data-path ... --pgid 5.3d40 .dir.default.64449186.344176 get-omap $i
  done
key1: val1
key2: val2
key3: val3

David

On 9/8/17 12:18 PM, David Zafman wrote: Robin, The only two changesets I can spot in Jewel that I think might be related are these: 1. http://tracker.ceph.com/issues/20089 https://github.com/ceph/ceph/pull/15416 This should improve the repair functionality. 2. http://tracker.ceph.com/issues/19404 https://github.com/ceph/ceph/pull/14204 This pull request fixes an issue that corrupted omaps. It also finds and repairs them. However, the repair process might resurrect deleted omaps, which would show up as an omap digest error. This could temporarily cause additional inconsistent PGs. So if this has NOT been occurring for longer than your deep-scrub interval since upgrading, I'd repair the pgs and monitor going forward to make sure the problem doesn't recur. --- You have good examples of repair scenarios: .dir.default.292886573.13181.12 only has an omap_digest_mismatch and no shard errors. The automatic repair won't be sure which is a good copy. In this case we can see that osd 1327 doesn't match the other two. To assist the repair process in repairing the right one:
Remove the copy on osd.1327: stop osd 1327 and use "ceph-objectstore-tool --data-path .1327 .dir.default.292886573.13181.12 remove". .dir.default.64449186.344176 has selected_object_info with "od 337cf025", so shards have "omap_digest_mismatch_oi" except for osd 990. The pg repair code will use osd.990 to fix the other 2 copies without further handling. David On 9/8/17 11:16 AM, Robin H. Johnson wrote: On Thu, Sep 07, 2017 at 08:24:04PM +, Robin H. Johnson wrote: pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655] pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91] Here is the output of 'rados list-inconsistent-obj' for the PGs: $ sudo rados list-inconsistent-obj 5.f1c0 |json_pp -json_opt canonical,pretty { "epoch" : 1221254, "inconsistents" : [ { "errors" : [ "omap_digest_mismatch" ], "object" : { "locator" : "", "name" : ".dir.default.292886573.13181.12", "nspace" : "", "snap" : "head", "version" : 483490 }, "selected_object_info" : "5:038f1cff:::.dir.default.292886573.13181.12:head(1221843'483490 client.417313345.0:19515832 dirty|omap|data_digest s 0 uv 483490 dd alloc_hint [0 0])", "shards" : [ { "data_digest" : "0x", "errors" : [], "omap_digest" : "0x928b0c0b", "osd" : 91, "size" : 0 }, { "data_digest" : "0x", "errors" : [], "omap_digest" : "0x928b0c0b", "osd" : 631, "size" : 0 }, { "data_digest" : "0x", "errors" : [], "omap_digest" : "0x6556c868", "osd" : 1327, "size" : 0 } ], "union_shard_errors" : [] } ] } $ sudo rados list-inconsistent-obj 5.3d40 |json_pp -json_opt canonical,pretty { "epoch" : 1210895, "inconsistents" : [ { "errors" : [ "omap_digest_mismatch" ], "object" : { "locator" : "", "name" : ".dir.default.64449186.344176", "nspace" : "", "snap" : "head", "version" : 1177199 }, "selected_object_info" : "5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 osd.1322.0:537914 dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd od 337cf025 alloc_hint [0 0])", "shards" : [ { "data_digest" : "0x", "errors" : [ 
"omap_digest_mismatch_oi" ], "omap_digest" : "0x3242b04e", "osd" : 655, "size" : 0 }, { "data_digest" : "0x", "errors" : [], "omap_digest" : "0x337cf025", "osd" : 990, "size" : 0 }, { "data_digest" : "0x", "errors" : [ "omap_digest_mismatch_oi" ], "omap_digest" : "0xc90d06a8", "osd" : 1322, "size" : 0 } ], "union_shard_errors" : [ "omap_digest_mismatch_oi" ] } ] }
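For the first scenario (omap_digest_mismatch with no shard errors), the "odd one out" that David identifies by eye can be found mechanically by majority vote over the shards' omap_digest values. A hedged Python sketch using the field names from the `rados list-inconsistent-obj` JSON shown above; it is only meaningful with three or more replicas and a clear majority:

```python
from collections import Counter

# Given one "inconsistents" entry from `rados list-inconsistent-obj`,
# return the OSD(s) whose omap_digest disagrees with the majority,
# or None when there is no mismatch or no clear majority.

def odd_shard_out(inconsistent):
    digests = Counter(s["omap_digest"] for s in inconsistent["shards"])
    if len(digests) < 2:
        return None                    # all shards agree
    majority, count = digests.most_common(1)[0]
    if count <= len(inconsistent["shards"]) // 2:
        return None                    # no clear majority; don't guess
    return [s["osd"] for s in inconsistent["shards"]
            if s["omap_digest"] != majority]

# The 5.f1c0 case from this thread (fields trimmed to what's needed):
entry = {
    "object": {"name": ".dir.default.292886573.13181.12"},
    "shards": [
        {"osd": 91,   "omap_digest": "0x928b0c0b"},
        {"osd": 631,  "omap_digest": "0x928b0c0b"},
        {"osd": 1327, "omap_digest": "0x6556c868"},
    ],
}
print(odd_shard_out(entry))  # [1327]
```

This only automates the identification step; the actual removal of the bad copy is still the manual ceph-objectstore-tool procedure David describes.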
Re: [ceph-users] OSDs flapping on ordinary scrub with cluster being static (after upgrade to 12.1.1)
Hi there, Somebody told me that I was essentially running a pre-release version and should upgrade to 12.2 and come back if the problem persists. Today the 12.2 upgrade became available and I installed it … and I still have the problem:
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: *** Caught signal (Segmentation fault) **
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: in thread 7f5c883f0700 thread_name:tp_osd_tp
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: ceph version 12.2.0 (36f6c5ea099d43087ff0276121fd34e71668ae0e) luminous (rc)
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 1: (()+0xa07bb4) [0x55e157f22bb4]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 2: (()+0x110c0) [0x7f5ca94030c0]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 3: (()+0x1ff2f) [0x7f5caba05f2f]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 4: (rocksdb::BlockBasedTable::NewIndexIterator(rocksdb::ReadOptions const&, rocksdb::BlockIter*, rocksdb::BlockBasedTable::CachableEntry*)+0x4e6) [0x55e158306bb6]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 5: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x283) [0x55e158307963]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 6: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x13a) [0x55e1583e718a]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 7: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*, rocksdb::MergeContext*, rocksdb::RangeDelAggregator*, bool*, bool*, unsigned long*)+0x3f8) [0x55e1582c8c28]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 8: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, bool*)+0x552) [0x55e15838d682]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 9: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&,
rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)+0x13) [0x55e15838dab3]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 10: (rocksdb::DB::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, std::__cxx11::basic_string, std::allocator >*)+0xc1) [0x55e157e6bb51]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 11: (RocksDBStore::get(std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, ceph::buffer::list*)+0x1bb) [0x55e157e6308b]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 12: (()+0x885d71) [0x55e157da0d71]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 13: (()+0x885675) [0x55e157da0675]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 14: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned int)+0x5d7) [0x55e157de48c7]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 15: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr, unsigned long, std::set, std::allocator >*)+0x118) [0x55e157e05cd8]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 16: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr)+0xc5) [0x55e157e06755]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 17: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr&)+0x7b) [0x55e157e0807b]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 18: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1f55) [0x55e157e1ec15]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 19: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x536) [0x55e157e1f916]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 20: (PrimaryLogPG::queue_transactions(std::vector >&, boost::intrusive_ptr)+0x66) [0x55e157b437f6]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 21: (ReplicatedBackend::do_repop(boost::intrusive_ptr)+0xbdc) [0x55e157c70d6c]
Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 22:
(ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x2b7) [0x55e157c73b47] Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 23: (PGBackend::handle_message(boost::intrusive_ptr)+0x50) [0x55e157b810d0] Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 24: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x4e3) [0x55e157ae6a83] Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 25: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3ab) [0x55e15796b19b] Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 26: (PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x5a) [0x55e157c0354a] Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 27: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x103d) [0x55e157991d9d] Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 28: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef) [0x55e157f6f20f] Sep 08 18:48:05 proxmox1 ceph-osd[3954]: 29: (Sharded
Re: [ceph-users] Significant uptick in inconsistent pgs in Jewel 10.2.9
On Thu, Sep 07, 2017 at 08:24:04PM +, Robin H. Johnson wrote:
> pg 5.3d40 is active+clean+inconsistent, acting [1322,990,655]
> pg 5.f1c0 is active+clean+inconsistent, acting [631,1327,91]

Here is the output of 'rados list-inconsistent-obj' for the PGs:

$ sudo rados list-inconsistent-obj 5.f1c0 | json_pp -json_opt canonical,pretty
{
   "epoch" : 1221254,
   "inconsistents" : [
      {
         "errors" : [ "omap_digest_mismatch" ],
         "object" : {
            "locator" : "",
            "name" : ".dir.default.292886573.13181.12",
            "nspace" : "",
            "snap" : "head",
            "version" : 483490
         },
         "selected_object_info" : "5:038f1cff:::.dir.default.292886573.13181.12:head(1221843'483490 client.417313345.0:19515832 dirty|omap|data_digest s 0 uv 483490 dd alloc_hint [0 0])",
         "shards" : [
            { "data_digest" : "0x", "errors" : [], "omap_digest" : "0x928b0c0b", "osd" : 91, "size" : 0 },
            { "data_digest" : "0x", "errors" : [], "omap_digest" : "0x928b0c0b", "osd" : 631, "size" : 0 },
            { "data_digest" : "0x", "errors" : [], "omap_digest" : "0x6556c868", "osd" : 1327, "size" : 0 }
         ],
         "union_shard_errors" : []
      }
   ]
}

$ sudo rados list-inconsistent-obj 5.3d40 | json_pp -json_opt canonical,pretty
{
   "epoch" : 1210895,
   "inconsistents" : [
      {
         "errors" : [ "omap_digest_mismatch" ],
         "object" : {
            "locator" : "",
            "name" : ".dir.default.64449186.344176",
            "nspace" : "",
            "snap" : "head",
            "version" : 1177199
         },
         "selected_object_info" : "5:02bc4def:::.dir.default.64449186.344176:head(1177700'1180639 osd.1322.0:537914 dirty|omap|data_digest|omap_digest s 0 uv 1177199 dd od 337cf025 alloc_hint [0 0])",
         "shards" : [
            { "data_digest" : "0x", "errors" : [ "omap_digest_mismatch_oi" ], "omap_digest" : "0x3242b04e", "osd" : 655, "size" : 0 },
            { "data_digest" : "0x", "errors" : [], "omap_digest" : "0x337cf025", "osd" : 990, "size" : 0 },
            { "data_digest" : "0x", "errors" : [ "omap_digest_mismatch_oi" ], "omap_digest" : "0xc90d06a8", "osd" : 1322, "size" : 0 }
         ],
         "union_shard_errors" : [ "omap_digest_mismatch_oi" ]
      }
   ]
}

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Asst. Treasurer
E-Mail : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
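In reports like the two above, the shard whose omap_digest disagrees with the majority is the likely bad copy (osd 1327 for the first PG). A minimal sketch, not an official Ceph tool, that takes `rados list-inconsistent-obj` JSON and lists the minority shards per object:

```python
import json
from collections import Counter

def suspect_shards(report_json):
    """Given `rados list-inconsistent-obj <pgid>` JSON output, return
    {object name: [osd ids holding the minority omap_digest]}."""
    report = json.loads(report_json)
    result = {}
    for inc in report.get("inconsistents", []):
        counts = Counter(s["omap_digest"] for s in inc["shards"])
        majority = counts.most_common(1)[0][0]  # digest held by most shards
        result[inc["object"]["name"]] = [
            s["osd"] for s in inc["shards"] if s["omap_digest"] != majority
        ]
    return result

# Trimmed-down version of the 5.f1c0 report shown above:
demo = json.dumps({
    "epoch": 1221254,
    "inconsistents": [{
        "object": {"name": ".dir.default.292886573.13181.12"},
        "shards": [
            {"osd": 91,   "omap_digest": "0x928b0c0b"},
            {"osd": 631,  "omap_digest": "0x928b0c0b"},
            {"osd": 1327, "omap_digest": "0x6556c868"},
        ],
    }],
})
print(suspect_shards(demo))  # {'.dir.default.292886573.13181.12': [1327]}
```

With only three replicas a 2-vs-1 vote is weak evidence, so treat this as a triage aid, not a repair decision; `ceph pg repair` applies its own selection logic.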
Re: [ceph-users] [Ceph-maintainers] Ceph release cadence
I think I'm the resident train release advocate so I'm sure my advocating that model will surprise nobody. I'm not sure I'd go all the way to Lars' multi-release maintenance model (although it's definitely something I'm interested in), but there are two big reasons I wish we were on a train with more frequent real releases: 1) It reduces the cost of features missing a release. Right now if something misses an LTS release, that's it for a year. And nobody likes releasing an LTS without a bunch of big new features, so each LTS is later than the one before as we scramble to get features merged in. ...and then we deal with the fact that we scrambled to get a bunch of features merged in and they weren't quite baked. (Luminous so far seems to have gone much better in this regard! Hurray! But I think that has a lot to do with our feature-release-scramble this year being mostly peripheral stuff around user interfaces that got tacked on about the time we'd initially planned the release to occur.) 2) Train releases increase predictability for downstreams, partners, and users around when releases will happen. Right now, the release process and schedule is entirely opaque to anybody who's not involved in every single upstream meeting we have; and it's unpredictable even to those who are. That makes things difficult, as Xiaoxi said. There are other peripheral but serious benefits I'd expect to see from fully-validated train releases as well. It would be *awesome* to have more frequent known-stable points to do new development against. If you're an external developer and you want a new feature, you have to either keep it rebased against a fast-changing master branch, or you need to settle for writing it against a long-out-of-date LTS and then forward-porting it for merge. 
If you're an FS developer writing a very small new OSD feature and you try to validate it against RADOS, you've no idea if bugs that pop up and look random are because you really did something wrong or if there's currently an intermittent issue in RADOS master. I would have *loved* to be able to maintain CephFS integration branches for features that didn't touch RADOS and were built on top of the latest release instead of master, but it was utterly infeasible because there were too many missing features with the long delays. On Fri, Sep 8, 2017 at 9:16 AM, Sage Weil wrote: > I'm going to pick on Lars a bit here... > > On Thu, 7 Sep 2017, Lars Marowsky-Bree wrote: >> On 2017-09-06T15:23:34, Sage Weil wrote: >> > Other options we should consider? Other thoughts? >> >> With about 20-odd years in software development, I've become a big >> believer in schedule-driven releases. If it's feature-based, you never >> know when they'll get done. >> >> If the schedule intervals are too long though, the urge to press too >> much in (so as not to miss the next merge window) is just too high, >> meaning the train gets derailed. (Which cascades into the future, >> because the next time the pressure will be even higher based on the >> previous experience.) This requires strictness. >> >> We've had a few Linux kernel releases that were effectively feature >> driven and never quite made it. 1.3.x? 1.5.x? My memory is bad, but they >> were a disaster than eventually led Linus to evolve to the current >> model. >> >> That serves them really well, and I believe it might be worth >> considering for us. > > This model is very appealing. The problem with it that I see is that the > upstream kernel community doesn't really do stable releases. Mainline > developers are just getting their stuff upstream, and entire separate > organizations and teams are doing the stable distro kernels. 
(There are > upstream stable kernels too, yes, but they don't get much testing AFAICS > and I'm not sure who uses them.) > > More importantly, upgrade and on-disk format issues are present for almost > everything that we change in Ceph. Those things rarely come up for the > kernel. Even the local file systems (a small piece of the kernel) have > comparatively fewer format changes that we do, it seems. > > These make the upgrade testing a huge concern and burden for the > Ceph development community. > >> I'd try to move away from the major milestones. Features get integrated >> into the next schedule-driven release when they deemed ready and stable; >> when they're not, not a big deal, the next one is coming up "soonish". >> >> (This effectively decouples feature development slightly from the >> release schedule.) >> >> We could even go for "a release every 3 months, sharp", merge window for >> the first month, stabilization the second, release clean up the third, >> ship. >> >> Interoperability hacks for the cluster/server side are maintained for 2 >> years, and then dropped. Sharp. (Speaking as one of those folks >> affected, we should not burden the community with this.) Client interop >> is a different story, a bit. >> >> Basically, effectively edging towards continuous integration
Re: [ceph-users] [Ceph-maintainers] Ceph release cadence
I'm going to pick on Lars a bit here... On Thu, 7 Sep 2017, Lars Marowsky-Bree wrote: > On 2017-09-06T15:23:34, Sage Weil wrote: > > Other options we should consider? Other thoughts? > > With about 20-odd years in software development, I've become a big > believer in schedule-driven releases. If it's feature-based, you never > know when they'll get done. > > If the schedule intervals are too long though, the urge to press too > much in (so as not to miss the next merge window) is just too high, > meaning the train gets derailed. (Which cascades into the future, > because the next time the pressure will be even higher based on the > previous experience.) This requires strictness. > > We've had a few Linux kernel releases that were effectively feature > driven and never quite made it. 1.3.x? 1.5.x? My memory is bad, but they > were a disaster than eventually led Linus to evolve to the current > model. > > That serves them really well, and I believe it might be worth > considering for us. This model is very appealing. The problem with it that I see is that the upstream kernel community doesn't really do stable releases. Mainline developers are just getting their stuff upstream, and entire separate organizations and teams are doing the stable distro kernels. (There are upstream stable kernels too, yes, but they don't get much testing AFAICS and I'm not sure who uses them.) More importantly, upgrade and on-disk format issues are present for almost everything that we change in Ceph. Those things rarely come up for the kernel. Even the local file systems (a small piece of the kernel) have comparatively fewer format changes that we do, it seems. These make the upgrade testing a huge concern and burden for the Ceph development community. > I'd try to move away from the major milestones. Features get integrated > into the next schedule-driven release when they deemed ready and stable; > when they're not, not a big deal, the next one is coming up "soonish". 
> > (This effectively decouples feature development slightly from the > release schedule.) > > We could even go for "a release every 3 months, sharp", merge window for > the first month, stabilization the second, release clean up the third, > ship. > > Interoperability hacks for the cluster/server side are maintained for 2 > years, and then dropped. Sharp. (Speaking as one of those folks > affected, we should not burden the community with this.) Client interop > is a different story, a bit. > > Basically, effectively edging towards continuous integration of features > and bugfixes both. Nobody has to wait for anything much, and can > schedule reasonably independently.

If I read between the lines a bit here, this sounds like:
- keep the frequent major releases (but possibly shorten the 6mo cadence)
- do backports for all of them, not just the even ones
- test upgrades between all of them within a 2 year horizon, instead of just the last major one

Is that accurate? Unfortunately it sounds to me like that would significantly increase the maintenance burden (double it even?) and slow development down. The user base will also end up fragmented across a broader range of versions, which means we'll see a wider variety of bugs and each release will be less stable. This is full of trade-offs... time we spend backporting or testing upgrades is time we don't spend fixing bugs or improving performance or adding features. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph release cadence
Personally, I kind of like the current format; fundamentally we are talking about data storage, which should be the most tested and scrutinized piece of software on your computer. I would rather get feature XYZ later than sooner, compared to "oh, I lost all my data". I am thinking of a recent FS that had a feature they shouldn't have released. I appreciate the extra time it takes to release to make it resilient. Having an LTS version to rely on provides good assurance that the upgrade process will be thoroughly tested. Having a version for more experimental features keeps the new features at bay; it basically follows the Ubuntu model. I feel there were a lot of underpinning features in Luminous that checked a lot of the boxes you have been wanting for a while. One thing to consider: possibly a lot of the core features become more incremental. I guess from my use case Ceph actually does everything I need it to do at the moment. Yes, new features and better processes make it better, but more or less I am pretty content. Maybe I am a small minority in this logic.

On Fri, Sep 8, 2017 at 2:20 AM Matthew Vernon wrote: > Hi, > > On 06/09/17 16:23, Sage Weil wrote: > > > Traditionally, we have done a major named "stable" release twice a year, > > and every other such release has been an "LTS" release, with fixes > > backported for 1-2 years. > > We use the ceph version that comes with our distribution (Ubuntu LTS); > those come out every 2 years (though we won't move to a brand-new > distribution until we've done some testing!). So from my POV, LTS ceph > releases that come out such that adjacent ceph LTSs fit neatly into > adjacent Ubuntu LTSs is the ideal outcome. We're unlikely to ever try > putting a non-LTS ceph version into production. 
> > I hope this isn't an unusual requirement :) > > Matthew > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Client features by IP?
On 09/07/2017 01:26 PM, Josh Durgin wrote: > On 09/07/2017 11:31 AM, Bryan Stillwell wrote: >> On 09/07/2017 10:47 AM, Josh Durgin wrote: >>> On 09/06/2017 04:36 PM, Bryan Stillwell wrote: I was reading this post by Josh Durgin today and was pretty happy to see we can get a summary of features that clients are using with the 'ceph features' command: http://ceph.com/community/new-luminous-upgrade-complete/ However, I haven't found an option to display the IP addresses of those clients with the older feature sets. Is there a flag I can pass to 'ceph features' to list the IPs associated with each feature set? >>> >>> There is not currently, we should add that - it'll be easy to backport >>> to luminous too. The only place both features and IP are shown is in >>> 'debug mon = 10' logs right now. >> >> I think that would be great! The first thing I would want to do after >> seeing an old client listed would be to find it and upgrade it. Having >> the IP of the client would make that a ton easier! > > Yup, should've included that in the first place! > >> Anything I could do to help make that happen? File a feature request >> maybe? > > Sure, adding a short tracker.ceph.com ticket would help, that way we can > track the backport easily too. Ticket created: http://tracker.ceph.com/issues/21315 Thanks Josh! Bryan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] output discards (queue drops) on switchport
Sorry, I didn't see that you use Proxmox 5. As I'm a Proxmox contributor, I can tell you that I have errors with kernel 4.10 (which is the Ubuntu kernel). If you don't use ZFS, try kernel 4.12 from stretch-backports, or kernel 4.4 from Proxmox 4 (with ZFS support). Tell me if it works better for you. (I'm currently trying to backport the latest mlx5 patches from kernel 4.12 to kernel 4.10, to see if that helps.) I have opened a thread on the pve-devel mailing list today. - Original message - From: "Alexandre Derumier" To: "Burkhard Linke" Cc: "ceph-users" Sent: Friday 8 September 2017 17:27:49 Subject: Re: [ceph-users] output discards (queue drops) on switchport Hi, >> public network Mellanox ConnectX-4 Lx dual-port 25 GBit/s which kernel/distro do you use ? I have same card, and I had problem with centos7 kernel 3.10 recently, with packet drop i have also problems with ubuntu kernel 4.10 and lacp kernel 4.4 or 4.12 are working fine for me. - Original message - From: "Burkhard Linke" To: "ceph-users" Sent: Friday 8 September 2017 16:25:31 Subject: Re: [ceph-users] output discards (queue drops) on switchport Hi, On 09/08/2017 04:13 PM, Andreas Herrmann wrote: > Hi, > > On 08.09.2017 15:59, Burkhard Linke wrote: >> On 09/08/2017 02:12 PM, Marc Roos wrote: >>> >>> Afaik ceph is is not supporting/working with bonding. >>> >>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html >>> (thread: Maybe some tuning for bonded network adapters) >> CEPH works well with LACP bonds. The problem described in that thread is the >> fact that LACP is not using links in a round robin fashion, but distributes >> network stream depending on a hash of certain parameters like source and >> destination IP address. This is already set to layer3+4 policy by the OP. >> >> Regarding the drops (and without any experience with neither 25GBit ethernet >> nor the Arista switches): >> Do you have corresponding input drops on the server's network ports? 
> No input drops, just output drop Output drops on the switch are related to input drops on the server side. If the link uses flow control and the server signals the switch that its internal buffer are full, the switch has to drop further packages if the port buffer is also filled. If there's no flow control, and the network card is not able to store the packet (full buffers...), it should be noted as overrun in the interface statistics (and if this is not correct, please correct me, I'm not a network guy). > >> Did you tune the network settings on server side for high throughput, e.g. >> net.ipv4.tcp_rmem, wmem, ...? > sysctl tuning is disabled at the moment. I tried sysctl examples from > https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is > still > the same amount of output drops. > >> And are the CPUs fast enough to handle the network traffic? > Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's > my first Ceph cluster. The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid controller and 8 ssd based osds with it. You can use tools like atop or ntop to watch certain aspects of the system during the tests (network, cpu, disk). Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
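To put a number on drops like the ones discussed above, the server-side counters in `/proc/net/dev` can be sampled and compared. A small sketch, assuming the standard Linux procfs field order (rx: bytes packets errs drop ..., then tx: bytes packets errs drop ...):

```python
def drop_rate(procnetdev, iface):
    """Return (rx_drop_pct, tx_drop_pct) for one interface from a
    /proc/net/dev snapshot. Field order assumed per the Linux procfs
    layout: rx bytes/packets/errs/drop..., then tx bytes/packets/errs/drop..."""
    for line in procnetdev.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        name, counters = line.split(":", 1)
        if name.strip() != iface:
            continue
        f = [int(x) for x in counters.split()]
        rx_pkts, rx_drop = f[1], f[3]
        tx_pkts, tx_drop = f[9], f[11]
        def pct(drop, pkts):
            return 100.0 * drop / (pkts + drop) if (pkts + drop) else 0.0
        return pct(rx_drop, rx_pkts), pct(tx_drop, tx_pkts)
    raise ValueError("interface %r not found" % iface)

# Fabricated snapshot: 990 packets sent with 10 tx drops -> 1% tx drop rate.
sample = "ens1f0: 1000 2000 0 0 0 0 0 0 5000 990 0 10 0 0 0 0"
print(drop_rate(sample, "ens1f0"))  # (0.0, 1.0)
```

Sampling twice and differencing the counters gives the drop rate over an interval rather than since boot, which is what matters during a benchmark run.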
Re: [ceph-users] output discards (queue drops) on switchport
Hi, >> public network Mellanox ConnectX-4 Lx dual-port 25 GBit/s Which kernel/distro do you use? I have the same card, and I recently had problems with the CentOS 7 kernel 3.10, with packet drops. I also have problems with the Ubuntu kernel 4.10 and LACP; kernels 4.4 and 4.12 are working fine for me. - Original message - From: "Burkhard Linke" To: "ceph-users" Sent: Friday 8 September 2017 16:25:31 Subject: Re: [ceph-users] output discards (queue drops) on switchport Hi, On 09/08/2017 04:13 PM, Andreas Herrmann wrote: > Hi, > > On 08.09.2017 15:59, Burkhard Linke wrote: >> On 09/08/2017 02:12 PM, Marc Roos wrote: >>> >>> Afaik ceph is is not supporting/working with bonding. >>> >>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html >>> (thread: Maybe some tuning for bonded network adapters) >> CEPH works well with LACP bonds. The problem described in that thread is the >> fact that LACP is not using links in a round robin fashion, but distributes >> network stream depending on a hash of certain parameters like source and >> destination IP address. This is already set to layer3+4 policy by the OP. >> >> Regarding the drops (and without any experience with neither 25GBit ethernet >> nor the Arista switches): >> Do you have corresponding input drops on the server's network ports? > No input drops, just output drop Output drops on the switch are related to input drops on the server side. If the link uses flow control and the server signals the switch that its internal buffer are full, the switch has to drop further packages if the port buffer is also filled. If there's no flow control, and the network card is not able to store the packet (full buffers...), it should be noted as overrun in the interface statistics (and if this is not correct, please correct me, I'm not a network guy). > >> Did you tune the network settings on server side for high throughput, e.g. >> net.ipv4.tcp_rmem, wmem, ...? > sysctl tuning is disabled at the moment. 
I tried sysctl examples from > https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is > still > the same amount of output drops. > >> And are the CPUs fast enough to handle the network traffic? > Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's > my first Ceph cluster. The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid controller and 8 ssd based osds with it. You can use tools like atop or ntop to watch certain aspects of the system during the tests (network, cpu, disk). Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] radosgw crashing after buffer overflows detected
For about a week we've been seeing a decent number of buffer overflows detected across all our RGW nodes in one of our clusters. This started happening a day after we started weighing in some new OSD nodes, so we're thinking it's probably related to that. Could someone help us determine the root cause of this? Cluster details: Distro: CentOS 7.2 Release: 0.94.10-0.el7.x86_64 OSDs: 1120 RGW nodes: 10 See log messages below. If you know how to improve the call trace below I would like to hear that too. I tried installing the ceph-debuginfo-0.94.10-0.el7.x86_64 package, but that didn't seem to help. Thanks, Bryan # From /var/log/messages: Sep 7 20:06:11 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 7 21:01:55 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 7 21:37:00 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 7 23:14:54 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 7 23:17:08 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 8 00:12:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 8 07:04:07 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 8 07:17:49 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 8 07:41:39 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated Sep 8 07:59:29 p3cephrgw003 radosgw: *** buffer overflow detected ***: /bin/radosgw terminated # From /var/log/ceph/client.radosgw.p3cephrgw003.log: 0> 2017-09-08 07:59:29.696615 7f7b296a2700 -1 *** Caught signal (Aborted) ** in thread 7f7b296a2700 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af) 1: /bin/radosgw() [0x6d3d92] 2: (()+0xf100) [0x7f7f425e9100] 3: (gsignal()+0x37) [0x7f7f4141d5f7] 4: (abort()+0x148) [0x7f7f4141ece8] 5: (()+0x75317) [0x7f7f4145d317] 6: 
(__fortify_fail()+0x37) [0x7f7f414f5ac7] 7: (()+0x10bc80) [0x7f7f414f3c80] 8: (()+0x10da37) [0x7f7f414f5a37] 9: (OS_Accept()+0xc1) [0x7f7f435bd8b1] 10: (FCGX_Accept_r()+0x9c) [0x7f7f435bb91c] 11: (RGWFCGXProcess::run()+0x7bf) [0x58136f] 12: (RGWProcessControlThread::entry()+0xe) [0x5821fe] 13: (()+0x7dc5) [0x7f7f425e1dc5] 14: (clone()+0x6d) [0x7f7f414de21d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
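On improving a trace like the one above: with the matching debuginfo installed, the bracketed return addresses can be extracted and fed to `addr2line` to recover file and line information. A rough helper sketch; the regex only assumes the `[0x...]` frame suffix shown in the log:

```python
import re

# Matches the trailing "[0x...]" return address of a Ceph backtrace frame.
FRAME_RE = re.compile(r"\[(0x[0-9a-fA-F]+)\]\s*$")

def addresses(backtrace):
    """Pull the bracketed return addresses out of a Ceph-style backtrace."""
    out = []
    for line in backtrace.splitlines():
        m = FRAME_RE.search(line)
        if m:
            out.append(m.group(1))
    return out

# Frames copied from the log above:
bt = """\
 1: /bin/radosgw() [0x6d3d92]
 2: (()+0xf100) [0x7f7f425e9100]
 6: (__fortify_fail()+0x37) [0x7f7f414f5ac7]"""
print(addresses(bt))  # ['0x6d3d92', '0x7f7f425e9100', '0x7f7f414f5ac7']
# Addresses in the main binary can then be symbolized, e.g.:
#   addr2line -Cfie /bin/radosgw 0x6d3d92
# (shared-library frames need the library's own path and its load offset)
```

Note that frames inside libc or fcgi (as in this trace) resolve against those libraries, not `/bin/radosgw`, so only the low, non-library addresses are useful with the radosgw debuginfo alone.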
Re: [ceph-users] [PVE-User] OSD won't start, even created ??
Hi, any help would be really useful. Does anyone have a clue about my issue? Thanks in advance. Best regards, On 05/09/2017 at 20:25, Phil Schwarz wrote: > Hi, > I come back with same issue as seen in previous thread ( link given) > > trying to a 2TB SATA as OSD: > Using proxmox GUI or CLI (command given) give the same (bad) result. > > Didn't want to use a direct 'ceph osd create', thus bypassing pxmfs > redundant filesystem. > > I tried to build an OSD woth same disk on another machine (stronger one > with Opteron QuadCore), failing at the same time. > > > Sorry for crossposting, but i think, i fail against the pveceph wrapper. > > > Any help or clue would be really useful.. > > Thanks > Best regards. > > > > > > > > > > > -- Link to previous thread (but same problem): > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg38897.html > > > -- commands : > fdisk /dev/sdc ( mklabel msdos, w, q) > ceph-disk zap /dev/sdc > pveceph createosd /dev/sdc > > -- dpkg -l > > dpkg -l |grep ceph > ii ceph 12.1.2-pve1 amd64 > distributed storage and file system > ii ceph-base12.1.2-pve1 amd64common > ceph daemon libraries and management tools > ii ceph-common 12.1.2-pve1 amd64common > utilities to mount and interact with a ceph storage cluster > ii ceph-mgr 12.1.2-pve1 amd64 > manager for the ceph distributed storage system > ii ceph-mon 12.1.2-pve1 amd64 > monitor server for the ceph storage system > ii ceph-osd 12.1.2-pve1 amd64OSD > server for the ceph storage system > ii libcephfs1 10.2.5-7.2 amd64Ceph > distributed file system client library > ii libcephfs2 12.1.2-pve1 amd64Ceph > distributed file system client library > ii python-cephfs12.1.2-pve1 amd64Python > 2 libraries for the Ceph libcephfs library > > -- tail -f /var/log/ceph/ceph-osd.admin.log > > 2017-09-03 18:28:20.856641 7fad97e45e00 0 ceph version 12.1.2 > (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc), process > (unknown), pid 5493 > 2017-09-03 18:28:20.857104 7fad97e45e00 -1 
bluestore(/dev/sdc2) > _read_bdev_label unable to decode label at offset 102: > buffer::malformed_input: void > bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode > past end of struct encoding > 2017-09-03 18:28:20.857200 7fad97e45e00 1 journal _open /dev/sdc2 fd 4: > 2000293007360 bytes, block size 4096 bytes, directio = 0, aio = 0 > 2017-09-03 18:28:20.857366 7fad97e45e00 1 journal close /dev/sdc2 > 2017-09-03 18:28:20.857431 7fad97e45e00 0 probe_block_device_fsid > /dev/sdc2 is filestore, ---- > 2017-09-03 18:28:21.937285 7fa5766a5e00 0 ceph version 12.1.2 > (cd7bc3b11cdbe6fa94324b7322fb2a4716a052a7) luminous (rc), process > (unknown), pid 5590 > 2017-09-03 18:28:21.944189 7fa5766a5e00 -1 bluestore(/dev/sdc2) > _read_bdev_label unable to decode label at offset 102: > buffer::malformed_input: void > bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode > past end of struct encoding > 2017-09-03 18:28:21.944305 7fa5766a5e00 1 journal _open /dev/sdc2 fd 4: > 2000293007360 bytes, block size 4096 bytes, directio = 0, aio = 0 > 2017-09-03 18:28:21.944527 7fa5766a5e00 1 journal close /dev/sdc2 > 2017-09-03 18:28:21.944588 7fa5766a5e00 0 probe_block_device_fsid > /dev/sdc2 is filestore, ---- > ___ > pve-user mailing list > pve-u...@pve.proxmox.com > https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] output discards (queue drops) on switchport
Hi, On 09/08/2017 04:13 PM, Andreas Herrmann wrote: Hi, On 08.09.2017 15:59, Burkhard Linke wrote: On 09/08/2017 02:12 PM, Marc Roos wrote: Afaik ceph is is not supporting/working with bonding. https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html (thread: Maybe some tuning for bonded network adapters) CEPH works well with LACP bonds. The problem described in that thread is the fact that LACP is not using links in a round robin fashion, but distributes network stream depending on a hash of certain parameters like source and destination IP address. This is already set to layer3+4 policy by the OP. Regarding the drops (and without any experience with neither 25GBit ethernet nor the Arista switches): Do you have corresponding input drops on the server's network ports? No input drops, just output drop Output drops on the switch are related to input drops on the server side. If the link uses flow control and the server signals the switch that its internal buffers are full, the switch has to drop further packets if the port buffer is also filled. If there's no flow control, and the network card is not able to store the packet (full buffers...), it should be noted as overrun in the interface statistics (and if this is not correct, please correct me, I'm not a network guy). Did you tune the network settings on server side for high throughput, e.g. net.ipv4.tcp_rmem, wmem, ...? sysctl tuning is disabled at the moment. I tried sysctl examples from https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is still the same amount of output drops. And are the CPUs fast enough to handle the network traffic? Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's my first Ceph cluster. The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid controller and 8 ssd based osds with it. You can use tools like atop or ntop to watch certain aspects of the system during the tests (network, cpu, disk). 
Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] output discards (queue drops) on switchport
Hi, On 08.09.2017 15:59, Burkhard Linke wrote: > On 09/08/2017 02:12 PM, Marc Roos wrote: >> >> Afaik ceph is is not supporting/working with bonding. >> >> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html >> (thread: Maybe some tuning for bonded network adapters) > > CEPH works well with LACP bonds. The problem described in that thread is the > fact that LACP is not using links in a round robin fashion, but distributes > network stream depending on a hash of certain parameters like source and > destination IP address. This is already set to layer3+4 policy by the OP. > > Regarding the drops (and without any experience with neither 25GBit ethernet > nor the Arista switches): > Do you have corresponding input drops on the server's network ports? No input drops, just output drops > Did you tune the network settings on server side for high throughput, e.g. > net.ipv4.tcp_rmem, wmem, ...? sysctl tuning is disabled at the moment. I tried sysctl examples from https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is still the same amount of output drops. > And are the CPUs fast enough to handle the network traffic? Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's my first Ceph cluster. Later I'll upgrade from 12.1.2 => 12.2.0 and will do some more tests. Regards, Andreas ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] output discards (queue drops) on switchport
Hi, On 09/08/2017 02:12 PM, Marc Roos wrote: Afaik ceph is is not supporting/working with bonding. https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html (thread: Maybe some tuning for bonded network adapters) CEPH works well with LACP bonds. The problem described in that thread is the fact that LACP is not using links in a round robin fashion, but distributes network stream depending on a hash of certain parameters like source and destination IP address. This is already set to layer3+4 policy by the OP. Regarding the drops (and without any experience with neither 25GBit ethernet nor the Arista switches): Do you have corresponding input drops on the server's network ports? Did you tune the network settings on server side for high throughput, e.g. net.ipv4.tcp_rmem, wmem, ...? And are the CPUs fast enough to handle the network traffic? Regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
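To illustrate the LACP point above: a layer3+4 policy hashes each flow's addresses and ports once, so every packet of one TCP stream stays on one slave link. A toy model of that behavior (not the exact bonding-driver hash, which differs in detail):

```python
def slave_for_flow(src_ip, dst_ip, src_port, dst_port, n_slaves):
    """Toy layer3+4 hash: XOR the IPv4 addresses and ports, modulo the
    number of slave links. The real bonding-driver hash differs, but it
    shares the key property: one flow always maps to one slave."""
    def ip_int(ip):
        a, b, c, d = (int(x) for x in ip.split("."))
        return (a << 24) | (b << 16) | (c << 8) | d
    h = ip_int(src_ip) ^ ip_int(dst_ip) ^ src_port ^ dst_port
    return h % n_slaves

# A single connection (hypothetical addresses/ports) is pinned to one of
# the two bond slaves, no matter how much traffic it carries:
link = slave_for_flow("10.0.0.1", "10.0.0.2", 40000, 6789, 2)
print(link)  # same value on every call for this flow
```

This is why an iperf with many parallel streams can fill the whole bond, while a single stream (or an unlucky hash distribution across a few OSD connections) concentrates on one link and can overflow that port's buffer.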
Re: [ceph-users] output discards (queue drops) on switchport
I disabled the complete bond and just used a single 25 GBit/s link. The output drops still appear on the switchports.

On 08.09.2017 14:12, Marc Roos wrote:
> Afaik ceph is not supporting/working with bonding.
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html
> (thread: Maybe some tuning for bonded network adapters)
>
> -----Original Message-----
> From: Andreas Herrmann [mailto:andr...@mx20.org]
> Sent: vrijdag 8 september 2017 13:58
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] output discards (queue drops) on switchport
>
> Hello,
>
> I have a fresh Proxmox installation on 5 servers (Supermicro X10SRW-F, Xeon E5-1660 v4, 128 GB RAM), each with 8 Samsung SM863 960GB SSDs connected to an LSI-9300-8i (SAS3008) controller, used as OSDs for Ceph (12.1.2).
>
> The servers are connected to two Arista DCS-7060CX-32S switches. I'm using an MLAG bond (bond mode LACP, xmit_hash_policy layer3+4, MTU 9000):
> * backend network for Ceph: cluster network & public network
>   Mellanox ConnectX-4 Lx dual-port 25 GBit/s
> * frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ dual-port
>
> Ceph is quite a default installation with size=3.
>
> My problem: I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in a test virtual machine (the only one running in the cluster) at around 210 MB/s. I get output drops on all switchports. The drop rate is between 0.1 - 0.9 %. The drop rate of 0.9 % is reached when writing with about 1300 MB/s into ceph.
>
> First I thought about a problem with the Mellanox cards and used the Intel cards for ceph traffic. The problem still exists.
> I tried quite a lot and nothing helped:
> * changed the MTU from 9000 to 1500
> * changed bond_xmit_hash_policy from layer3+4 to layer2+3
> * deactivated the bond and just used a single link
> * disabled offloading
> * disabled power management in BIOS
> * perf-bias 0
>
> I analyzed the traffic via tcpdump and got some of those "errors":
> * TCP Previous segment not captured
> * TCP Out-of-Order
> * TCP Retransmission
> * TCP Fast Retransmission
> * TCP Dup ACK
> * TCP ACKed unseen segment
> * TCP Window Update
>
> Is that behaviour normal for ceph, or does anyone have ideas how to solve the problem with the output drops at switch side?
>
> With iperf I can reach the full 50 GBit/s on the bond with zero output drops.
>
> Andreas
> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
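[Editor's note: since iperf is clean but Ceph traffic drops, the discards may be egress-buffer drops from several OSD hosts bursting into one switch port at once (incast), which a single bandwidth-test stream never triggers. A few host-side commands help confirm which side is dropping and whether flow control is in play; the interface name below is a placeholder, and whether raising ring sizes helps depends on the NIC:]

```shell
# Placeholder interface name; substitute the 25G ports / bond slaves.
IF=ens1f0

ip -s link show dev "$IF"                          # kernel-level RX/TX drop counters
ethtool -S "$IF" | grep -iE 'drop|discard|pause'   # NIC hardware counters
ethtool -g "$IF"                                   # current vs. maximum ring sizes
ethtool -a "$IF"                                   # pause-frame (flow control) state

# If RX ring drops show up, raising the rings sometimes helps, e.g.:
# ethtool -G "$IF" rx 4096 tx 4096
```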
Re: [ceph-users] ceph OSD journal (with dmcrypt) replacement
You used to need a keyfile in Hammer. I think Jewel changed that to a partition, but I don't have experience with that.

On Fri, Sep 8, 2017, 9:18 AM M Ranga Swami Reddy wrote:
> When I create a dmcrypted journal using the cryptsetup command, it's asking for a passphrase. Can I use an empty passphrase?
>
> On Wed, Sep 6, 2017 at 11:23 PM, M Ranga Swami Reddy wrote:
> > Thank you. I am able to replace the dmcrypt journal successfully.
> >
> > On Sep 5, 2017 18:14, "David Turner" wrote:
> >> Did the journal drive fail during operation? Or was it taken out during pre-failure? If it fully failed, then most likely you can't guarantee the consistency of the underlying osds. In this case, you just remove the affected osds and add them back in as new osds.
> >>
> >> In the case of having good data on the osds, you follow the standard process of flushing the journal, create the new partition, set up all of the partition metadata so that the ceph udev rules will know what the journal is, and just create a new dmcrypt volume on it. I would recommend using the same uuid as the old journal so that you don't need to update the symlinks and such on the osd. After everything is done, run the journal create command for the osd and start the osd.
> >>
> >> On Tue, Sep 5, 2017, 2:47 AM M Ranga Swami Reddy wrote:
> >>> Hello,
> >>> How to replace an OSD's journal created with dmcrypt, from one drive to another drive, in case the current journal drive failed?
> >>>
> >>> Thanks
> >>> Swami
> >>> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
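[Editor's note: David's procedure can be sketched roughly as below for a FileStore OSD. This is an untested outline, not a recipe: the device names, partition size, OSD id, keyfile location (the Hammer-era /etc/ceph/dmcrypt-keys path), and the dmcrypt-journal GPT type code are assumptions to verify against your release before running anything.]

```shell
# Hypothetical sketch only -- verify every value against your deployment.
OSD=12
OLD_UUID=$(basename "$(readlink /var/lib/ceph/osd/ceph-$OSD/journal)")

systemctl stop ceph-osd@$OSD
ceph-osd -i $OSD --flush-journal          # flush pending journal writes to the OSD

# New journal partition on the replacement SSD, reusing the old partuuid so the
# /var/lib/ceph/osd/ceph-$OSD/journal symlink keeps resolving without changes.
# The typecode is (assumed to be) ceph-disk's dmcrypt-journal GUID.
sgdisk --new=1:0:+10G --partition-guid=1:"$OLD_UUID" \
       --typecode=1:45b0969e-9b03-4f30-b4c6-5ec00ceff106 /dev/sdX

# Recreate the dmcrypt volume with the existing key material (Hammer-era path):
cryptsetup --key-file /etc/ceph/dmcrypt-keys/"$OLD_UUID" luksFormat /dev/sdX1
cryptsetup --key-file /etc/ceph/dmcrypt-keys/"$OLD_UUID" open /dev/sdX1 "$OLD_UUID"

ceph-osd -i $OSD --mkjournal              # write a fresh journal header
systemctl start ceph-osd@$OSD
```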
Re: [ceph-users] ceph OSD journal (with dmcrypt) replacement
When I create a dmcrypted journal using the cryptsetup command, it's asking for a passphrase. Can I use an empty passphrase?

On Wed, Sep 6, 2017 at 11:23 PM, M Ranga Swami Reddy wrote:
> Thank you. I am able to replace the dmcrypt journal successfully.
>
> On Sep 5, 2017 18:14, "David Turner" wrote:
>> Did the journal drive fail during operation? Or was it taken out during pre-failure? If it fully failed, then most likely you can't guarantee the consistency of the underlying osds. In this case, you just remove the affected osds and add them back in as new osds.
>>
>> In the case of having good data on the osds, you follow the standard process of flushing the journal, create the new partition, set up all of the partition metadata so that the ceph udev rules will know what the journal is, and just create a new dmcrypt volume on it. I would recommend using the same uuid as the old journal so that you don't need to update the symlinks and such on the osd. After everything is done, run the journal create command for the osd and start the osd.
>>
>> On Tue, Sep 5, 2017, 2:47 AM M Ranga Swami Reddy wrote:
>>> Hello,
>>> How to replace an OSD's journal created with dmcrypt, from one drive to another drive, in case the current journal drive failed?
>>>
>>> Thanks
>>> Swami
>>> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] output discards (queue drops) on switchport
Afaik ceph is not supporting/working with bonding.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html (thread: Maybe some tuning for bonded network adapters)

-----Original Message-----
From: Andreas Herrmann [mailto:andr...@mx20.org]
Sent: vrijdag 8 september 2017 13:58
To: ceph-users@lists.ceph.com
Subject: [ceph-users] output discards (queue drops) on switchport

Hello,

I have a fresh Proxmox installation on 5 servers (Supermicro X10SRW-F, Xeon E5-1660 v4, 128 GB RAM), each with 8 Samsung SM863 960GB SSDs connected to an LSI-9300-8i (SAS3008) controller, used as OSDs for Ceph (12.1.2).

The servers are connected to two Arista DCS-7060CX-32S switches. I'm using an MLAG bond (bond mode LACP, xmit_hash_policy layer3+4, MTU 9000):
* backend network for Ceph: cluster network & public network
  Mellanox ConnectX-4 Lx dual-port 25 GBit/s
* frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ dual-port

Ceph is quite a default installation with size=3.

My problem: I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in a test virtual machine (the only one running in the cluster) at around 210 MB/s. I get output drops on all switchports. The drop rate is between 0.1 - 0.9 %. The drop rate of 0.9 % is reached when writing with about 1300 MB/s into ceph.

First I thought about a problem with the Mellanox cards and used the Intel cards for ceph traffic. The problem still exists.
I tried quite a lot and nothing helped:
* changed the MTU from 9000 to 1500
* changed bond_xmit_hash_policy from layer3+4 to layer2+3
* deactivated the bond and just used a single link
* disabled offloading
* disabled power management in BIOS
* perf-bias 0

I analyzed the traffic via tcpdump and got some of those "errors":
* TCP Previous segment not captured
* TCP Out-of-Order
* TCP Retransmission
* TCP Fast Retransmission
* TCP Dup ACK
* TCP ACKed unseen segment
* TCP Window Update

Is that behaviour normal for ceph, or does anyone have ideas how to solve the problem with the output drops at switch side?

With iperf I can reach the full 50 GBit/s on the bond with zero output drops.

Andreas
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] output discards (queue drops) on switchport
Hello,

I have a fresh Proxmox installation on 5 servers (Supermicro X10SRW-F, Xeon E5-1660 v4, 128 GB RAM), each with 8 Samsung SM863 960GB SSDs connected to an LSI-9300-8i (SAS3008) controller, used as OSDs for Ceph (12.1.2).

The servers are connected to two Arista DCS-7060CX-32S switches. I'm using an MLAG bond (bond mode LACP, xmit_hash_policy layer3+4, MTU 9000):
* backend network for Ceph: cluster network & public network
  Mellanox ConnectX-4 Lx dual-port 25 GBit/s
* frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ dual-port

Ceph is quite a default installation with size=3.

My problem: I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in a test virtual machine (the only one running in the cluster) at around 210 MB/s. I get output drops on all switchports. The drop rate is between 0.1 - 0.9 %. The drop rate of 0.9 % is reached when writing with about 1300 MB/s into ceph.

First I thought about a problem with the Mellanox cards and used the Intel cards for ceph traffic. The problem still exists.

I tried quite a lot and nothing helped:
* changed the MTU from 9000 to 1500
* changed bond_xmit_hash_policy from layer3+4 to layer2+3
* deactivated the bond and just used a single link
* disabled offloading
* disabled power management in BIOS
* perf-bias 0

I analyzed the traffic via tcpdump and got some of those "errors":
* TCP Previous segment not captured
* TCP Out-of-Order
* TCP Retransmission
* TCP Fast Retransmission
* TCP Dup ACK
* TCP ACKed unseen segment
* TCP Window Update

Is that behaviour normal for ceph, or does anyone have ideas how to solve the problem with the output drops at switch side?

With iperf I can reach the full 50 GBit/s on the bond with zero output drops.

Andreas
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Bluestore "separate" WAL and DB
Hi,

Reading the ceph-users list I'm obviously seeing a lot of people talking about using bluestore now that Luminous has been released. I note that many users seem to be under the impression that they need separate block devices for the bluestore data block, the DB, and the WAL... even when they are going to put the DB and the WAL on the same device! As per the docs at http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ this is nonsense:

> If there is only a small amount of fast storage available (e.g., less than a gigabyte), we recommend using it as a WAL device. If there is more, provisioning a DB device makes more sense. The BlueStore journal will always be placed on the fastest device available, so using a DB device will provide the same benefit that the WAL device would while also allowing additional metadata to be stored there (if it will fix).

[sic, I assume that should be "fit"]

I understand that if you've got three speeds of storage available, there may be some sense in dividing these. For instance, if you've got lots of HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD, DB on SSD and WAL on NVMe may be a sensible division of data. But that's not the case for most of the examples I'm reading; they're talking about putting DB and WAL on the same block device, but in different partitions. There's even one example of someone suggesting to try partitioning a single SSD to put data/DB/WAL all in separate partitions!

Are the docs wrong, and/or am I missing something about optimal bluestore setup, or do people simply have the wrong end of the stick? I ask because I'm just going through switching all my OSDs over to Bluestore now, and I've just been reusing the partitions I set up for journals on my SSDs as DB devices for Bluestore HDDs without specifying anything to do with the WAL, and I'd like to know sooner rather than later if I'm making some sort of horrible mistake.
Rich

--
Richard Hesketh
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
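[Editor's note: the two-tier vs. three-tier split Rich describes can be expressed with ceph-disk roughly as below. The device paths are placeholders, and the exact option syntax should be checked against your release before relying on it.]

```shell
# Two speed tiers: HDD for data, SSD partition for the DB.
# The WAL then lands on the DB device automatically, per the docs quoted above,
# so no --block.wal argument is needed.
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/sdk1

# Three speed tiers only: HDD data, SSD DB, NVMe WAL.
ceph-disk prepare --bluestore /dev/sdb \
    --block.db /dev/sdk1 --block.wal /dev/nvme0n1p1
```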
Re: [ceph-users] Ceph release cadence
Hi, On 06/09/17 16:23, Sage Weil wrote: > Traditionally, we have done a major named "stable" release twice a year, > and every other such release has been an "LTS" release, with fixes > backported for 1-2 years. We use the ceph version that comes with our distribution (Ubuntu LTS); those come out every 2 years (though we won't move to a brand-new distribution until we've done some testing!). So from my POV, LTS ceph releases that come out such that adjacent ceph LTSs fit neatly into adjacent Ubuntu LTSs is the ideal outcome. We're unlikely to ever try putting a non-LTS ceph version into production. I hope this isn't an unusual requirement :) Matthew -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD: How many snapshots is too many?
> In our use case, we are severely hampered by the size of removed_snaps (50k+) in the OSDMap, to the point where ~80% of ALL cpu time is spent in PGPool::update and its interval calculation code. We have a cluster of around 100k RBDs, with each RBD having up to 25 snapshots and only a small portion of our RBDs mapped at a time (~500-1000). For size / performance reasons we try to keep the number of snapshots low (<25) and need to prune snapshots. Since in our use case RBDs 'age' at different rates, snapshot pruning creates holes, to the point where the size of the removed_snaps interval set in the osdmap is 50k-100k in many of our Ceph clusters. I think in general around 2 snapshot removal operations currently happen per minute, just because of the volume of snapshots and users we have.

Right. Greg, this is what I was getting at: 25 snapshots per RBD is firmly in "one snapshot per day per RBD" territory — this is something that a cloud operator might do, for example, offering daily snapshots going back one month. But it still wrecks the cluster simply by having lots of images (even though only a fraction of them, less than 1%, are ever in use). That's rather counter-intuitive, it doesn't hit you until you have lots of images, and once you're affected by it there's no practical way out — where "out" is defined as "restoring overall cluster performance to something acceptable".

> We found PGPool::update and the interval calculation code to be quite inefficient.
> Some small changes made it a lot faster, giving more breathing room; we shared these and most already got applied:
>
> https://github.com/ceph/ceph/pull/17088
> https://github.com/ceph/ceph/pull/17121
> https://github.com/ceph/ceph/pull/17239
> https://github.com/ceph/ceph/pull/17265
> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>
> However, while these patches helped our use case, overall CPU usage in this area is still high (>70% or so), making the Ceph cluster slow and causing blocked requests and many operations (e.g. rbd map) to take a long time.

I think this makes this very much a practical issue, not a hypothetical/theoretical one.

> We are trying to work around these issues by changing our snapshot strategy. In the short term we are manually defragmenting the interval set by scanning for holes and trying to delete snapids in between holes to coalesce more holes. This is not so nice to do. In some cases we employ strategies to 'recreate' old snapshots (as we need to keep them) at higher snapids. For our use case a 'snapid rename' feature would have been quite helpful.
>
> I hope this shines some light on practical Ceph clusters in which performance is bottlenecked not by I/O but by snapshot removal.

For others following this thread or retrieving it from the list archive some time down the road, I'd rephrase that as "bottlenecked not by I/O but by CPU utilization associated with snapshot removal". Is that fair to say, Patrick? Please correct me if I'm misrepresenting.

Greg (or Josh/Jason/Sage/anyone really :) ), can you provide additional insight as to how these issues can be worked around or mitigated, besides the PRs that Patrick and his colleagues have already sent?

Cheers, Florian
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
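[Editor's note: to make the fragmentation mechanism concrete, here is a toy model in plain Python. It is illustrative only, not Ceph's actual interval_set implementation: it shows how images that snapshot at different rates, while sharing one cluster-wide snapid counter, leave holes in the removed-snapshots set when only one of them is pruned.]

```python
# Toy model of an osdmap-style removed_snaps set: a sorted list of
# half-open [start, end) ranges of deleted snapshot ids.
# Illustrative only -- this is not Ceph's interval_set code.

def insert_id(intervals, snapid):
    """Insert one snapid, merging with any touching/overlapping range."""
    lo, hi = snapid, snapid + 1
    out = []
    for s, e in intervals:
        if e < lo or s > hi:        # disjoint and not adjacent: keep as-is
            out.append((s, e))
        else:                       # touching or overlapping: absorb it
            lo, hi = min(lo, s), max(hi, e)
    out.append((lo, hi))
    out.sort()
    return out

def demo():
    # Two RBD images share the cluster-wide snapid counter.
    # Image A snapshots twice as often as image B ("ages" faster).
    a_snaps, b_snaps = [], []
    snapid = 0
    for _ in range(3):
        snapid += 1; a_snaps.append(snapid)
        snapid += 1; a_snaps.append(snapid)
        snapid += 1; b_snaps.append(snapid)
    # ids now: A -> [1, 2, 4, 5, 7, 8], B -> [3, 6, 9]
    removed = []
    # Prune image A down to its two newest snapshots; B keeps all of its own.
    for victim in a_snaps[:-2]:
        removed = insert_id(removed, victim)
    return removed

print(demo())   # [(1, 3), (4, 6)] -- B's retained id 3 splits the removed set
```

Scale the same pattern up to ~100k images pruned at uneven rates and the set ends up with tens of thousands of disjoint ranges, which is what PGPool::update then has to walk on every map update.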
Re: [ceph-users] cephfs(Kraken 11.2.1), Unable to write more file when one dir more than 100000 files, mds_bal_fragment_size_max = 5000000
> On 8 Sep 2017, at 13:54, donglifec...@gmail.com wrote:
>
> ZhengYan,
>
> I'm sorry, just a description of some questions.
>
> When one dir has more than 100000 files, I can continue to write to it, but I can't find files which were written in the past. For example:
> 1. I write 100000 files named 512k.file$i
> 2. I continue to write 10000 files named aaa.file$i
> 3. I continue to write 10000 files named bbb.file$i
> 4. I continue to write 10000 files named ccc.file$i
> 5. I continue to write 10000 files named ddd.file$i
> 6. I can't find all ddd.file$i; some ddd.file$i are missing. Such as:
>
> [root@yj43959-ceph-dev scripts]# find /mnt/cephfs/volumes -type f | grep 512k.file | wc -l
> 100000
> [root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/aaa.file* | wc -l
> 10000
> [root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/bbb.file* | wc -l
> 10000
> [root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/ccc.file* | wc -l
> 10000
> [root@yj43959-ceph-dev scripts]# ls /mnt/cephfs/volumes/ddd.file* | wc -l    // some files missing
> 1072

It's likely caused by http://tracker.ceph.com/issues/18314. To support very large directories, you should enable directory fragmentation instead of enlarging mds_bal_fragment_size_max.

Regards
Yan, Zheng

> donglifec...@gmail.com
>
> From: donglifec...@gmail.com
> Date: 2017-09-08 13:30
> To: zyan
> CC: ceph-users
> Subject: [ceph-users] cephfs (Kraken 11.2.1), unable to write more files when one dir has more than 100000 files, mds_bal_fragment_size_max = 5000000
>
> ZhengYan,
>
> I tested cephfs (Kraken 11.2.1): I can't write more files when one dir has more than 100000 files, although I have already set "mds_bal_fragment_size_max = 5000000".
>
> Why is this the case? Is it a bug?
>
> Thanks a lot.
>
> donglifec...@gmail.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
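[Editor's note: for readers hitting the same limit, enabling directory fragmentation on a pre-Luminous release looked roughly like the sketch below. The exact flag name and confirmation option may differ by version, so check your release's documentation first.]

```shell
# Pre-Luminous: dirfrags are off by default and must be enabled per filesystem
# (the confirmation flag is assumed; some releases may not require it).
ceph fs set cephfs allow_dirfrags true --yes-i-really-mean-it

# Luminous and later enable directory fragmentation by default,
# so no action should be needed there.
```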