Re: [ceph-users] pg 17.36 is active+clean+inconsistent head expected clone 1 missing?
Looks similar to a problem I had after several OSDs crashed while trimming snapshots. In my case, the primary OSD thought the snapshot was gone, but some of the replicas still had it, so scrubbing flagged it. First I purged all snapshots and then ran ceph pg repair on the problematic placement groups. The first time I encountered this, that action was sufficient to repair the problem. The second time, however, I ended up having to manually remove the snapshot objects. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027431.html Once I had done that, repairing the placement group fixed the issue.

-Steve

On 11/16/2018 04:00 AM, Marc Roos wrote:
> I am not sure that is going to work, because I have had this error for quite
> some time, from before I added the 4th node. And on the 3 node cluster
> it was:
>
> osdmap e18970 pg 17.36 (17.36) -> up [9,0,12] acting [9,0,12]
>
> If I understand correctly what you intend to do (moving the data around),
> this was sort of accomplished by adding the 4th node.
>
> -Original Message-
> From: Frank Yu [mailto:flyxia...@gmail.com]
> Sent: vrijdag 16 november 2018 3:51
> To: Marc Roos
> Cc: ceph-users
> Subject: Re: [ceph-users] pg 17.36 is active+clean+inconsistent head
> expected clone 1 missing?
>
> Try to restart osd.29, then use pg repair. If this doesn't work or it
> appears again after a while, scan the HDD used for osd.29; there may be
> a bad sector on the disk, in which case just replace the disk with a new one.
>
> On Thu, Nov 15, 2018 at 5:00 PM Marc Roos wrote:
>
> Forgot, these are bluestore osds
>
> -Original Message-
> From: Marc Roos
> Sent: donderdag 15 november 2018 9:59
> To: ceph-users
> Subject: [ceph-users] pg 17.36 is active+clean+inconsistent head
> expected clone 1 missing?
>
> I thought I would give it another try, asking again here since there is
> another thread going at the moment. I have had this error for a year or so.
>
> This I of course already tried:
> ceph pg deep-scrub 17.36
> ceph pg repair 17.36
>
> [@c01 ~]# rados list-inconsistent-obj 17.36
> {"epoch":24363,"inconsistents":[]}
>
> [@c01 ~]# ceph pg map 17.36
> osdmap e24380 pg 17.36 (17.36) -> up [29,12,6] acting [29,12,6]
>
> [@c04 ceph]# zgrep ERR ceph-osd.29.log*gz
> ceph-osd.29.log-20181114.gz:2018-11-13 14:19:55.766604 7f25a05b1700 -1
> log_channel(cluster) log [ERR] : deep-scrub 17.36
> 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:head expected
> clone 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4 1 missing
> ceph-osd.29.log-20181114.gz:2018-11-13 14:24:55.943454 7f25a05b1700 -1
> log_channel(cluster) log [ERR] : 17.36 deep-scrub 1 errors
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Steve Anthony
LTS HPC Senior Analyst
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
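For anyone who wants the concrete commands, the sequence described above looks roughly like this. rbd/myimage is a placeholder (in my case I purged the snapshots on every image in the pool), and snap purge really deletes the snapshots, so only do it if you can live without them:

# purge every snapshot on the affected image, then let snap trimming finish
rbd snap purge rbd/myimage

# re-run the deep scrub and repair on the flagged PG
ceph pg deep-scrub 17.36
ceph pg repair 17.36

# verify the error is gone
ceph health detail
rados list-inconsistent-obj 17.36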
Re: [ceph-users] unable to remove phantom snapshot for object, snapset_inconsistency
In case anyone else runs into this, I resolved using removeall on both bad OSDs and running ceph pg repair, which copied the good object back. -Steve On 06/27/2018 06:17 PM, Steve Anthony wrote: In the process of trying to repair snapshot inconsistencies associated with the issues in this thread, http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027125.html ("FAILED assert(p != recovery_info.ss.clone_snaps.end())"), I have one PG I still can't get to repair. Two of the three replicas appear to have (or think they have) a snapshot. However, neither ceph-objectstore-tool list operation nor running find on the OSD fuse mounted report or find the snaps. # ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-313/ --pgid 2.13e --op list rb.0.2479b45.238e1f29.00125cbb ["2.13e",{"oid":"rb.0.2479b45.238e1f29.00125cbb","key":"","snapid":-2,"hash":2016338238,"max":0,"pool":2,"namespace":"","max":0}] The ceph-objectstore tool remove-clone-metadata operation also reports the snapshot does not exist. # ceph-objectstore-tool --dry-run --type bluestore --data-path /var/lib/ceph/osd/ceph-313/ --pgid 2.13e '{"oid":"rb.0.2479b45.238e1f29.00125cbb","key":"","snapid":-2,"hash":2016338238,"max":0,"pool":2,"namespace":"","max":0}' remove-clone-metadata 4896 Clone 1320 not presentdry-run: Nothing changed However, the remove operation sees the snapshot and refuses to delete the object. # ceph-objectstore-tool --dry-run --type bluestore --data-path /var/lib/ceph/osd/ceph-313/ --pgid 2.13e '{"oid":"rb.0.2479b45.238e1f29.00125cbb","key":"","snapid":-2,"hash":2016338238,"max":0,"pool":2,"namespace":"","max":0}' remove Snapshots are present, use removeall to delete everything dry-run: Nothing changed Listing the inconsistencies with rados, it appears that the phantom snapshot is present on 2/3 replicas. Other PGs had this issue, but on 1/3 replicas and using removeall on the bad copy, then repairing the PG fixed the issue. Running removeall on the primary replica resulted in the repair replicating the other bad object. Should I just issue removeall on both OSDs and then run repair to fix the missing objects, or is there some other way to purge snaps on an object? (I've already purged all snapshots on all images in the cluster with rbd snap purge) Thoughts? # rados list-inconsistent-obj 2.13e { "epoch": 1008264, "inconsistents": [ { "object": { "name": "rb.0.2479b45.238e1f29.00125cbb", "nspace": "", "locator": "", "snap": "head", "version": 2024222 }, "errors": [ "object_info_inconsistency", "snapset_inconsistency" ], "union_shard_errors": [ ], "selected_object_info": { "oid": { "oid": "rb.0.2479b45.238e1f29.00125cbb", "key": "", "snapid": -2, "hash": 2016338238, "max": 0, "pool": 2, "namespace": "" }, "version": "946857'2041225", "prior_version": "943431'2032262", "last_reqid": "osd.36.0:48196", "user_version": 2024222, "size": 4194304, "mtime": "2018-05-13 08:58:21.359912", "local_mtime": "2018-05-13 08:58:21.537637", "lost": 0, "flags": [ "dirty", "data_digest", "omap_digest" ], "legacy_snaps": [ ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x0d99bd77", "omap_digest": "0x", "expected_object_size": 4194304, "expected_write_size": 4194304, "alloc_hint_flags": 0, "manifest": { "type": 0, "redirect_target": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -9.2233720368548e
[ceph-users] unable to remove phantom snapshot for object, snapset_inconsistency
rors": [ ], "size": 4194304, "omap_digest": "0x", "data_digest": "0x0d99bd77", "object_info": { "oid": { "oid": "rb.0.2479b45.238e1f29.00125cbb", "key": "", "snapid": -2, "hash": 2016338238, "max": 0, "pool": 2, "namespace": "" }, "version": "946857'2041225", "prior_version": "943431'2032262", "last_reqid": "osd.36.0:48196", "user_version": 2024222, "size": 4194304, "mtime": "2018-05-13 08:58:21.359912", "local_mtime": "2018-05-13 08:58:21.537637", "lost": 0, "flags": [ "dirty", "data_digest", "omap_digest" ], "legacy_snaps": [ ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x0d99bd77", "omap_digest": "0x", "expected_object_size": 4194304, "expected_write_size": 4194304, "alloc_hint_flags": 0, "manifest": { "type": 0, "redirect_target": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -9.2233720368548e+18, "namespace": "" } }, "watchers": { } }, "snapset": { "snap_context": { "seq": 4896, "snaps": [ ] }, "head_exists": 1, "clones": [ ] } }, { "osd": 305, "primary": false, "errors": [ ], "size": 4194304, "omap_digest": "0x", "data_digest": "0x0d99bd77", "object_info": { "oid": { "oid": "rb.0.2479b45.238e1f29.00125cbb", "key": "", "snapid": -2, "hash": 2016338238, "max": 0, "pool": 2, "namespace": "" }, "version": "943431'2032262", "prior_version": "942275'2030618", "last_reqid": "osd.36.0:48196", "user_version": 2024222, "size": 4194304, "mtime": "2018-05-13 08:58:21.359912", "local_mtime": "2018-05-13 08:58:21.537637", "lost": 0, "flags": [ "dirty", "data_digest", "omap_digest" ], "legacy_snaps": [ ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x0d99bd77", "omap_digest": "0x", "expected_object_size": 4194304, "expected_write_size": 4194304, "alloc_hint_flags": 0, "manifest": { "type": 0, "redirect_target": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -9.2233720368548e+18, "namespace": "" } }, "watchers": { } }, "snapset": { "snap_context": { "seq": 4896, "snaps": [ 4896 ] }, "head_exists": 1, "clones": [ ] } }, { "osd": 313, "primary": true, "errors": [ ], "size": 4194304, "omap_digest": "0x", "data_digest": "0x0d99bd77", "object_info": { "oid": { "oid": "rb.0.2479b45.238e1f29.00125cbb", "key": "", "snapid": -2, "hash": 2016338238, "max": 0, "pool": 2, "namespace": "" }, "version": "943431'2032262", "prior_version": "942275'2030618", "last_reqid": "osd.36.0:48196", "user_version": 2024222, "size": 4194304, "mtime": "2018-05-13 08:58:21.359912", "local_mtime": "2018-05-13 08:58:21.537637", "lost": 0, "flags": [ "dirty", "data_digest", "omap_digest" ], "legacy_snaps": [ ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x0d99bd77", "omap_digest": "0x", "expected_object_size": 4194304, "expected_write_size": 4194304, "alloc_hint_flags": 0, "manifest": { "type": 0, "redirect_target": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -9.2233720368548e+18, "namespace": "" } }, "watchers": { } }, "snapset": { "snap_context": { "seq": 4896, "snaps": [ 4896 ] }, "head_exists": 1, "clones": [ ] } } ] } ] } -- Steve Anthony LTS HPC Senior Analyst Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
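For the record, a rough sketch of the removeall-then-repair approach described above. The PG, object JSON, and OSD IDs 305/313 come from the listing in this thread; the OSD has to be stopped while ceph-objectstore-tool runs against it, and removeall permanently deletes that replica, so double-check you're on a copy flagged as bad before running it:

# on the host with osd.313 (repeat on the host with osd.305)
systemctl stop ceph-osd@313.service
ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-313/ --pgid 2.13e '{"oid":"rb.0.2479b45.238e1f29.00125cbb","key":"","snapid":-2,"hash":2016338238,"max":0,"pool":2,"namespace":"","max":0}' removeall
systemctl start ceph-osd@313.service

# once both bad copies are gone, repair the PG so the good copy is replicated back
ceph pg repair 2.13e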
Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())
One addendum for the sake of completeness. A few PGs still refused to repair even after the clone object was gone. To resolve this I needed to remove the clone metadata from the HEAD using ceph-objectstore-tool. First, I found the problematic clone ID in the log on the primary replica: ceph2:~# grep ERR /var/log/ceph/ceph-osd.229.log 2018-06-25 10:59:37.554924 7fbdd80d2700 -1 log_channel(cluster) log [ERR] : repair 2.9a6 2:65942a51:::rb.0.2479b45.238e1f29.002d338d:head expected clone 2:65942a51:::rb.0.2479b45.238e1f29.002d338d:1320 1 missing In this case the clone ID is 1320. Note that this is the hex value and ceph-objectstore-tool will expect the decimal equivalent (4896 in this case). Then on each host stop the OSD and remove the metadata. For Bluestore this looks like: ceph2:~# ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-229/ --pgid 2.9a6 '{"oid":"rb.0.2479b45.238e1f29","snapid":-2,"hash":2320771494,"max":0,"pool":2,"namespace":"","max":0}' remove-clone-metadata 4896 Removal of clone 1320 complete Use pg repair after OSD restarted to correct stat information And if it's a Filestore OSD: ceph15:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-122/ --pgid 2.9a6 '{"oid":"rb.0.2479b45.238e1f29.002d338d","key":"","snapid":-2,"hash":2320771494,"max":0,"pool":2,"namespace":"","max":0}' remove-clone-metadata 4896 Removal of clone 1320 complete Use pg repair after OSD restarted to correct stat information Once that's done, starting the OSD and repairing the PG finally marked it as clean. -Steve On 06/14/2018 05:07 PM, Steve Anthony wrote: > > For reference, building luminous with the changes in the pull request > also fixed this issue for me. Some of my unexpected snapshots were on > Bluestore devices; here's how I used the objectstore tool to remove > them. In the example, the problematic placement group is 2.1c3f, and > the unexpected clone is identified in the OSD's log as > rb.0.2479b45.238e1f29.0df:1356. 
> > ceph14:~# systemctl stop ceph-osd@34.service > ceph14:~# ceph-objectstore-tool --type bluestore --data-path > /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f --op list > rb.0.2479b45.238e1f29.0ddf > Error getting attr on : 2.1c3f_head,#-4:fc38:::scrub_2.1c3f:head#, (61) > No data available > ["2.1c3f",{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}] > ["2.1c3f",{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":-2,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}] > ceph14:~# ceph-objectstore-tool --dry-run --type bluestore --data-path > /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f > '{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}' > remove > remove #2:fc3af922:::rb.0.2479b45.238e1f29.0ddf:1356# > dry-run: Nothing changed > ceph14:~# ceph-objectstore-tool --type bluestore --data-path > /var/lib/ceph/osd/ceph-34/ --pgid 2.1c3f > '{"oid":"rb.0.2479b45.238e1f29.0ddf","key":"","snapid":4950,"hash":1151294527,"max":0,"pool":2,"namespace":"","max":0}' > remove > remove #2:fc3af922:::rb.0.2479b45.238e1f29.0ddf:1356# > ceph14:~# systemctl start ceph-osd@34.service > > -Steve > > On 06/14/2018 04:59 PM, Nick Fisk wrote: >> For completeness in case anyone has this issue in the future and stumbles >> across this thread >> >> If your OSD is crashing and you are still running on a Luminous build that >> does not have the fix in the pull request below, you will >> need to compile the ceph-osd binary and replace it on the affected OSD node. >> This will get your OSD's/cluster back up and running. >> >> In regards to the stray object/clone, I was unable to remove it using the >> objectstore tool, I'm guessing this is because as far as >> the OSD is concerned it believes that clone should have already been >> deleted. I am still running Filestore on this cluster and >> simply removing the clone object from the OSD PG folder (Note: the object >> won't have _head in its name) and then running a deep >> scrub
Re: [ceph-users] FAILED assert(p != recovery_info.ss.clone_snaps.end())
479:1c, version: 2195927'1249660, data_included: [], > data_size: 0, omap_header_size: 0, omap_entries_size: 0, > attrset_size: 1, recovery_info: > ObjectRecoveryInfo(1:534b0c9f:::rbd_data.0c4c14 > 238e1f29.000bf479:1c@2195927'1249660, size: 4194304, copy_subset: [], > clone_subset: {}, snapset: 1c=[]:{}), after_progress: > ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, > omap_re covered_to:, omap_complete:true, error:false), > before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, > data_complete:false, omap_recovered_to:, omap_complete:false, > error:false))]) v3 909+0+0 (7 > 22394556 0 0) 0x5574480d0d80 con 0x557447510800 > -2> 2018-06-05 16:28:59.560183 7fcd7b655700 5 -- > [2a03:25e0:254:5::113]:6829/525383 >> [2a03:25e0:254:5::12]:6809/5784710 > conn(0x557447510800 :6829 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH > pgs=13524 cs > =1 l=0). rx osd.46 seq 7 0x55744783f900 pg_backfill(progress 1.2ca e > 2196813/2196813 lb > 1:534b0b88:::rbd_data.f870ac238e1f29.000ff145:head) v3 > -1> 2018-06-05 16:28:59.560189 7fcd7b655700 1 -- > [2a03:25e0:254:5::113]:6829/525383 <== osd.46 > [2a03:25e0:254:5::12]:6809/5784710 7 pg_backfill(progress 1.2ca e > 2196813/2196813 lb 1:534b0b88:::rbd_data > .f870ac238e1f29.000ff145:head) v3 946+0+0 (3865576583 0 0) > 0x55744783f900 con 0x557447510800 > 0> 2018-06-05 16:28:59.564054 7fcd5f3ba700 -1 > /build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void > PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, > ObjectContextR ef, bool, ObjectStore::Transaction*)' > thread 7fcd5f3ba700 time 2018-06-05 16:28:59.561060 > /build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != > recovery_info.ss.clone_snaps.end()) > > ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous > (stable) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x102) [0x557424971a02] > 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo > const&, std::shared_ptr, bool, > ObjectStore::Transaction*)+0xd63) [0x5574244df873] > 3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, > ObjectStore::Transaction*)+0x2da) [0x5574246715ca] > 4: (ReplicatedBackend::_do_push(boost::intrusive_ptr)+0x12e) > [0x5574246717fe] > 5: > (ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x2c1) > [0x557424680d71] > 6: (PGBackend::handle_message(boost::intrusive_ptr)+0x50) > [0x55742458c440] > 7: (PrimaryLogPG::do_request(boost::intrusive_ptr&, > ThreadPool::TPHandle&)+0x543) [0x5574244f0853] > 8: (OSD::dequeue_op(boost::intrusive_ptr, > boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3a9) > [0x557424367539] > 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr > const&)+0x57) [0x557424610f37] > 10: (OSD::ShardedOpWQ::_process(unsigned int, > ceph::heartbeat_handle_d*)+0x1047) [0x557424395847] > 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) > [0x5574249767f4] > 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557424979830] > 13: (()+0x76ba) [0x7fcd7f1cb6ba] > 14: (clone()+0x6d) [0x7fcd7e24241d] > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > 
ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Senior Analyst Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
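One small note on the clone ID conversion mentioned above: the scrub error prints the clone ID in hex (1320), while remove-clone-metadata expects the decimal equivalent (4896). A quick way to convert in the shell:

# hex clone ID from the OSD log -> decimal for ceph-objectstore-tool
printf '%d\n' 0x1320     # prints 4896
echo $((16#1320))        # same thing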
Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())
I'm seeing this again on two OSDs after adding another 20 disks to my cluster. Is there someway I can maybe determine which snapshots the recovery process is looking for? Or maybe find and remove the objects it's trying to recover, since there's apparently a problem with them? Thanks! -Steve On 05/18/2017 01:06 PM, Steve Anthony wrote: > > Hmmm, after crashing for a few days every 30 seconds it's apparently > running normally again. Weird. I was thinking since it's looking for a > snapshot object, maybe re-enabling snaptrimming and removing all the > snapshots in the pool would remove that object (and the problem)? > Never got to that point this time, but I'm going to need to cycle more > OSDs in and out of the cluster, so if it happens again I might try > that and update. > > Thanks! > > -Steve > > > On 05/17/2017 03:17 PM, Gregory Farnum wrote: >> >> >> On Wed, May 17, 2017 at 10:51 AM Steve Anthony <sma...@lehigh.edu >> <mailto:sma...@lehigh.edu>> wrote: >> >> Hello, >> >> After starting a backup (create snap, export and import into a second >> cluster - one RBD image still exporting/importing as of this message) >> the other day while recovery operations on the primary cluster were >> ongoing I noticed an OSD (osd.126) start to crash; I reweighted >> it to 0 >> to prepare to remove it. Shortly thereafter I noticed the problem >> seemed >> to move to another OSD (osd.223). After looking at the logs, I >> noticed >> they appeared to have the same problem. I'm running Ceph version >> 9.2.1 >> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8. >> >> Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe >> >> Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA >> >> >> May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 >> 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors >> {default=true} >> May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897 >> 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void >> ReplicatedPG::on_local_recover(const hobject_t&, const >> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, >> ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15 >> 10:39:55.322306 >> May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: >> FAILED >> assert(recovery_info.oi.snaps.size()) >> >> May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391 >> 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true} >> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In >> function >> 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const >> object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, >> ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15 >> 16:45:30.799839 >> May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: >> FAILED >> assert(recovery_info.oi.snaps.size()) >> >> >> I did some searching and thought it might be related to >> http://tracker.ceph.com/issues/13837 aka >> https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled >> scrubbing and deep-scrubbing, and set >> osd_pg_max_concurrent_snap_trims >> to 0 for all OSDs. No luck. I had changed the systemd service file to >> automatically restart osd.223 while recovery was happening, but it >> appears to have stalled; I suppose it's needed up for the >> remaining objects. >> >> >> Yeah, these aren't really related that I can see — though I haven't >> spent much time in this code that I can recall. 
The OSD is receiving >> a "push" as part of log recovery and finds that the object it's >> receiving is a snapshot object without having any information about >> the snap IDs that exist, which is weird. I don't know of any way a >> client could break it either, but maybe David or Jason know something >> more. >> -Greg >> >> >> >> I didn't see anything else online, so I thought I see if anyone >> has seen >> this before or has any other ideas. Thanks for taking the time. >> >> -Steve >> >> >> -- >> Steve Anthony >> LTS HPC Senior Analyst >> Lehigh University >> sma...@lehigh.edu <mailto:sma...@lehigh.edu> &g
Re: [ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())
Hmmm, after crashing for a few days every 30 seconds it's apparently running normally again. Weird. I was thinking since it's looking for a snapshot object, maybe re-enabling snaptrimming and removing all the snapshots in the pool would remove that object (and the problem)? Never got to that point this time, but I'm going to need to cycle more OSDs in and out of the cluster, so if it happens again I might try that and update. Thanks! -Steve On 05/17/2017 03:17 PM, Gregory Farnum wrote: > > > On Wed, May 17, 2017 at 10:51 AM Steve Anthony <sma...@lehigh.edu > <mailto:sma...@lehigh.edu>> wrote: > > Hello, > > After starting a backup (create snap, export and import into a second > cluster - one RBD image still exporting/importing as of this message) > the other day while recovery operations on the primary cluster were > ongoing I noticed an OSD (osd.126) start to crash; I reweighted it > to 0 > to prepare to remove it. Shortly thereafter I noticed the problem > seemed > to move to another OSD (osd.223). After looking at the logs, I noticed > they appeared to have the same problem. I'm running Ceph version 9.2.1 > (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8. > > Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe > > Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA > > > May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 > 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors > {default=true} > May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897 > 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void > ReplicatedPG::on_local_recover(const hobject_t&, const > object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, > ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15 > 10:39:55.322306 > May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: > FAILED > assert(recovery_info.oi.snaps.size()) > > May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391 > 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true} > May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In > function > 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const > object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, > ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15 > 16:45:30.799839 > May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: > FAILED > assert(recovery_info.oi.snaps.size()) > > > I did some searching and thought it might be related to > http://tracker.ceph.com/issues/13837 aka > https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled > scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims > to 0 for all OSDs. No luck. I had changed the systemd service file to > automatically restart osd.223 while recovery was happening, but it > appears to have stalled; I suppose it's needed up for the > remaining objects. > > > Yeah, these aren't really related that I can see — though I haven't > spent much time in this code that I can recall. The OSD is receiving a > "push" as part of log recovery and finds that the object it's > receiving is a snapshot object without having any information about > the snap IDs that exist, which is weird. I don't know of any way a > client could break it either, but maybe David or Jason know something > more. > -Greg > > > > I didn't see anything else online, so I thought I see if anyone > has seen > this before or has any other ideas. Thanks for taking the time. 
> > -Steve > > > -- > Steve Anthony > LTS HPC Senior Analyst > Lehigh University > sma...@lehigh.edu <mailto:sma...@lehigh.edu> > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Steve Anthony LTS HPC Senior Analyst Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] OSD crash loop - FAILED assert(recovery_info.oi.snaps.size())
Hello, After starting a backup (create snap, export and import into a second cluster - one RBD image still exporting/importing as of this message) the other day while recovery operations on the primary cluster were ongoing I noticed an OSD (osd.126) start to crash; I reweighted it to 0 to prepare to remove it. Shortly thereafter I noticed the problem seemed to move to another OSD (osd.223). After looking at the logs, I noticed they appeared to have the same problem. I'm running Ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd) on Debian 8. Log for osd.126 from start to crash: https://pastebin.com/y4fn94xe Log for osd.223 from start to crash: https://pastebin.com/AE4CYvSA May 15 10:39:55 ceph13 ceph-osd[21506]: -9308> 2017-05-15 10:39:51.561342 7f225c385900 -1 osd.126 616621 log_to_monitors {default=true} May 15 10:39:55 ceph13 ceph-osd[21506]: 2017-05-15 10:39:55.328897 7f2236be3700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7f2236be3700 time 2017-05-15 10:39:55.322306 May 15 10:39:55 ceph13 ceph-osd[21506]: osd/ReplicatedPG.cc: 192: FAILED assert(recovery_info.oi.snaps.size()) May 15 16:45:25 ceph19 ceph-osd[30527]: 2017-05-15 16:45:25.343391 7ff40f41e900 -1 osd.223 619808 log_to_monitors {default=true} May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::on_local_recover(const hobject_t&, const object_stat_sum_t&, const ObjectRecoveryInfo&, ObjectContextRef, ObjectStore::Transaction*)' thread 7ff3eab63700 time 2017-05-15 16:45:30.799839 May 15 16:45:30 ceph19 ceph-osd[30527]: osd/ReplicatedPG.cc: 192: FAILED assert(recovery_info.oi.snaps.size()) I did some searching and thought it might be related to http://tracker.ceph.com/issues/13837 aka https://bugzilla.redhat.com/show_bug.cgi?id=1351320 so I disabled scrubbing and deep-scrubbing, and set osd_pg_max_concurrent_snap_trims to 0 for all OSDs. No luck. I had changed the systemd service file to automatically restart osd.223 while recovery was happening, but it appears to have stalled; I suppose it's needed up for the remaining objects. I didn't see anything else online, so I thought I see if anyone has seen this before or has any other ideas. Thanks for taking the time. -Steve -- Steve Anthony LTS HPC Senior Analyst Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
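For reference, a rough sketch of how the scrub/snap-trim changes mentioned above can be applied at runtime. Values injected this way don't survive an OSD restart unless they're also set in ceph.conf, and 2 is, as far as I know, the default for osd_pg_max_concurrent_snap_trims:

ceph osd set noscrub
ceph osd set nodeep-scrub
ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 0'

# revert once the problem is resolved
ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 2'
ceph osd unset noscrub
ceph osd unset nodeep-scrub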
[ceph-users] download.ceph.com metadata problem?
It looks like there might be an issue with the repo metadata. I'm not seeing ceph, ceph-common, librbd1, etc. in the debian-giant wheezy branch. I ended up just downloading the debs and installing them manually in the interim. FYI. -Steve cat /etc/apt/sources.list.d/ceph.list deb http://download.ceph.com/debian-giant/ wheezy main grep Package /var/lib/apt/lists/download.ceph.com_debian-giant_dists_wheezy_main_binary-amd64_Packages Package: ceph-dbg Package: ceph-deploy Package: ceph-fs-common Package: ceph-fuse Package: ceph-fuse-dbg Package: ceph-test Package: librados2-dbg Package: radosgw-agent apt-cache policy ceph ceph: Installed: 0.87.2-1~bpo70+1 Candidate: 0.87.2-1~bpo70+1 Version table: *** 0.87.2-1~bpo70+1 0 100 /var/lib/dpkg/status 0.80.7-1~bpo70+1 0 100 http://debian.cc.lehigh.edu/debian/ wheezy-backports/main amd64 Packages -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] nfs over rbd problem
cib: info: cib_perform_op: > + /cib: @num_updates=162 > Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op: > + /cib/status/node_state[@id='node2']: > @crm-debug-origin=do_update_resource > Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op: > + > /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']/lrm_rsc_op[@id='p_rbd_map_1_last_0']: > > @operation_key=p_rbd_map_1_start_0, @operation=start, > @transition-key=6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, > @transition-magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, > @call-id=48, @rc-code=1, @op-status=2, @last-run=1450430539, > @last-rc-change=1450430539, @exec-time=20002 > Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op: > ++ > /cib/status/node_state[@id='node2']/lrm[@id='node2']/lrm_resources/lrm_resource[@id='p_rbd_map_1']: > > operation_key="p_rbd_map_1_start_0" operation="start" > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" > transition-key="6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9" > transition-magic="2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9" > call-id="48" rc-code="1" op-status="2" interval="0" l > Dec 18 17:22:39 [2690] node2cib: info: > cib_process_request: Completed cib_modify operation for section > status: OK (rc=0, origin=node2/crmd/99, version=0.69.162) > Dec 18 17:22:39 [2695] node2 crmd: warning: status_from_rc: > Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. rc: 1): > Error > Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount: > Updating failcount for p_rbd_map_1 on node2 after failed start: > rc=1 (update=INFINITY, time=1450430559) > Dec 18 17:22:39 [2695] node2 crmd: notice: > abort_transition_graph: Transition aborted by p_rbd_map_1_start_0 > 'modify' on node2: Event failed > (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162, > source=match_graph_event:344, 0) > Dec 18 17:22:39 [2695] node2 crmd: info: match_graph_event: > Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4) > Dec 18 17:22:39 [2693] node2 attrd: notice: > attrd_trigger_update: Sending flush op to all hosts for: > fail-count-p_rbd_map_1 (INFINITY) > Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount: > Updating failcount for p_rbd_map_1 on node2 after failed start: > rc=1 (update=INFINITY, time=1450430559) > Dec 18 17:22:39 [2695] node2 crmd: info: > process_graph_event: Detected action (3.6) > p_rbd_map_1_start_0.48=unknown error: failed > Dec 18 17:22:39 [2695] node2 crmd: warning: status_from_rc: > Action 6 (p_rbd_map_1_start_0) on node2 failed (target: 0 vs. 
rc: 1): > Error > Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount: > Updating failcount for p_rbd_map_1 on node2 after failed start: > rc=1 (update=INFINITY, time=1450430559) > Dec 18 17:22:39 [2695] node2 crmd: info: > abort_transition_graph: Transition aborted by p_rbd_map_1_start_0 > 'create' on (null): Event failed > (magic=2:1;6:3:0:1b17b95d-a029-4ea5-be6d-4e5d8add6ca9, cib=0.69.162, > source=match_graph_event:344, 0) > Dec 18 17:22:39 [2695] node2 crmd: info: match_graph_event: > Action p_rbd_map_1_start_0 (6) confirmed on node2 (rc=4) > Dec 18 17:22:39 [2695] node2 crmd: warning: update_failcount: > Updating failcount for p_rbd_map_1 on node2 after failed start: > rc=1 (update=INFINITY, time=1450430559) > Dec 18 17:22:39 [2695] node2 crmd: info: > process_graph_event: Detected action (3.6) > p_rbd_map_1_start_0.48=unknown error: failed > Dec 18 17:22:39 [2693] node2 attrd: notice: > attrd_perform_update: Sent update 28: fail-count-p_rbd_map_1=INFINITY > Dec 18 17:22:39 [2690] node2cib: info: > cib_process_request: Forwarding cib_modify operation for section > status to master (origin=local/attrd/28) > Dec 18 17:22:39 [2695] node2 crmd: notice: run_graph: > Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=8, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-234.bz2): Stopped > Dec 18 17:22:39 [2695] node2 crmd: info: > do_state_transition: State transition S_TRANSITION_ENGINE -> > S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=notify_crmd ] > Dec 18 17:22:39 [2693] node2 attrd: notice: > attrd_trigger_update: Sending flush op to all hosts for: > last-failure-p_rbd_map_1 (1450430559) > Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op: > Diff: --- 0.69.162 2 > Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op: > Diff: +++ 0.69.163 (null) > Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op: > + /cib: @num_updates=163 > Dec 18 17:22:39 [2690] node2cib: info: cib_perform_op: > ++ > /cib/status/node_state[@id='node2']/transient_attributes[@id='node2']/instance_attributes[@id='status-node2']: > > name="fail-count-p_rbd_map_1" value="INFINITY"/> > . > > thanks > > > > > > > > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Removing OSD - double rebalance?
It's probably worth noting that if you're planning on removing multiple OSDs in this manner, you should make sure they are not in the same failure domain, per your CRUSH rules. For example, if you keep one replica per node and three copies (as in the default) and remove OSDs from multiple nodes without marking them as out first, you risk losing data if they are in the same placement group, depending on the number of replicas you have and the number of OSDs you simultaneously remove. That said, it would be safe in the above scenario to remove multiple OSDs from a single node simultaneously, since the CRUSH rules aren't placing multiple replicas on the same host. -Steve On 11/30/2015 04:33 AM, Wido den Hollander wrote: > > On 30-11-15 10:08, Carsten Schmitt wrote: >> Hi all, >> >> I'm running ceph version 0.94.5 and I need to downsize my servers >> because of insufficient RAM. >> >> So I want to remove OSDs from the cluster and according to the manual >> it's a pretty straightforward process: >> I'm beginning with "ceph osd out {osd-num}" and the cluster starts >> rebalancing immediately as expected. After the process is finished, the >> rest should be quick: >> Stop the daemon "/etc/init.d/ceph stop osd.{osd-num}" and remove the OSD >> from the crush map: "ceph osd crush remove {name}" >> >> But after entering the last command, the cluster starts rebalancing again. >> >> And that I don't understand: Shouldn't be one rebalancing process enough >> or am I missing something? >> > Well, for CRUSH this are two different things. First, the weight of the > node goes to 0 (zero), but it's still a part of the CRUSH map. > > Say, there are still 5 OSDS on that host, 4 with a weight of X and one > with a weight of zero. > > When you remove the OSD, there are only 4 OSDs left, that's a change for > CRUSH. > > What you should do in this case. Only remove the OSD from CRUSH and > don't mark it as out. > > When the cluster is done you can mark it out, but that won't cause a > rebalance since it's already out of the CRUSH map. > > It will still work with the other OSDs to migrate the data since the > cluster knows it had that PG information. > >> My config is pretty vanilla, except for: >> [osd] >> osd recovery max active = 4 >> osd max backfills = 4 >> >> Thanks in advance, >> Carsten >> >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
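To make the ordering Wido describes concrete, here's a rough sketch for a single OSD (osd.12 is a placeholder; wait for recovery to finish after the CRUSH removal before continuing):

# removing the OSD from CRUSH triggers the only rebalance
ceph osd crush remove osd.12

# after recovery completes, the rest causes no further data movement
ceph osd out 12
/etc/init.d/ceph stop osd.12
ceph auth del osd.12
ceph osd rm 12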
Re: [ceph-users] Can't activate osd in infernalis
> command_check_call >>> >>> > [ceph01][WARNIN] return subprocess.check_call(arguments) >>> >>> > [ceph01][WARNIN] File >>> "/usr/lib64/python2.7/subprocess.py", line >>> >>> > 542, in check_call >>> >>> > [ceph01][WARNIN] raise CalledProcessError(retcode, cmd) >>> >>> > [ceph01][WARNIN] subprocess.CalledProcessError: Command >>> >>> > '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', >>> '--mkkey', '-i', >>> >>> > '0', '--monmap', >>> '/var/lib/ceph/tmp/mnt.pmHRuu/activate.monmap', >>> >>> > '--osd-data', '/var/lib/ceph/tmp/mnt.pmHRuu', >>> '--osd-journal', >>> >>> > '/var/lib/ceph/tmp/mnt.pmHRuu/journal', '--osd-uuid', >>> >>> > 'de162e24-16b6-4796-b6b9-774fdb8ec234', '--keyring', >>> >>> > '/var/lib/ceph/tmp/mnt.pmHRuu/keyring', '--setuser', 'ceph', >>> >>> > '--setgroup', 'ceph']' returned non-zero exit status 1 >>> >>> > [ceph01][ERROR ] RuntimeError: command returned non-zero >>> exit status: 1 >>> >>> > [ceph_deploy][ERROR ] RuntimeError: Failed to execute >>> command: >>> >>> > ceph-disk -v activate --mark-init systemd --mount /dev/sda1 >>> >>> > >>> >>> > The output of ls -lahn in /var/lib/ceph/ is >>> >>> > >>> >>> > drwxr-x---. 9 167 167 4,0K 19. Nov 10:32 . >>> >>> > drwxr-xr-x. 28 0 0 4,0K 19. Nov 11:14 .. >>> >>> > drwxr-x---. 2 167 1676 10. Nov 13:06 bootstrap-mds >>> >>> > drwxr-x---. 2 167 167 25 19. Nov 10:48 bootstrap-osd >>> >>> > drwxr-x---. 2 167 1676 10. Nov 13:06 bootstrap-rgw >>> >>> > drwxr-x---. 2 167 1676 10. Nov 13:06 mds >>> >>> > drwxr-x---. 2 167 1676 10. Nov 13:06 mon >>> >>> > drwxr-x---. 2 167 1676 10. Nov 13:06 osd >>> >>> > drwxr-x---. 2 167 167 65 19. Nov 11:22 tmp >>> >>> > >>> >>> > >>> >>> > I hope someone can help me, I am really lost right now. >>> >>> > >>> >>> >>> >>> -- >>> >>> Mit freundlichen Grüßen >>> >>> >>> >>> David Riedl >>> >>> >>> >>> >>> >>> >>> >>> WINGcon GmbH Wireless New Generation - Consulting & Solutions >>> >>> >>> >>> Phone: +49 (0) 7543 9661 - 26 >>> <tel:%2B49%20%280%29%207543%209661%20-%2026> >>> >>> E-Mail: david.ri...@wingcon.com >>> >>> Web: http://www.wingcon.com >>> >>> >>> >>> Sitz der Gesellschaft: Langenargen >>> >>> Registergericht: ULM, HRB 632019 >>> >>> USt-Id.: DE232931635, WEEE-Id.: DE74015979 >>> >>> Geschäftsführer: Norbert Schäfer, Fritz R. Paul >>> >>> >>> >>> ___ >>> >>> ceph-users mailing list >>> >>> ceph-users@lists.ceph.com >>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >>> >>> >>> >>> >>> >>> -- >>> Mykola* * >> >> -- >> Mit freundlichen Grüßen >> >> David Riedl >> >> >> >> WINGcon GmbH Wireless New Generation - Consulting & Solutions >> >> Phone: +49 (0) 7543 9661 - 26 >> <tel:%2B49%20%280%29%207543%209661%20-%2026> >> E-Mail: david.ri...@wingcon.com <mailto:david.ri...@wingcon.com> >> Web: http://www.wingcon.com >> >> Sitz der Gesellschaft: Langenargen >> Registergericht: ULM, HRB 632019 >> USt-Id.: DE232931635, WEEE-Id.: DE74015979 >> Geschäftsführer: Norbert Schäfer, Fritz R. Paul >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> >> >> -- >> Mykola* * > > -- > Mit freundlichen Grüßen > > David Riedl > > > > WINGcon GmbH Wireless New Generation - Consulting & Solutions > > Phone: +49 (0) 7543 9661 - 26 > E-Mail: david.ri...@wingcon.com > Web: http://www.wingcon.com > > Sitz der Gesellschaft: Langenargen > Registergericht: ULM, HRB 632019 > USt-Id.: DE232931635, WEEE-Id.: DE74015979 > Geschäftsführer: Norbert Schäfer, Fritz R. 
Paul > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] upgrading 0.94.5 to 9.2.0 notes
On journal device permissions see my reply in "Can't activate osd in infernalis". Basically, if you set the partition type GUID to 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (the Ceph journal type GUID), the existing Ceph udev rules will set permissions on the partitions correctly at boot. Changing the ownership on the journal partitions manually will not persist across reboots. Easy reference: http://www.spinics.net/lists/ceph-users/msg23685.html -Steve On 11/20/2015 10:14 AM, Kenneth Waegeman wrote: > Hi, > > I recently started a test to upgrade ceph from 0.94.5 to 9.2.0 on > Centos7. I had some issues not mentioned in the release notes. Hereby > some notes: > > * Upgrading instructions are only in the release notes, not updated on > the upgrade page in the docs: > http://docs.ceph.com/docs/master/install/upgrading-ceph/ > > * Once you've updated the packages, `service ceph stop` or `service > ceph stop ` won't actually work anymore, is pointing to a > non-existing target. This is a step in the upgrade procedure I > couldn't do, I manually killed the processes. > [root@ceph001 ~]# service ceph stop osd > Redirecting to /bin/systemctl stop osd ceph.service > Failed to issue method call: Unit osd.service not loaded > > * You also need to chown the journal partitions used for the osds. > only chowning /var/lib/ceph is not enough > > * Permissions on log files are not completely ok. The /var/log/ceph > folder is owned by ceph, but existing files are still owned by root, > so I had to manually chown these, otherwise I got messages like this: > 2015-11-13 11:32:26.641870 7f55a4ffd700 1 mon.ceph003@2(peon).log > v4672 unable to write to '/var/log/ceph/ceph.log' for channel > 'cluster': (13) Permission denied > > .* I still get messages like these in the log files, not sure if they > are harmless or not: > > 2015-11-13 11:52:53.840414 7f610f376700 -1 lsb_release_parse - pclose > failed: (13) Permission denied > > * systemctl start ceph.target does not start my osds.., I have to > start them all with systemctl start ceph-osd@... > * systemctl restart ceph.target restart the running osds, but not the > osds that are not yet running. > * systemctl stop ceph.target stops everything, as expected :) > > I didn't tested everything thoroughly yet, but does someone has seen > the same issues? > > Thanks! > > Kenneth > _______ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
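If an existing journal partition doesn't already carry that GUID, a sketch of how to set it with sgdisk (partition 1 on /dev/sdX is a placeholder; verify the device and partition number first):

# mark partition 1 as a Ceph journal so the udev rules chown it at boot
sgdisk --typecode=1:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdX
partprobe /dev/sdX      # or udevadm trigger, or a reboot, to re-run the rules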
[ceph-users] Interesting postmortem on SSDs from Algolia
There's often a great deal of discussion about which SSDs to use for journals, and why some of the cheaper SSDs end up being more expensive in the long run. The recent blog post at Algolia, though not Ceph specific, provides a good illustration of exactly how insidious kernel/SSD interactions can be. Thought the list might find it interesting.

https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

-Steve

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to backup hundreds or thousands of TB
Wissenschaft, Forschung und Kunst Baden-Württemberg
Geschäftsführer: Prof. Thomas Schadt

--
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Managing larger ceph clusters
, but feel that moving the management of our clusters to standard tools would provide a little more consistency and help prevent some mistakes that have happened while using ceph-deploy. We're looking at using the same tools we use in our OpenStack environment (puppet/ansible), but I'm interested in hearing from people using chef/salt/juju as well.

Some of the cluster operation tasks that I can think of, along with ideas/concerns I have, are:

Keyring management
Seems like hiera-eyaml is a natural fit for storing the keyrings.

ceph.conf
I believe the puppet ceph module can be used to manage this file, but I'm wondering if using a template (erb?) might be a better method of keeping it organized and properly documented.

Pool configuration
The puppet module seems to be able to handle managing replicas and the number of placement groups, but I don't see support for erasure coded pools yet. This is probably something we would want the initial configuration to be set up by puppet, but not something we would want puppet changing on a production cluster.

CRUSH maps
Describing the infrastructure in yaml makes sense. Things like which servers are in which rows/racks/chassis. Also describing the type of server (model, number of HDDs, number of SSDs) makes sense.

CRUSH rules
I could see puppet managing the various rules based on the backend storage (HDD, SSD, primary affinity, erasure coding, etc).

Replacing a failed HDD disk
Do you automatically identify the new drive and start using it right away? I've seen people talk about using a combination of udev and special GPT partition IDs to automate this. If you have a cluster with thousands of drives I think automating the replacement makes sense. How do you handle the journal partition on the SSD? Does removing the old journal partition and creating a new one create a hole in the partition map (because the old partition is removed and the new one is created at the end of the drive)?

Replacing a failed SSD journal
Has anyone automated recreating the journal drive using Sebastien Han's instructions, or do you have to rebuild all the OSDs as well?
http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/

Adding new OSD servers
How are you adding multiple new OSD servers to the cluster? I could see an ansible playbook which disables nobackfill, noscrub, and nodeep-scrub followed by adding all the OSDs to the cluster being useful.

Upgrading releases
I've found an ansible playbook for doing a rolling upgrade which looks like it would work well, but are there other methods people are using?
http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansible/

Decommissioning hardware
Seems like another ansible playbook for reducing the OSDs' weights to zero, marking the OSDs out, stopping the service, removing the OSD ID, removing the CRUSH entry, unmounting the drives, and finally removing the server would be the best method here (see the command sketch at the end of this thread). Any other ideas on how to approach this?

That's all I can think of right now. Are there any other tasks that people have run into that are missing from this list?

Thanks,
Bryan

This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable.
This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu
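Referring back to the "Decommissioning hardware" item above, a rough sketch of the manual steps such a playbook would wrap for each OSD (osd.42 and its mount point are placeholders; after the reweight, wait for the rebalance to finish before continuing):

ceph osd crush reweight osd.42 0      # drain the OSD
ceph osd out 42
service ceph stop osd.42              # or systemctl stop ceph-osd@42 on systemd hosts
ceph osd crush remove osd.42
ceph auth del osd.42
ceph osd rm 42
umount /var/lib/ceph/osd/ceph-42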
Re: [ceph-users] Replication question
Actually, it's more like 41TB. It's a bad idea to run at near full capacity (by default past 85%) because you need some space where Ceph can replicate data as part of its healing process in the event of disk or node failure. You'll get a health warning when you exceed this ratio. You can use erasure coding to increase the amount of data you can store beyond 41TB, but you'll still need some replicated disk as a caching layer in front of the erasure coded pool if you're using RBD. See: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/036430.html As to how much space you can save with erasure coding, that will depend on if you're using RBD and need a cache layer and the values you set for k and m (number of data chunks and coding chunks). There's been some discussion on the list with regards to choosing those values. -Steve On 03/12/2015 10:07 AM, Thomas Foster wrote: I am looking into how I can maximize my space with replication, and I am trying to understand how I can do that. I have 145TB of space and a replication of 3 for the pool and was thinking that the max data I can have in the cluster is ~47TB in my cluster at one time..is that correct? Or is there a way to get more data into the cluster with less space using erasure coding? Any help would be greatly appreciated. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
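For anyone who wants the arithmetic spelled out, the 41TB figure is just the raw capacity divided by the replica count and scaled by the default 0.85 full ratio (a rough estimate that ignores filesystem and journal overhead):

# usable ~= raw / replicas * full_ratio
echo "145 / 3 * 0.85" | bc -l     # ~41.08 TB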
[ceph-users] import-diff requires snapshot exists?
Hello, I've been playing with backing up images from my production site (running 0.87) to my backup site (running 0.87.1) using export/import and export-diff/import-diff. After initially exporting and importing the image (rbd/small to backup/small) I took a snapshot (called test1) on the production cluster, ran export-diff from that snapshot, and then attempted to import-diff the diff file on the backup cluster. # rbd import-diff ./foo.diff backup/small start snapshot 'test1' does not exist in the image, aborting Importing image diff: 0% complete...failed. rbd: import-diff failed: (22) Invalid argument This works fine if I create a test1 snapshot on the backup cluster before running import-diff. However, it appears that the changes get written into backup/small not backup/small@test1. So unless I'm not understanding something, it seems like the content of the snapshot on the backup cluster is of no importance, which makes me wonder why it must exist at all. Any thoughts? Thanks! -Steve -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] import-diff requires snapshot exists?
Jason, Ah, ok that makes sense. I was forgetting snapshots are read-only. Thanks! My plan was to do something like this. First, create a sync snapshot and seed the backup: rbd snap create rbd/small@sync rbd export rbd/small@sync ./foo rbd import ./foo backup/small rbd snap create backup/small@sync Then each day, create a daily snap on the backup cluster: rbd snap create backup/small@2015-02-03 Then send that day's changes: rbd export-diff --from-snap sync rbd/small ./foo.diff rbd import-diff ./foo.diff rbd/small Then remove and recreate the daily snap marker to prepare for the next sync. rbd snap rm rbd/small@sync rbd snap rm backup/small@sync rbd snap create rbd/small@sync rbd snap create backup/small@sync Finally remove any dated snapshots on the remote cluster outside the retention window. -Steve On 03/03/2015 04:37 PM, Jason Dillaman wrote: Snapshots are read-only, so all changes to the image can only be applied to the HEAD revision. In general, you should take a snapshot prior to export / export-diff to ensure consistent images: rbd snap create rbd/small@snap1 rbd export rbd/small@snap1 ./foo rbd import ./foo backup/small rbd snap create backup/small@snap1 ** rbd/small and backup/small are now consistent through snap1 -- rbd/small might have been modified post snapshot rbd snap create rbd/small@snap2 rbd export-diff --from-snap snap1 rbd/small@snap2 ./foo.diff rbd import-diff ./foo.diff backup/small ** rbd/small and backup/small are now consistent through snap2. import-diff automatically created backup/small@snap2 after importing all changes. -- Jason Dillaman Red Hat dilla...@redhat.com http://www.redhat.com - Original Message - From: Steve Anthony sma...@lehigh.edu To: ceph-users@lists.ceph.com Sent: Tuesday, March 3, 2015 2:06:44 PM Subject: [ceph-users] import-diff requires snapshot exists? Hello, I've been playing with backing up images from my production site (running 0.87) to my backup site (running 0.87.1) using export/import and export-diff/import-diff. After initially exporting and importing the image (rbd/small to backup/small) I took a snapshot (called test1) on the production cluster, ran export-diff from that snapshot, and then attempted to import-diff the diff file on the backup cluster. # rbd import-diff ./foo.diff backup/small start snapshot 'test1' does not exist in the image, aborting Importing image diff: 0% complete...failed. rbd: import-diff failed: (22) Invalid argument This works fine if I create a test1 snapshot on the backup cluster before running import-diff. However, it appears that the changes get written into backup/small not backup/small@test1. So unless I'm not understanding something, it seems like the content of the snapshot on the backup cluster is of no importance, which makes me wonder why it must exist at all. Any thoughts? Thanks! -Steve -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
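Put together, one day's cycle following Jason's pattern might look roughly like the sketch below. The image names match the example above; the date-based snapshot names and the assumption that both clusters are reachable from the host running this (e.g. via separate --cluster/--conf arguments, omitted here) are mine:

TODAY=$(date +%F)              # e.g. 2015-03-04
PREV=2015-03-03                # whatever snapshot was shipped last

rbd snap create rbd/small@"$TODAY"
rbd export-diff --from-snap "$PREV" rbd/small@"$TODAY" ./small-"$TODAY".diff
rbd import-diff ./small-"$TODAY".diff backup/small   # note the diff is imported into backup/small; import-diff creates backup/small@$TODAY automatically

# then prune snapshots outside the retention window on both clusters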
[ceph-users] 85% of the cluster won't start, or how I learned why to use disk UUIDs
someone from making the same mistakes! -Steve -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph as a primary storage for owncloud
I tried this a while back. In my setup, I exposed a block device with rbd on the owncloud host and tried sharing an image to the owncloud host via NFS. If I recall correctly, both worked fine (I didn't try S3). The problem I had at the time (maybe 6-12 months ago) was that owncloud didn't support enough automated management of LDAP group permissions for me to easily deploy and manage it for 1000+ users. It is on my list of things to revisit however, so I'd be curious to hear how things go for you. If it doesn't work out, I'd also recommend checking out Pydio. It didn't make it into production in my environment (I didn't have time to focus on it), but I liked its user management better than owncloud's at the time. -Steve On 01/27/2015 05:05 AM, Simone Spinelli wrote: Dear all, we would like to use ceph as a primary (object) storage for owncloud. Did anyone already do this? I mean: is that actually possible, or am I wrong? As I understand it, I have to use radosGW in its swift flavor, but what about the s3 flavor? I cannot find anything official, hence my question. Do you have any advice, or can you point me to some kind of documentation/how-to? I know that maybe this is not the right place for these questions, but I also asked owncloud's community... in the meantime... Every answer is appreciated! Thanks Simone -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu signature.asc Description: OpenPGP digital signature ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osd troubleshooting
Shiva, You need to connect to the host where the OSD is located and stop it there by invoking: service ceph stop osd.1 I don't think there's a way to stop and start OSDs from an admin node, unless I missed a change that provides this functionality. -Steve On 11/04/2014 10:59 PM, shiva rkreddy wrote: Hi, I'm trying to run osd troubleshooting commands. Use case: stopping an osd without re-balancing. # ceph osd set noout // this command works. But neither of the following works: # stop ceph-osd id=1 (Error message: "no valid command found; 10 closest matches:" ...) or # ceph osd stop osd.1 (Error message: "stop: Unknown job: ceph-osd") Environment: ceph: 0.80.7 OS: RHEL6.5 upstart-0.6.5-13.el6_5.3.x86_64 ceph-0.80.7-0.el6.x86_64 ceph-common-0.80.7-0.el6.x86_64 Thanks, shiva ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
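For the archives, the whole sequence on a sysvinit install (as on RHEL 6.5 here) looks roughly like this; on upstart-based installs the stop/start lines become "stop ceph-osd id=1" / "start ceph-osd id=1", again run on the OSD host:

# from any node with an admin keyring: stop the cluster from marking OSDs out
ceph osd set noout

# on the host that holds the OSD:
service ceph stop osd.1
# ... do the maintenance ...
service ceph start osd.1

# when everything is back up, allow marking-out again
ceph osd unset noout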
Re: [ceph-users] journals relabeled by OS, symlinks broken
Nice. Thanks all, I'll adjust my scripts to call ceph-deploy using /dev/disk/by-id for future ODSs. I tried stopping an existing OSD on another node (which is working - osd.33 in this case), changing /var/lib/ceph/osd/ceph-33/journal to point to the same partition using /dev/disk/by-id, and starting the OSD again, but it fails to start with: 2014-10-27 11:03:31.607060 7fa65018e780 -1 filestore(/var/lib/ceph/osd/ceph-33) mount failed to open journal /var/lib/ceph/osd/ceph-33/journal: (2) No such file or directory 2014-10-27 11:03:31.617262 7fa65018e780 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-33: (2) No such file or directory The journal symlink exists and points to the same partition as before when it was /dev/sde1. Can I not change these existing symlinks manually to point to the same partition using /dev/disk/by-id? -Steve On 10/27/2014 12:44 PM, Mariusz Gronczewski wrote: * /dev/disk/by-id by-path will change if you connect it to different controller, or replace your controller with other model, or put it in different pci slot On Sat, 25 Oct 2014 17:20:58 +, Scott Laird sc...@sigkill.org wrote: You'd be best off using /dev/disk/by-path/ or similar links; that way they follow the disks if they're renamed again. On Fri, Oct 24, 2014, 9:40 PM Steve Anthony sma...@lehigh.edu wrote: Hello, I was having problems with a node in my cluster (Ceph v0.80.7/Debian Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when it came back up. Now all the symlinks to the journals are broken. The SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde: root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal - /dev/sde1 lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal - /dev/sdd1 lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal - /dev/sdc1 lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal - /dev/sde2 lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal - /dev/sdc2 lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal - /dev/sdd2 lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal - /dev/sde3 lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal - /dev/sdc3 lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal - /dev/sdd3 lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal - /dev/sde4 lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal - /dev/sdd4 Any way to fix this without just removing all the OSDs and re-adding them? I thought about recreating the symlinks to point at the new SSD labels, but I figured I'd check here first. Thanks! -Steve -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
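In case it saves someone a lookup: ceph-deploy should take the by-id paths directly in the usual HOST:DATA:JOURNAL form, so new OSDs can be created with stable device references from the start. The device names below are made up, so treat this as a sketch only:

ceph-deploy osd create \
  ceph17:/dev/disk/by-id/ata-EXAMPLE_DATA_DISK:/dev/disk/by-id/ata-EXAMPLE_SSD-part1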
Re: [ceph-users] journals relabeled by OS, symlinks broken
Oh, hey look at that. I must have screwed something up before. I thought it was strange that it didn't work. Works now, thanks! -Steve On 10/27/2014 03:20 PM, Scott Laird wrote: Double-check that you did it right. Does 'ls -lL /var/lib/ceph/osd/ceph-33/journal' resolve to a block-special device? On Mon Oct 27 2014 at 12:12:20 PM Steve Anthony sma...@lehigh.edu mailto:sma...@lehigh.edu wrote: Nice. Thanks all, I'll adjust my scripts to call ceph-deploy using /dev/disk/by-id for future ODSs. I tried stopping an existing OSD on another node (which is working - osd.33 in this case), changing /var/lib/ceph/osd/ceph-33/journal to point to the same partition using /dev/disk/by-id, and starting the OSD again, but it fails to start with: 2014-10-27 11:03:31.607060 7fa65018e780 -1 filestore(/var/lib/ceph/osd/ceph-33) mount failed to open journal /var/lib/ceph/osd/ceph-33/journal: (2) No such file or directory 2014-10-27 11:03:31.617262 7fa65018e780 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-33: (2) No such file or directory The journal symlink exists and points to the same partition as before when it was /dev/sde1. Can I not change these existing symlinks manually to point to the same partition using /dev/disk/by-id? -Steve On 10/27/2014 12:44 PM, Mariusz Gronczewski wrote: * /dev/disk/by-id by-path will change if you connect it to different controller, or replace your controller with other model, or put it in different pci slot On Sat, 25 Oct 2014 17:20:58 +, Scott Laird sc...@sigkill.org mailto:sc...@sigkill.org wrote: You'd be best off using /dev/disk/by-path/ or similar links; that way they follow the disks if they're renamed again. On Fri, Oct 24, 2014, 9:40 PM Steve Anthony sma...@lehigh.edu mailto:sma...@lehigh.edu wrote: Hello, I was having problems with a node in my cluster (Ceph v0.80.7/Debian Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when it came back up. Now all the symlinks to the journals are broken. The SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde: root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal - /dev/sde1 lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal - /dev/sdd1 lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal - /dev/sdc1 lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal - /dev/sde2 lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal - /dev/sdc2 lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal - /dev/sdd2 lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal - /dev/sde3 lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal - /dev/sdc3 lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal - /dev/sdd3 lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal - /dev/sde4 lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal - /dev/sdd4 Any way to fix this without just removing all the OSDs and re-adding them? I thought about recreating the symlinks to point at the new SSD labels, but I figured I'd check here first. Thanks! 
-Steve -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu mailto:sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu mailto:sma...@lehigh.edu -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
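So, for anyone else hitting this, the sequence that worked for an existing OSD boils down to the following (OSD id from this thread; <ssd-by-id> stands for whatever ls -l /dev/disk/by-id/ shows for that journal partition):

service ceph stop osd.33
ln -sf /dev/disk/by-id/<ssd-by-id>-part1 /var/lib/ceph/osd/ceph-33/journal
ls -lL /var/lib/ceph/osd/ceph-33/journal   # should resolve to a block special device (brw-...)
service ceph start osd.33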
[ceph-users] journals relabeled by OS, symlinks broken
Hello, I was having problems with a node in my cluster (Ceph v0.80.7/Debian Wheezy/Kernel 3.12), so I rebooted it and the disks were relabled when it came back up. Now all the symlinks to the journals are broken. The SSDs are now sda, sdb, and sdc but the journals were sdc, sdd, and sde: root@ceph17:~# ls -l /var/lib/ceph/osd/ceph-*/journal lrwxrwxrwx 1 root root 9 Oct 20 16:47 /var/lib/ceph/osd/ceph-150/journal - /dev/sde1 lrwxrwxrwx 1 root root 9 Oct 20 16:53 /var/lib/ceph/osd/ceph-157/journal - /dev/sdd1 lrwxrwxrwx 1 root root 9 Oct 21 08:31 /var/lib/ceph/osd/ceph-164/journal - /dev/sdc1 lrwxrwxrwx 1 root root 9 Oct 21 16:33 /var/lib/ceph/osd/ceph-171/journal - /dev/sde2 lrwxrwxrwx 1 root root 9 Oct 22 10:50 /var/lib/ceph/osd/ceph-178/journal - /dev/sdc2 lrwxrwxrwx 1 root root 9 Oct 22 15:48 /var/lib/ceph/osd/ceph-184/journal - /dev/sdd2 lrwxrwxrwx 1 root root 9 Oct 23 10:46 /var/lib/ceph/osd/ceph-191/journal - /dev/sde3 lrwxrwxrwx 1 root root 9 Oct 23 15:22 /var/lib/ceph/osd/ceph-195/journal - /dev/sdc3 lrwxrwxrwx 1 root root 9 Oct 23 16:59 /var/lib/ceph/osd/ceph-201/journal - /dev/sdd3 lrwxrwxrwx 1 root root 9 Oct 24 21:32 /var/lib/ceph/osd/ceph-214/journal - /dev/sde4 lrwxrwxrwx 1 root root 9 Oct 24 21:33 /var/lib/ceph/osd/ceph-215/journal - /dev/sdd4 Any way to fix this without just removing all the OSDs and re-adding them? I thought about recreating the symlinks to point at the new SSD labels, but I figured I'd check here first. Thanks! -Steve -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
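A quick way to see what each journal link currently resolves to, and which stable /dev/disk/by-id name that corresponds to, before changing anything (a rough sketch, run as root on the OSD host):

for j in /var/lib/ceph/osd/ceph-*/journal; do
    dev=$(basename "$(readlink -f "$j")")    # e.g. sde1
    echo "== $j -> $dev"
    ls -l /dev/disk/by-id/ | grep -w "$dev"  # by-id symlinks pointing at that partition
done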
[ceph-users] get amount of space used by snapshots
Hello, If I have an rbd image and a series of snapshots of that image, is there a fast way to determine how much space the objects composing the original image and all the snapshots are using in the cluster, or even just the space used by the snaps? The only way I've been able to find so far is to get the block_name_prefix for the image with rbd info and then grep for that prefix in the output of rados ls, eg. rados ls|grep rb.0.396de.238e1f29|wc -l. This is relatively slow, printing ~250 objects/s, which means hours to count through 10s of TB of objects. Basically, if I'm keeping daily snapshots for a set of images, I'd like to be able to tell how much space those snapshots are using so I can determine how frequently I need to prune old snaps. Thanks! -Steve -- Steve Anthony LTS HPC Support Specialist Lehigh University sma...@lehigh.edu ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
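For reference, that counting approach wrapped into a few lines (pool/image names are placeholders; the object count times the default 4MB object size is only an upper bound, since objects can be sparse, and the rados ls pass is still slow on large pools):

POOL=rbd
IMG=myimage    # placeholder image name
PREFIX=$(rbd info ${POOL}/${IMG} | awk '/block_name_prefix/ {print $2}')
COUNT=$(rados -p ${POOL} ls | grep -c "${PREFIX}")
echo "${COUNT} objects, at most $((COUNT * 4)) MB for ${POOL}/${IMG}"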
Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)
Ok, after some delays and the move to new network hardware I have an update. I'm still seeing the same low bandwidth and high retransmissions from iperf after moving to the Cisco 6001 (10Gb) and 2960 (1Gb). I've narrowed it down to transmissions from a 10Gb connected host to a 1Gb connected host. Taking a more targeted tcpdump, I discovered that there are multiple duplicate ACKs, triggering fast retransmissions between the two test hosts. There are several websites/articles which suggest that mixing 10Gb and 1Gb hosts causes performance issues, but no concrete explanation of why that's the case, and if it can be avoided without moving everything to 10Gb, eg. http://blogs.technet.com/b/networking/archive/2011/05/16/tcp-dupacks-and-tcp-fast-retransmits.aspx http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19856911/download.aspx [PDF] http://packetpushers.net/flow-control-storm-%E2%80%93-ip-storage-performance-effects/ I verified that it's not a flow control storm (the pause frame counters along the network path are zero), so assuming it might be bandwidth related I installed trickle and used it to limit the bandwidth of iperf to 1Gb; no change. I further restricted it down to 100Kbps, and was *still* seeing high retransmission. This seems to imply it's not purely bandwidth related. After further research, I noticed a difference of about 0.1ms in the RTT between two 10Gb hosts (intra-switch) and the 10Gb and 1Gb host (inter-switch). I theorized this may be affecting the retransmission timeout counter calculations, per: http://sgros.blogspot.com/2012/02/calculating-tcp-rto.html so I used ethtool to set the link plugged into the 10Gb 6001 to 1Gb; this immediately fixed the issue. After this change the difference in RTTs moved to about .025ms. Plugging another host into the old 10Gb FEX, I have 10Gb to 10Gb RTTs withing .001ms of 6001 to 2960 RTTs, and don't see the high retransmissions with iperf between those 10Gb hosts. tldr So, right now I don't see retransmissions between hosts on the same switch (even if speeds are mixed), but I do across switches when the hosts are mixed 10Gb/1Gb. Also, I wonder what the difference between process bandwidth limiting and link 1Gb negotiation is which leads to the differences observed. I checked the link per Mark's suggestion below, but all the values they increase in that old post are already lower than the defaults set on my hosts. If anyone has any ideas or explanations, I'd appreciate it. Otherwise, I'll keep the list posted if I uncover a solution or make more progress. Thanks. -Steve On 07/28/2014 01:21 PM, Mark Nelson wrote: On 07/28/2014 11:28 AM, Steve Anthony wrote: While searching for more information I happened across the following post (http://dachary.org/?p=2961) which vaguely resembled the symptoms I've been experiencing. I ran tcpdump and noticed what appeared to be a high number of retransmissions on the host where the images are mounted during a read from a Ceph rbd, so I ran iperf3 to get some concrete numbers: Very interesting that you are seeing retransmissions. Server: nas4 (where rbd images are mapped) Client: ceph2 (currently not in the cluster, but configured identically to the other nodes) Start server on nas4: iperf3 -s On ceph2, connect to server nas4, send 4096MB of data, report on 1 second intervals. Add -R to reverse the client/server roles. 
iperf3 -c nas4 -n 4096M -i 1 Summary of traffic going out the 1Gb interface to a switch [ ID] Interval Transfer Bandwidth Retr [ 5] 0.00-36.53 sec 4.00 GBytes 941 Mbits/sec 15 sender [ 5] 0.00-36.53 sec 4.00 GBytes 940 Mbits/sec receiver Reversed, summary of traffic going over the fabric extender [ ID] Interval Transfer Bandwidth Retr [ 5] 0.00-80.84 sec 4.00 GBytes 425 Mbits/sec 30756 sender [ 5] 0.00-80.84 sec 4.00 GBytes 425 Mbits/sec receiver Definitely looks suspect! It appears that the issue is related to the network topology employed. The private cluster network and nas4's public interface are both connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a Nexus 7000. This was meant as a temporary solution until our network team could finalize their design and bring up the Nexus 6001 for the cluster. From what our network guys have said, the FEX has been much more limited than they anticipated and they haven't been pleased with it as a solution in general. The 6001 is supposed be ready this week, so once it's online I'll move the cluster to that switch and re-test to see if this fixes the issues I've been experiencing. If it's not the hardware, one other thing you might want to test is to make sure it's not something similar to the autotuning issues we used to see. I don't think this should be an issue at this point given the code changes we made to address it, but it would be easy to test. Doesn't seem like
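For anyone retracing the tests above, the commands boil down to roughly this (interface names are examples, and the exact ethtool options that work will depend on the NIC/driver):

# on the receiver
iperf3 -s

# on the sender: send 4096MB, report every second; add -R to reverse direction
iperf3 -c nas4 -n 4096M -i 1
iperf3 -c nas4 -n 4096M -i 1 -R

# check the negotiated speed and (driver-dependent) pause frame counters
ethtool eth0
ethtool -S eth0 | grep -i pause

# force the 10Gb-connected port down to 1Gb, which is what made the
# retransmissions disappear here
ethtool -s eth0 speed 1000 duplex full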
Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)
Thanks for the information! Based on my reading of http://ceph.com/docs/next/rbd/rbd-config-ref I was under the impression that rbd cache options wouldn't apply, since presumably the kernel is handling the caching. I'll have to toggle some of those values and see it they make a difference in my setup. I did some additional testing today. If I limit the write benchmark to 1 concurrent operation I see a lower bandwidth number, as I expected. However, when writing to the XFS filesystem on an rbd image I see transfer rates closer to to 400MB/s. # rados -p bench bench 300 write --no-cleanup -t 1 Total time run: 300.105945 Total writes made: 1992 Write size: 4194304 Bandwidth (MB/sec): 26.551 Stddev Bandwidth: 5.69114 Max bandwidth (MB/sec): 40 Min bandwidth (MB/sec): 0 Average Latency:0.15065 Stddev Latency: 0.0732024 Max latency:0.617945 Min latency:0.097339 # time cp -a /mnt/local/climate /mnt/ceph_test1 real2m11.083s user0m0.440s sys1m11.632s # du -h --max-deph=1 /mnt/local 53G/mnt/local/climate This seems to imply that the there is more than one concurrent operation when writing into the filesystem on top of the rbd image. However, given that the filesystem read speeds and the rados benchmark read speeds are much closer in reported bandwidth, it's as if reads are occurring as a single operation. # time cp -a /mnt/ceph_test2/isos /mnt/local/ real36m2.129s user0m1.572s sys3m23.404s # du -h --max-deph=1 /mnt/ceph_test2/ 68G/mnt/ceph_test2/isos Is this apparent single-thread read and multi-thread write with the rbd kernel module the expected mode of operation? If so, could someone explain the reason for this limitation? Based on the information on data striping in http://ceph.com/docs/next/architecture/#data-striping I would assume that a format 1 image would stripe a file larger than the 4MB object size over multiple objects and that those objects would be distributed over multiple OSDs. This would seem to indicate that reading a file back would be much faster since even though Ceph is only reading the primary replica, the read is still distributed over multiple OSDs. At worst I would expect something near the read bandwidth of a single OSD, which would still be much higher than 30-40MB/s. -Steve On 07/24/2014 04:07 PM, Udo Lembke wrote: Hi Steve, I'm also looking for improvements of single-thread-reads. A little bit higher values (twice?) should be possible with your config. I have 5 nodes with 60 4-TB hdds and got following: rados -p test bench -b 4194304 60 seq -t 1 --no-cleanup Total time run:60.066934 Total reads made: 863 Read size:4194304 Bandwidth (MB/sec):57.469 Average Latency: 0.0695964 Max latency: 0.434677 Min latency: 0.016444 In my case I had some osds (xfs) with an high fragmentation (20%). Changing the mount options and defragmentation help slightly. Performance changes: [client] rbd cache = true rbd cache writethrough until flush = true [osd] osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M osd_op_threads = 4 osd_disk_threads = 4 But I expect much more speed for an single thread... Udo On 23.07.2014 22:13, Steve Anthony wrote: Ah, ok. That makes sense. With one concurrent operation I see numbers more in line with the read speeds I'm seeing from the filesystems on the rbd images. 
# rados -p bench bench 300 seq --no-cleanup -t 1 Total time run:300.114589 Total reads made: 2795 Read size:4194304 Bandwidth (MB/sec):37.252 Average Latency: 0.10737 Max latency: 0.968115 Min latency: 0.039754 # rados -p bench bench 300 rand --no-cleanup -t 1 Total time run:300.164208 Total reads made: 2996 Read size:4194304 Bandwidth (MB/sec):39.925 Average Latency: 0.100183 Max latency: 1.04772 Min latency: 0.039584 I really wish I could find my data on read speeds from a couple weeks ago. It's possible that they've always been in this range, but I remember one of my test users saturating his 1GbE link over NFS reading copying from the rbd client to his workstation. Of course, it's also possible that the data set he was using was cached in RAM when he was testing, masking the lower rbd speeds. It just seems counterintuitive to me that read speeds would be so much slower that writes at the filesystem layer in practice. With images in the 10-100TB range, reading data at 20-60MB/s isn't going to be pleasant. Can you suggest any tunables or other approaches
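One quick way to see how much of the gap is concurrency alone is to repeat the same bench at a few -t values (pool name as above; the write pass uses --no-cleanup so the read passes have objects to work with, and the benchmark_data objects it leaves behind need removing afterwards):

for t in 1 4 16; do
    rados -p bench bench 60 write --no-cleanup -t $t
    rados -p bench bench 60 seq -t $t
    rados -p bench bench 60 rand -t $t
done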