But it was absolutely awesome to run an OSD off of an RBD after the disk failed.
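
For anyone who wants to try the same trick, here is a rough sketch of
the procedure Steve describes below. Treat it as an example, not a
recipe: the pool/image names, OSD id 42, the 2T size, and the device
paths are all placeholders, and per David's caveat the RBD should live
on a second cluster.

# create and map an RBD at least as large as the failing disk
rbd create rescue/osd-42 --size 2T
rbd map rescue/osd-42                 # maps to, e.g., /dev/rbd0
# image the failing disk; 'sync' pads unreadable blocks with zeros
# so offsets stay aligned on the copy
dd if=/dev/sdb of=/dev/rbd0 bs=4M conv=noerror,sync
# repair the copied filestore filesystem, mount it where the dead
# disk lived, and bring the OSD back up
xfs_repair /dev/rbd0
mount /dev/rbd0 /var/lib/ceph/osd/ceph-42
systemctl start ceph-osd@42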
On Tue, Aug 29, 2017, 1:42 PM David Turner <drakonst...@gmail.com> wrote:

> To add to Steve's success: the RBD was created in a second cluster in
> the same datacenter, so it didn't run the risk of deadlocking that
> mapping RBDs on machines running OSDs has. In theory it would also
> work on the same cluster, but that is inherently more dangerous for a
> few reasons.
>
> On Tue, Aug 29, 2017, 1:15 PM Steve Taylor <steve.tay...@storagecraft.com>
> wrote:
>
>> Hong,
>>
>> Probably your best chance at recovering any data without special,
>> expensive, forensic procedures is to perform a dd from /dev/sdb to
>> somewhere else large enough to hold a full disk image and attempt to
>> repair that. You'll want to use 'conv=noerror' with your dd command
>> since your disk is failing. Then you could either re-attach the OSD
>> from the new source or attempt to retrieve objects from the
>> filestore on it.
>>
>> I have actually done this before by creating an RBD that matches the
>> disk size, performing the dd, running xfs_repair, and eventually
>> adding it back to the cluster as an OSD. An RBD as an OSD is
>> certainly a temporary arrangement for repair only, but I'm happy to
>> report that it worked flawlessly in my case. I was able to weight
>> the OSD to 0, offload all of its data, then remove it for a full
>> recovery, at which point I just deleted the RBD.
>>
>> The possibilities afforded by Ceph inception are endless. ☺
>>
>> Steve Taylor | Senior Software Engineer | StorageCraft Technology
>> Corporation
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> Office: 801.871.2799
>>
>> If you are not the intended recipient of this message or received it
>> erroneously, please notify the sender and delete it, together with
>> any attachments, and be advised that any dissemination or copying of
>> this message is prohibited.
>>
>> On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:
>> > Rules of thumb with batteries:
>> > - the closer to their proper temperature you run them, the more
>> >   life you get out of them
>> > - the more the battery is overpowered for your application, the
>> >   longer it will survive.
>> >
>> > Get yourself an LSI 94** controller, use it as an HBA, and you
>> > will be fine. But get MORE DRIVES !!!!! …
>> >
>> > > On 28 Aug 2017, at 23:10, hjcho616 <hjcho...@yahoo.com> wrote:
>> > >
>> > > Thank you Tomasz and Ronny. I'll have to order some HDDs soon
>> > > and try these out. The car battery idea is nice! I may try
>> > > that.. =) Do they last longer? The ones that fit the UPS's
>> > > original battery spec didn't last very long... part of the
>> > > reason why I gave up on them.. =P My wife probably won't like
>> > > the idea of a car battery hanging around, though, ha!
>> > >
>> > > The OSD1 (the one with mostly OK OSDs, except for that SMART
>> > > failure) motherboard doesn't have any additional SATA connectors
>> > > available. Would it be safe to add another OSD host?
>> > >
>> > > Regards,
>> > > Hong
>> > >
>> > > On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz
>> > > <tom.kusmierz@gmail.com> wrote:
>> > >
>> > > Sorry for being brutal … anyway:
>> > > 1. Get the battery for the UPS (a car battery will do as well;
>> > >    I've modded a UPS with a truck battery in the past and it
>> > >    worked like a charm :D ).
>> > > 2. Get spare drives and put them in, because your cluster CANNOT
>> > >    get out of its error state due to lack of space.
>> > > 3. Follow Ronny Aasen's advice on how to recover data from the
>> > >    hard drives (see the ddrescue sketch below).
>> > > 4. Get cooling on the drives or you will lose more!
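
On the data-recovery point: plain dd with conv=noerror works, but GNU
ddrescue is built for exactly this case and keeps a map of the bad
areas so a copy can be resumed and retried. A minimal sketch, assuming
ddrescue is installed; the destination device and mapfile path are
examples:

# first pass: grab everything that reads cleanly, skip bad areas
ddrescue -f -n /dev/sdb /dev/rbd0 /root/sdb.map
# second pass: retry the skipped areas up to three times
ddrescue -f -r3 /dev/sdb /dev/rbd0 /root/sdb.map

The mapfile is what lets ddrescue pick up where it left off if the
drive drops off the bus mid-copy, which failing drives love to do.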
>> > > > On 28 Aug 2017, at 22:39, hjcho616 <hjcho...@yahoo.com> wrote:
>> > > >
>> > > > Tomasz,
>> > > >
>> > > > Those machines are behind a surge protector. It doesn't appear
>> > > > to be a good one! I do have a UPS... but it is my fault... no
>> > > > battery. Power was pretty reliable for a while... and the UPS
>> > > > was just beeping every chance it had, disrupting some sleep..
>> > > > =P So it's running on the surge protector only. I am running
>> > > > this in a home environment, and so far HDD failures have been
>> > > > very rare here. =) It just doesn't get loaded as much! I am
>> > > > not sure what to expect, but seeing that "unfound" message,
>> > > > and the feeling that I might get the OSD back, made me
>> > > > excited. =) Thanks for letting me know what the priority
>> > > > should be. I just lack experience and knowledge in this. =)
>> > > > Please do continue to guide me through this.
>> > > >
>> > > > Thank you for decoding those SMART messages! I do agree it
>> > > > looks like the disk is on its way out. I would like to know
>> > > > how to get a good portion of it back if possible. =)
>> > > >
>> > > > I think I just set the size and min_size to 1:
>> > > > # ceph osd lspools
>> > > > 0 data,1 metadata,2 rbd,
>> > > > # ceph osd pool set rbd size 1
>> > > > set pool 2 size to 1
>> > > > # ceph osd pool set rbd min_size 1
>> > > > set pool 2 min_size to 1
>> > > >
>> > > > It seems to be doing some backfilling work now.
>> > > >
>> > > > # ceph health
>> > > > HEALTH_ERR 22 pgs are stuck inactive for more than 300
>> > > > seconds; 2 pgs backfill_toofull; 74 pgs backfill_wait; 3 pgs
>> > > > backfilling; 108 pgs degraded; 6 pgs down; 6 pgs inconsistent;
>> > > > 6 pgs peering; 7 pgs recovery_wait; 16 pgs stale; 108 pgs
>> > > > stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 130
>> > > > pgs stuck unclean; 101 pgs stuck undersized; 101 pgs
>> > > > undersized; 1 requests are blocked > 32 sec; recovery
>> > > > 1790657/4502340 objects degraded (39.772%); recovery
>> > > > 641906/4502340 objects misplaced (14.257%); recovery
>> > > > 147/2251990 unfound (0.007%); 50 scrub errors; mds cluster is
>> > > > degraded; no legacy OSD present but 'sortbitwise' flag is not
>> > > > set
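
A note on those 147 unfound objects: they are worth inspecting before
doing anything drastic. Roughly like this, where pg 2.28 stands in for
whichever pg IDs 'ceph health detail' actually reports:

ceph health detail                     # lists the pgs with unfound objects
ceph pg 2.28 list_unfound              # shows what is missing in one pg
ceph pg 2.28 mark_unfound_lost revert  # absolute last resort

mark_unfound_lost (revert or delete) permanently gives up on those
objects, so it only makes sense once recovery from the failed disk has
been exhausted.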
>> > > > Regards,
>> > > > Hong
>> > > >
>> > > > On Monday, August 28, 2017 4:18 PM, Tomasz Kusmierz
>> > > > <tom.kusmierz@gmail.com> wrote:
>> > > >
>> > > > So, to decode a few things about your disk:
>> > > >
>> > > > 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 37
>> > > > 37 read errors and only one sector marked as pending - fun
>> > > > disk :/
>> > > >
>> > > > 181 Program_Fail_Cnt_Total 0x0022 099 099 000 Old_age Always
>> > > > - 35325174
>> > > > So the firmware has quite a few bugs. That's nice.
>> > > >
>> > > > 191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 2855
>> > > > The disk was thrown around while operational. Even more nice.
>> > > >
>> > > > 194 Temperature_Celsius 0x0002 047 041 000 Old_age Always - 53
>> > > > (Min/Max 15/59)
>> > > > If your disk passes 50 you should not consider using it; high
>> > > > temperatures demagnetise the platter layer and you will see
>> > > > more errors in the very near future.
>> > > >
>> > > > 197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 1
>> > > > As mentioned before :)
>> > > >
>> > > > 200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always
>> > > > - 4222
>> > > > Your heads keep missing tracks … bent? I don't even know how
>> > > > to comment here.
>> > > >
>> > > > Generally, a fun drive you've got there … rescue as much as
>> > > > you can and throw it away !!!
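
Side note: the attribute table Tomasz decodes above comes straight
from smartmontools, so anyone following along can pull the same
numbers (the device path is an example):

smartctl -a /dev/sdb            # full attribute table plus error log
smartctl -t long /dev/sdb       # kick off an extended self-test
smartctl -l selftest /dev/sdb   # check the result afterwards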
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com