... however, it would be nice if ceph-volume would also create the
partitions for the WAL and/or DB if needed. Is there a special reason
why this is not implemented?

Dietmar


On 02/27/2018 04:25 PM, David Turner wrote:
> Gotcha.  As a side note, that setting is only used by ceph-disk as
> ceph-volume does not create partitions for the WAL or DB.  You need to
> create those partitions manually if using anything other than a whole
> block device when creating OSDs with ceph-volume.
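> 
> For example, roughly (a sketch only; the device names and the 40G size
> are placeholder assumptions, using Luminous ceph-volume):
> 
>     # carve a DB partition on the SSD by hand
>     sgdisk --new=1:0:+40G /dev/nvme0n1
>     # then hand both devices to ceph-volume when creating the OSD
>     ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1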
> 
> On Tue, Feb 27, 2018 at 8:20 AM Caspar Smit <caspars...@supernas.eu
> <mailto:caspars...@supernas.eu>> wrote:
> 
>     David,
> 
>     Yes, I know; I use 20GB partitions for 2TB disks as journals. It was
>     just to inform other people that Ceph's default of 1GB is pretty low.
>     Now that I read my own sentence it indeed looks as if I was using
>     1GB partitions, sorry for the confusion.
> 
>     Caspar
> 
>     2018-02-27 14:11 GMT+01:00 David Turner <drakonst...@gmail.com
>     <mailto:drakonst...@gmail.com>>:
> 
>         If you're only using a 1GB DB partition, there is a very real
>         possibility it's already 100% full. The safe estimate for DB
>         size seems to be 10GB per 1TB of OSD, so for a 4TB OSD a 40GB DB should work
>         for most use cases (except loads and loads of small files).
>         There are a few threads that mention how to check how much of
>         your DB partition is in use. Once it's full, it spills over to
>         the HDD.
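> 
>         One way to check that (a sketch; counter names may differ slightly
>         between releases) is the bluefs section of the OSD's perf counters:
> 
>             ceph daemon osd.0 perf dump bluefs
>             # compare db_used_bytes with db_total_bytes; a non-zero
>             # slow_used_bytes means the DB has already spilled onto the HDD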
> 
> 
>         On Tue, Feb 27, 2018, 6:19 AM Caspar Smit
>         <caspars...@supernas.eu <mailto:caspars...@supernas.eu>> wrote:
> 
>             2018-02-26 23:01 GMT+01:00 Gregory Farnum
>             <gfar...@redhat.com <mailto:gfar...@redhat.com>>:
> 
>                 On Mon, Feb 26, 2018 at 3:23 AM Caspar Smit
>                 <caspars...@supernas.eu <mailto:caspars...@supernas.eu>>
>                 wrote:
> 
>                     2018-02-24 7:10 GMT+01:00 David Turner
>                     <drakonst...@gmail.com <mailto:drakonst...@gmail.com>>:
> 
>                         Caspar, it looks like your idea should work.
>                         Worst case scenario seems to be that the OSD
>                         wouldn't start; you'd put the old SSD back in and
>                         go back to the idea of weighting them to 0,
>                         backfilling, then recreating the OSDs. Definitely
>                         worth a try in my opinion, and I'd love to hear
>                         about your experience afterwards.
> 
> 
>                     Hi David,
> 
>                     First of all, thank you for ALL your answers on this
>                     ML; you're really putting a lot of effort into
>                     answering the many questions asked here, and your
>                     answers very often contain invaluable information.
> 
> 
>                     To follow up on this post, I went out and built a
>                     very small (Proxmox) cluster (3 OSDs per host) to
>                     test my suggestion of cloning the DB/WAL SSD. And it
>                     worked!
>                     Note: this was on Luminous v12.2.2 (all BlueStore,
>                     ceph-disk based OSDs)
> 
>                     Here's what I did on one node:
> 
>                     1) ceph osd set noout
>                     2) systemctl stop ceph-osd@0; systemctl stop
>                     ceph-osd@1; systemctl stop ceph-osd@2
>                     3) ddrescue -f -n -vv <old SSD dev> <new SSD dev>
>                     /root/clone-db.log
>                     4) removed the old SSD physically from the node
>                     5) checked with "ceph -s" and already saw HEALTH_OK
>                     and all OSDs up/in
>                     6) ceph osd unset noout
> 
>                     I assume that once the ddrescue step is finished, a
>                     'partprobe' or something similar is triggered, udev
>                     finds the DB partitions on the new SSD, and the OSDs
>                     are started again (kind of like what happens during
>                     hotplug).
>                     So it is probably better to clone the SSD in another
>                     (non-Ceph) system so as not to trigger any udev events.
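> 
>                     If the OSDs don't come back on their own, something
>                     along these lines (a sketch; the device name is an
>                     example) re-reads the partition table and lets you
>                     check the symlinks:
> 
>                         partprobe /dev/sdX
>                         ls -l /var/lib/ceph/osd/ceph-*/block.db
>                         # the symlinks should resolve to partitions on the new SSD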
> 
>                     I also tested a reboot after this and everything
>                     still worked.
> 
> 
>                     The old SSD was 120GB and the new one is 256GB
>                     (cloning took around 4 minutes).
>                     The delta of data was very low because it was a test
>                     cluster.
> 
>                     All in all, the OSDs in question were 'down' for
>                     only 5 minutes (so I stayed within the default
>                     mon_osd_down_out_interval of 10 minutes and didn't
>                     actually need to set noout :)
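> 
>                     That interval can be verified on a monitor, e.g. (a
>                     sketch; the mon name is an example, the default is
>                     600 seconds):
> 
>                         ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval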
> 
> 
>                 I kicked off a brief discussion about this with some of
>                 the BlueStore guys and they're aware of the problem with
>                 migrating across SSDs, but so far it's just a Trello
>                 card:
>                 https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db
>                 They do confirm you should be okay with dd'ing things
>                 across, assuming symlinks get set up correctly as David
>                 noted.
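> 
>                 A quick way to verify those symlinks after cloning (a
>                 sketch, assuming the default data paths):
> 
>                     # each block.db symlink should resolve to a partition
>                     # on the new SSD
>                     for osd in /var/lib/ceph/osd/ceph-*; do
>                         readlink -f "$osd/block.db"
>                     done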
> 
>              
>             Great that it is on the radar to be addressed. This method
>             does feel hacky.
>              
> 
>                 I've got some other bad news, though: BlueStore has
>                 internal metadata about the size of the block device
>                 it's using, so if you copy it onto a larger block
>                 device, it will not actually make use of the additional
>                 space. :(
>                 -Greg
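> 
>                 The size BlueStore recorded at mkfs time is visible in the
>                 device label, e.g. (a sketch; the path is the default for
>                 osd.0):
> 
>                     ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db
>                     # the "size" field stays at the value recorded when the
>                     # OSD was created, no matter how big the new partition is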
> 
> 
>             Yes, I was well aware of that, no problem. The reason is that
>             the smaller SSD sizes are simply not being made anymore, or
>             have been discontinued by the manufacturer.
>             It would be nice, though, if the DB could be resized in the
>             future; the default 1GB DB size seems very small to me.
> 
>             Caspar
>              
> 
>                  
> 
> 
>                     Kind regards,
>                     Caspar
> 
>                      
> 
>                         Nico, it is not possible to change the WAL or DB
>                         size, location, etc. after OSD creation. If you
>                         want to change the configuration of the OSD
>                         after creation, you have to remove it from the
>                         cluster and recreate it. There is nothing
>                         comparable to the way you could move, recreate,
>                         etc. FileStore OSD journals. I think this might
>                         be on the radar as a feature, but I don't know
>                         for certain. I definitely consider it to be a
>                         regression in BlueStore.
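> 
>                         The remove-and-recreate cycle roughly looks like
>                         this (a sketch; the OSD id and device names are
>                         examples):
> 
>                             ceph osd out 5            # then wait for backfill to finish
>                             systemctl stop ceph-osd@5
>                             ceph osd purge 5 --yes-i-really-mean-it
>                             ceph-volume lvm create --bluestore --data /dev/sdf \
>                                 --block.db /dev/nvme0n1p2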
> 
> 
> 
> 
>                         On Fri, Feb 23, 2018, 9:13 AM Nico Schottelius
>                         <nico.schottel...@ungleich.ch
>                         <mailto:nico.schottel...@ungleich.ch>> wrote:
> 
> 
>                             A very interesting question, and I would add
>                             the follow-up question:
> 
>                             Is there an easy way to add an external
>                             DB/WAL device to an existing
>                             OSD?
> 
>                             I suspect that it might be something along
>                             the lines of:
> 
>                             - stop osd
>                             - create a link in
>                             ...ceph/osd/ceph-XX/block.db to the target
>                             device
>                             - (maybe run some kind of osd mkfs ?)
>                             - start osd
> 
>                             Has anyone done this so far, or does anyone
>                             have recommendations on how to do it?
> 
>                             Which also makes me wonder: what is actually
>                             the format of the WAL and the
>                             block DB in BlueStore? Is there any
>                             documentation available about it?
> 
>                             Best,
> 
>                             Nico
> 
> 
>                             Caspar Smit <caspars...@supernas.eu
>                             <mailto:caspars...@supernas.eu>> writes:
> 
>                             > Hi All,
>                             >
>                             > What would be the proper way to preventively
>                             > replace a DB/WAL SSD (when it is nearing its
>                             > DWPD/TBW limit and has not failed yet)?
>                             >
>                             > It hosts DB partitions for 5 OSDs.
>                             >
>                             > Maybe something like:
>                             >
>                             > 1) ceph osd reweight the 5 OSDs to 0
>                             > 2) let backfilling complete
>                             > 3) destroy/remove the 5 OSDs
>                             > 4) replace the SSD
>                             > 5) create 5 new OSDs with separate DB
>                             > partitions on the new SSD
>                             >
>                             > When these 5 OSDs are big HDDs (8TB), a LOT
>                             > of data has to be moved, so I thought maybe
>                             > the following would work:
>                             >
>                             > 1) ceph osd set noout
>                             > 2) stop the 5 OSDs (systemctl stop)
>                             > 3) 'dd' the old SSD to a new SSD of the same
>                             > or bigger size
>                             > 4) remove the old SSD
>                             > 5) start the 5 OSDs (systemctl start)
>                             > 6) let backfilling/recovery complete (only
>                             > the delta of data between OSD stop and now)
>                             > 7) ceph osd unset noout
>                             >
>                             > Would this be a viable method to replace a
>                             > DB SSD? Is there any udev/serial nr/uuid
>                             > stuff preventing this from working?
>                             >
>                             > Or is there another 'less hacky' way to
>                             > replace a DB SSD without moving too
>                             > much data?
>                             >
>                             > Kind regards,
>                             > Caspar
>                             >
> 
> 
>                             --
>                             Modern, affordable, Swiss Virtual Machines.
>                             Visit www.datacenterlight.ch
>                             <http://www.datacenterlight.ch>
> 
> 
> 
> 
> 
> 


-- 
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
