Seems like a message bus would be nice. Each opener of an RBD could subscribe for messages on the bus for that RBD. Anytime the map is modified, a message could be put on the bus to update the others. That opens up a whole other can of worms though.
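To make the idea concrete, here is a minimal sketch in plain Python. None of this is an existing librbd or librados API; the class and names are made up for illustration. Each opener of an image subscribes to a per-image channel, and whoever changes the object map publishes the change so the other openers can update their cached copy:

    # Hypothetical per-image message bus -- illustration only, not a Ceph API.
    from collections import defaultdict
    from typing import Callable, Dict, List

    class ImageBus:
        """Toy pub/sub bus keyed by RBD image name."""

        def __init__(self) -> None:
            self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

        def subscribe(self, image: str, callback: Callable[[dict], None]) -> None:
            # Every opener of `image` registers a callback for map updates.
            self._subscribers[image].append(callback)

        def publish(self, image: str, update: dict) -> None:
            # Whoever changes the object map broadcasts the change to the others.
            for callback in self._subscribers[image]:
                callback(update)

    # Usage: two openers of the same image stay in sync on object existence.
    bus = ImageBus()
    object_map = {"client-a": set(), "client-b": set()}

    bus.subscribe("vol-x318644f-0", lambda u: object_map["client-a"].update(u["written"]))
    bus.subscribe("vol-x318644f-0", lambda u: object_map["client-b"].update(u["written"]))

    # Client B writes object 0xe6ad and announces it on the bus.
    bus.publish("vol-x318644f-0", {"written": {0xE6AD}})
    print(object_map["client-a"])  # {59053} -> both openers now agree

In practice something like this would also need fencing and ordering guarantees between the openers, which is where the can of worms comes in.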
Robert LeBlanc

Sent from a mobile device, please excuse any typos.

On Jan 6, 2015 5:35 PM, "Josh Durgin" <josh.dur...@inktank.com> wrote:

> On 01/06/2015 04:19 PM, Robert LeBlanc wrote:
>
>> The bitmap certainly sounds like it would help shortcut a lot of the code
>> that Xiaoxi mentions. Is the idea that the client caches the bitmap
>> for the RBD so it knows which OSDs to contact (thus saving a round trip
>> to the OSD), or only for the OSD to know which objects exist on its
>> disk?
>
> It's purely at the rbd level, so librbd caches it and maintains its
> consistency. The idea is that since it's kept consistent, librbd can do
> things like delete exactly the objects that exist without any
> extra communication with the osds. Many things that were
> O(size of image) become O(written objects in image).
>
> The only restriction is that keeping the object map consistent requires
> a single writer, so this does not work for the rare case of e.g. ocfs2
> on top of rbd, where there are multiple clients writing to the same
> rbd image at once.
>
> Josh
>
>> On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin <josh.dur...@inktank.com> wrote:
>>
>>> On 01/06/2015 10:24 AM, Robert LeBlanc wrote:
>>>
>>>> Can't this be done in parallel? If the OSD doesn't have an object then
>>>> it is a noop and should be pretty quick. The number of outstanding
>>>> operations could be limited to 100 or 1000, which would provide a
>>>> balance between speed and performance impact if there is data to be
>>>> trimmed. I'm not a big fan of a "--skip-trimming" option, as there is
>>>> the potential to leave some orphan objects that may not be cleaned up
>>>> correctly.
>>>
>>> Yeah, a --skip-trimming option seems a bit dangerous. This trimming
>>> actually is parallelized (10 ops at once by default, changeable via
>>> --rbd-concurrent-management-ops) since dumpling.
>>>
>>> What will really help without being dangerous is keeping a map of
>>> object existence [1]. This will avoid any unnecessary trimming
>>> automatically, and it should be possible to add to existing images.
>>> It should be in hammer.
>>>
>>> Josh
>>>
>>> [1] https://github.com/ceph/ceph/pull/2700
>>>
>>>> On Tue, Jan 6, 2015 at 8:09 AM, Jake Young <jak3...@gmail.com> wrote:
>>>>
>>>>> On Monday, January 5, 2015, Chen, Xiaoxi <xiaoxi.c...@intel.com> wrote:
>>>>>
>>>>>> When shrinking an RBD, most of the time is spent in
>>>>>> librbd/internal.cc::trim_image(). In this function the client iterates
>>>>>> over all unnecessary objects (whether or not they exist) and deletes them.
>>>>>>
>>>>>> So in this case, when Edwin shrinks his RBD from 650PB to 650GB,
>>>>>> there are [ (650PB * 1024TB/PB * 1024GB/TB - 650GB) * 1024MB/GB ] / 4MB/object
>>>>>> = 174,482,880,000 objects to be deleted. That will definitely take a long
>>>>>> time, since the rbd client needs to send a delete request per object and the
>>>>>> OSD needs to find the object context and delete it (or discover it doesn't
>>>>>> exist at all). The time needed to trim an image is proportional to the size
>>>>>> being trimmed.
>>>>>>
>>>>>> Making another image of the correct size, copying your VM's file system
>>>>>> to the new image, and then deleting the old one will NOT help in general,
>>>>>> because deleting the old volume takes exactly the same time as shrinking:
>>>>>> both need to call trim_image().
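To put numbers on Xiaoxi's estimate above, here is a quick back-of-the-envelope check in plain Python (assuming the default order 22, i.e. 4 MB objects; the variable names are just for illustration):

    old_size_mb = 650 * 1024 * 1024 * 1024   # 650 PB expressed in MB
    new_size_mb = 650 * 1024                 # 650 GB expressed in MB
    object_size_mb = 4                       # default order 22 -> 4 MB objects

    objects_to_trim = (old_size_mb - new_size_mb) // object_size_mb
    print(objects_to_trim)                   # 174482880000, i.e. ~174 billion deletes

Even with 10 deletes in flight (the default --rbd-concurrent-management-ops) and an optimistic 1 ms per delete, that works out to roughly 200 days, which is at least the right order of magnitude for Edwin's report of being stuck at 1% after several days.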
>>>>>> The solution in my mind is that we could provide a "--skip-trimming" flag
>>>>>> to skip the trimming. When the administrator is absolutely sure that no
>>>>>> writes have taken place in the area being shrunk away (that is, no objects
>>>>>> were created in that range), they can use this flag to skip the
>>>>>> time-consuming trimming.
>>>>>>
>>>>>> What do you think?
>>>>>
>>>>> That sounds like a good solution. Like doing "undo grow image".
>>>>>
>>>>>> From: Jake Young [mailto:jak3...@gmail.com]
>>>>>> Sent: Monday, January 5, 2015 9:45 PM
>>>>>> To: Chen, Xiaoxi
>>>>>> Cc: Edwin Peer; ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
>>>>>>
>>>>>> On Sunday, January 4, 2015, Chen, Xiaoxi <xiaoxi.c...@intel.com> wrote:
>>>>>>
>>>>>> You can use rbd info <volume_name> to see the block_name_prefix; object
>>>>>> names have the form <block_name_prefix>.<sequence_number>, so for example
>>>>>> rb.0.ff53.3d1b58ba.00000000e6ad should be the 0xe6ad'th object of the
>>>>>> volume with block_name_prefix rb.0.ff53.3d1b58ba.
>>>>>>
>>>>>> $ rbd info huge
>>>>>> rbd image 'huge':
>>>>>>     size 1024 TB in 268435456 objects
>>>>>>     order 22 (4096 kB objects)
>>>>>>     block_name_prefix: rb.0.8a14.2ae8944a
>>>>>>     format: 1
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Edwin Peer
>>>>>> Sent: Monday, January 5, 2015 3:55 AM
>>>>>> To: ceph-users@lists.ceph.com
>>>>>> Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day
>>>>>>
>>>>>> Also, which rbd objects are of interest?
>>>>>>
>>>>>> <snip>
>>>>>> ganymede ~ # rados -p client-disk-img0 ls | wc -l
>>>>>> 1672636
>>>>>> </snip>
>>>>>>
>>>>>> And, all of them have cryptic names like:
>>>>>>
>>>>>> rb.0.ff53.3d1b58ba.00000000e6ad
>>>>>> rb.0.6d386.1d545c4d.000000011461
>>>>>> rb.0.50703.3804823e.000000001c28
>>>>>> rb.0.1073e.3d1b58ba.00000000b715
>>>>>> rb.0.1d76.2ae8944a.00000000022d
>>>>>>
>>>>>> which seem to bear no resemblance to the actual image names that the rbd
>>>>>> command line tools understand?
>>>>>>
>>>>>> Regards,
>>>>>> Edwin Peer
>>>>>>
>>>>>> On 01/04/2015 08:48 PM, Jake Young wrote:
>>>>>>
>>>>>>> On Sunday, January 4, 2015, Dyweni - Ceph-Users
>>>>>>> <6exbab4fy...@dyweni.com <mailto:6exbab4fy...@dyweni.com>> wrote:
>>>>>>>
>>>>>>>     Hi,
>>>>>>>
>>>>>>>     If it's the only thing in your pool, you could try deleting the
>>>>>>>     pool instead.
>>>>>>>
>>>>>>>     I found that to be faster in my testing; I had created 500TB when
>>>>>>>     I meant to create 500GB.
>>>>>>>
>>>>>>>     Note for the devs: it would be nice if rbd create/resize would
>>>>>>>     accept sizes with units (i.e. MB, GB, TB, PB, etc).
>>>>>>>
>>>>>>>     On 2015-01-04 08:45, Edwin Peer wrote:
>>>>>>>
>>>>>>>         Hi there,
>>>>>>>
>>>>>>>         I did something stupid while growing an rbd image. I accidentally
>>>>>>>         mistook the units of the resize command for bytes instead of
>>>>>>>         megabytes and grew an rbd image to 650PB instead of 650GB. This
>>>>>>>         all happened instantaneously enough, but trying to rectify the
>>>>>>>         mistake is not going nearly as well.
>>>>>>>         <snip>
>>>>>>>         ganymede ~ # rbd resize --size 665600 --allow-shrink client-disk-img0/vol-x318644f-0
>>>>>>>         Resizing image: 1% complete...
>>>>>>>         </snip>
>>>>>>>
>>>>>>>         It took a couple of days before it started showing 1% complete,
>>>>>>>         and it has been stuck on 1% for a couple more. At this rate, I
>>>>>>>         should be able to shrink the image back to the intended size in
>>>>>>>         about 2016.
>>>>>>>
>>>>>>>         Any ideas?
>>>>>>>
>>>>>>>         Regards,
>>>>>>>         Edwin Peer
>>>>>>>
>>>>>>> You can just delete the rbd header. See Sebastien's excellent blog:
>>>>>>>
>>>>>>> http://www.sebastien-han.fr/blog/2013/12/12/rbd-image-bigger-than-your-ceph-cluster/
>>>>>>>
>>>>>>> Jake
>>>>>>
>>>>>> Sorry, I misunderstood.
>>>>>>
>>>>>> The simplest approach to me is to make another image of the correct size
>>>>>> and copy your VM's file system to the new image, then delete the old one.
>>>>>>
>>>>>> The safest thing to do would be to mount the new file system from the VM
>>>>>> and do all the formatting / copying from there (the same way you'd move a
>>>>>> physical server's root disk to a new physical disk).
>>>>>>
>>>>>> I would not attempt to hack the rbd header. You open yourself up to some
>>>>>> unforeseen problems.
>>>>>>
>>>>>> Unless one of the ceph developers can confirm that there is a safe way to
>>>>>> shrink an image, assuming we know that the file system has not grown since
>>>>>> growing the disk.
>>>>>>
>>>>>> Jake
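Tying Xiaoxi's object-naming explanation above to Edwin's shrink, here is a hedged sketch in plain Python, assuming format 1 names of the form <block_name_prefix>.<12-hex-digit index> and 4 MB objects. It only makes the naming scheme concrete, i.e. which existing object names fall beyond the new size and would therefore be trim candidates; it is not how librbd's trim_image() actually works, and the second example name below is made up:

    def objects_beyond(prefix, new_size_mb, names, object_size_mb=4):
        """Yield object names whose index lies past the new image size."""
        first_trimmed = new_size_mb // object_size_mb        # first object index past the new end
        for name in names:
            if not name.startswith(prefix + "."):
                continue                                     # rados ls mixes objects from all images
            index = int(name.rsplit(".", 1)[-1], 16)         # "00000000e6ad" -> 0xe6ad
            if index >= first_trimmed:
                yield name

    names = [
        "rb.0.ff53.3d1b58ba.00000000e6ad",   # 0xe6ad = 59053, inside the new 650 GB image
        "rb.0.ff53.3d1b58ba.000000030000",   # made-up index past 665600 MB / 4 MB = 166400
    ]
    print(list(objects_beyond("rb.0.ff53.3d1b58ba", 665600, names)))
    # -> ['rb.0.ff53.3d1b58ba.000000030000']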