Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> One of Sage's possible plans for Manila integration would use nfs over the
> new Linux vmci sockets transport integration in qemu (below) to access
> Cephfs via an nfs-ganesha server running in the host vm.

Excellent job! Nice to see you were able to add AF_VSOCK support to nfs-ganesha so quickly.

I'm currently working on kernel nfsd support and will send the patches to linux-nfs and CC you.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: chooseleaf may cause some unnecessary pg migrations
I just realized the measurement I mentioned last time is not precise. It should be 'number of changed mappings' instead of 'number of remapped PGs'. For example, [2,1,7] -> [0,7,4] should be regarded differently from [2,1,7] -> [1,7,4], as the first change causes doubled data transfer. Is this the reason for your test results? > -Original Message- > From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com] > Sent: Monday, October 19, 2015 3:34 PM > To: Sage Weil; xusangdi 11976 (RD) > Cc: ceph-devel@vger.kernel.org > Subject: RE: chooseleaf may cause some unnecessary pg migrations > > Thanks sage. > > Retesting by using --test --weight ${rand} 0 , still in 40 OSDs, 10 per host > case: > > New code average on 204.31 while old code average on 202.31 > > > > -Original Message- > > From: Sage Weil [mailto:s...@newdream.net] > > Sent: Monday, October 19, 2015 10:18 AM > > To: Xusangdi > > Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org > > Subject: RE: chooseleaf may cause some unnecessary pg migrations > > > > On Mon, 19 Oct 2015, Xusangdi wrote: > > > > > > > -Original Message- > > > > From: ceph-devel-ow...@vger.kernel.org > > > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Chen, > > > > Xiaoxi > > > > Sent: Monday, October 19, 2015 9:11 AM > > > > To: xusangdi 11976 (RD) > > > > Cc: ceph-devel@vger.kernel.org > > > > Subject: RE: chooseleaf may cause some unnecessary pg migrations > > > > > > > > Sorry but not following... > > > > > > > > > then shut down one or more osds (please don't touch the > > > > > crushmap, just stop the osd service or kill > > > > its process). > > > > > > > > In this case, OSD is only down but not out, but will be marked out > > > > after > > 300s. > > > > > > > > So in what case your patch is helping? > > > > > > > > If you said your patch helps on "down and out" , then my > > > > experiment is exactly the case, > > > > > > > > > > I am afraid it is probably not. 
Could you tell me how you > > > simulated the osd "down and out" situation using crushtool? If it was > > > done by arguments such as '--remove-item' or '--reweight-item', it > > > modified the crushmap and is not what I'm aiming for. > > > > There is a --weight argument (noted in usage near --test, which is the > > only piece that uses it). The crush map is not modified--only the > > weight vector that is passed in when a mapping is calculated (which is > > the equivalent of the in/out state in Ceph's OSDMap). This should let you > > simulate this case. > > > > When I'm debugging/understanding these issues I usually change the > > dprintk #define at the top of crush/mapper.c and use crushtool or > > osdmaptool to calculate a single mapping, comparing the log before and > > after a particular change. > > > > sage - This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
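The metric Xusangdi describes above — counting changed mappings rather than merely whether a PG was remapped — can be sketched as below. This is an illustrative helper, not crushtool's actual accounting: the key idea is that each OSD newly introduced into an acting set must receive a full copy of the PG's data.

```python
def changed_mappings(old, new):
    """Count OSDs in the new acting set that were absent from the old one.

    Each such OSD needs a full copy of the PG's data, so this tracks
    data movement better than a yes/no "was the PG remapped" check.
    """
    return len(set(new) - set(old))

def remapped(old, new):
    """The coarser metric: did the mapping change at all?"""
    return old != new

# The example from the mail: both changes count as "remapped", but
# [2,1,7] -> [0,7,4] introduces two new OSDs (0 and 4) while
# [2,1,7] -> [1,7,4] introduces only one (4), so the first moves
# roughly twice as much data.
print(changed_mappings([2, 1, 7], [0, 7, 4]))  # 2
print(changed_mappings([2, 1, 7], [1, 7, 4]))  # 1
```

Summing `changed_mappings` over all PGs gives the "number of changed mappings" figure the thread argues should be compared across the old and new chooseleaf code.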
RE: chooseleaf may cause some unnecessary pg migrations
Thanks sage. Retesting by using --test --weight ${rand} 0 , still in 40 OSDs, 10 per host case: New code average on 204.31 while old code average on 202.31 > -Original Message- > From: Sage Weil [mailto:s...@newdream.net] > Sent: Monday, October 19, 2015 10:18 AM > To: Xusangdi > Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org > Subject: RE: chooseleaf may cause some unnecessary pg migrations > > On Mon, 19 Oct 2015, Xusangdi wrote: > > > > > -Original Message- > > > From: ceph-devel-ow...@vger.kernel.org > > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Chen, Xiaoxi > > > Sent: Monday, October 19, 2015 9:11 AM > > > To: xusangdi 11976 (RD) > > > Cc: ceph-devel@vger.kernel.org > > > Subject: RE: chooseleaf may cause some unnecessary pg migrations > > > > > > Sorry but not following... > > > > > > > then shut down one or more osds (please don't touch the crushmap, > > > > just stop the osd service or kill > > > its process). > > > > > > In this case, OSD is only down but not out, but will be marked out after > 300s. > > > > > > So in what case your patch is helping? > > > > > > If you said your patch helps on "down and out" , then my > > > experiment is exactly the case, > > > > > > > I am afraid it is probably not. Could you tell me how did you simulate > > the osd "down and out" situation using crushtool? If it was done by > > arguments such as '--remove-item' or 'reweight-item', it modified the > > crushmap and is not what I'm aiming for. > > There is a --weight argument (noted in usage near --test, which is the only > piece that uses it). The crush map is not modified--only the weight vector > that is passed in when a mapping is calculated (which is the equivalent of the > in/out state in Ceph's OSDMap). This should let you simulate this case. 
> > When I'm debugging/understanding these issues I usually change the dprintk > #define at the top of crush/mapper.c and use crushtool or osdmaptool to > calculate a single mapping, comparing the log before and after a particular > change. > > sage
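Sage's suggested workflow — marking an OSD "out" through the weight vector without touching the map — looks roughly like this on the command line. The map filename, layout, and OSD id here are made up for illustration, and flag spellings follow current crushtool documentation, so they may differ slightly by version:

```shell
# Build a test map (40 OSDs, 10 per host, matching the thread), then
# compare mappings with osd.7 in vs. out.
crushtool -o test.map --build --num_osds 40 host straw 10 root straw 0

# --weight <id> 0 only alters the weight vector passed to the mapping
# calculation; the CRUSH map itself is left untouched, which is the
# equivalent of the in/out state in Ceph's OSDMap.
crushtool -i test.map --test --num-rep 3 --show-mappings > before.txt
crushtool -i test.map --test --num-rep 3 --show-mappings --weight 7 0 > after.txt

diff before.txt after.txt | grep -c '^>'   # rough count of changed mappings
```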
Re: [PATCH v2] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()
On 10/18/2015 09:49 PM, Shraddha Barke wrote:
> Use local variable cursor in place of &msg->cursor in
> read_partial_msg_data()
>
> Signed-off-by: Shraddha Barke

This is a pretty minor comment, but the "Net" in your subject line is probably better *not* capitalized.

-Alex

> ---
> Changes in v2-
> Drop incorrect use of cursor
>
> net/ceph/messenger.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..b087edd 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2246,7 +2246,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
>         if (do_datacrc)
>                 crc = con->in_data_crc;
>         while (cursor->resid) {
> -               page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> +               page = ceph_msg_data_next(cursor, &page_offset, &length,
>                                           NULL);
>                 ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
>                 if (ret <= 0) {
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
>
>                 if (do_datacrc)
>                         crc = ceph_crc32c_page(crc, page, page_offset, ret);
> -               (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +               (void) ceph_msg_data_advance(cursor, (size_t)ret);
>         }
>         if (do_datacrc)
>                 con->in_data_crc = crc;
Re: [PATCH v2] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()
On Mon, Oct 19, 2015 at 4:49 AM, Shraddha Barke wrote:
> Use local variable cursor in place of &msg->cursor in
> read_partial_msg_data()
>
> Signed-off-by: Shraddha Barke
> ---
> Changes in v2-
> Drop incorrect use of cursor
>
> net/ceph/messenger.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..b087edd 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2246,7 +2246,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
>         if (do_datacrc)
>                 crc = con->in_data_crc;
>         while (cursor->resid) {
> -               page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> +               page = ceph_msg_data_next(cursor, &page_offset, &length,
>                                           NULL);
>                 ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
>                 if (ret <= 0) {
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
>
>                 if (do_datacrc)
>                         crc = ceph_crc32c_page(crc, page, page_offset, ret);
> -               (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +               (void) ceph_msg_data_advance(cursor, (size_t)ret);
>         }
>         if (do_datacrc)
>                 con->in_data_crc = crc;

This looks good. I'd like you to update this patch to do the same for write_partial_message_data().

Thanks,

Ilya
ceph branch status
-- All Branches --

Adam C. Emerson
  2015-10-16 13:49:09 -0400  wip-cxx11time
  2015-10-17 13:20:15 -0400  wip-cxx11concurrency

Adam Crume
  2014-12-01 20:45:58 -0800  wip-doc-rbd-replay

Alfredo Deza
  2015-03-23 16:39:48 -0400  wip-11212

Alfredo Deza
  2014-07-08 13:58:35 -0400  wip-8679
  2014-09-04 13:58:14 -0400  wip-8366
  2014-10-13 11:10:10 -0400  wip-9730

Ali Maredia
  2015-10-12 14:28:30 -0400  wip-10587-split-servers
  2015-10-13 13:40:41 -0400  wip-infernalis-cmake
  2015-10-15 18:39:23 -0400  wip-cmake

Barbora Ančincová
  2015-10-14 12:36:24 +0200  wip-doc-RGW

Boris Ranto
  2015-09-04 15:19:11 +0200  wip-bash-completion

Casey Bodley
  2015-09-28 17:09:11 -0400  wip-cxx14-test
  2015-09-29 15:18:17 -0400  wip-fio-objectstore

Dan Mick
  2013-07-16 23:00:06 -0700  wip-5634

Danny Al-Gaaf
  2015-04-23 16:32:00 +0200  wip-da-SCA-20150421
  2015-04-23 17:18:57 +0200  wip-nosetests
  2015-04-23 18:20:16 +0200  wip-unify-num_objects_degraded
  2015-09-28 16:05:12 +0200  wip-da-SCA-20150910

David Zafman
  2014-08-29 10:41:23 -0700  wip-libcommon-rebase
  2015-04-24 13:14:23 -0700  wip-cot-giant
  2015-08-04 07:39:00 -0700  wip-12577-hammer
  2015-09-28 11:33:11 -0700  wip-12983
  2015-10-15 12:30:37 -0700  wip-13016
  2015-10-18 14:48:01 -0700  wip-zafman-testing

Dongmao Zhang
  2014-11-14 19:14:34 +0800  thesues-master

Greg Farnum
  2015-04-29 21:44:11 -0700  wip-init-names
  2015-07-16 09:28:24 -0700  hammer-12297
  2015-10-02 13:00:59 -0700  greg-infernalis-lock-testing
  2015-10-02 13:09:05 -0700  greg-infernalis-lock-testing-cacher
  2015-10-07 00:45:24 -0700  greg-infernalis-fs
  2015-10-13 21:41:35 -0700  greg-fs-testing

Greg Farnum
  2014-10-23 13:33:44 -0700  wip-forward-scrub

Guang G Yang
  2015-06-26 20:31:44 +0000  wip-ec-readall
  2015-07-23 16:13:19 +0000  wip-12316

Guang Yang
  2014-08-08 10:41:12 +0000  wip-guangyy-pg-splitting
  2014-09-25 00:47:46 +0000  wip-9008
  2014-09-30 10:36:39 +0000  guangyy-wip-9614

Haomai Wang
  2014-07-27 13:37:49 +0800  wip-flush-set
  2015-04-20 00:47:59 +0800  update-organization
  2015-07-21 19:33:56 +0800  fio-objectstore
  2015-08-26 09:57:27 +0800  wip-recovery-attr

Ilya Dryomov
  2014-09-05 16:15:10 +0400  wip-rbd-notify-errors

Ivo Jimenez
  2015-08-24 23:12:45 -0700  hammer-with-new-workunit-for-wip-12551

Jason Dillaman
  2015-07-31 13:55:23 -0400  wip-12383-next
  2015-08-31 23:17:53 -0400  wip-12698
  2015-09-01 10:17:02 -0400  wip-11287
  2015-10-15 00:15:54 -0400  wip-13274-infernalis

Jenkins
  2015-09-30 12:59:03 -0700  rhcs-v0.94.3-ubuntu

Jenkins
  2014-07-29 05:24:39 -0700  wip-nhm-hang
  2015-02-02 10:35:28 -0800  wip-sam-v0.92
  2015-08-21 12:46:32 -0700  last
  2015-08-21 12:46:32 -0700  loic-v9.0.3
  2015-09-15 10:23:18 -0700  rhcs-v0.80.8
  2015-09-21 16:48:32 -0700  rhcs-v0.94.1-ubuntu

Joao Eduardo Luis
  2014-09-10 09:39:23 +0100  wip-leveldb-get.dumpling

Joao Eduardo Luis
  2014-07-22 15:41:42 +0100  wip-leveldb-misc

Joao Eduardo Luis
  2014-09-02 17:19:52 +0100  wip-leveldb-get
  2014-10-17 16:20:11 +0100  wip-paxos-fix
  2014-10-21 21:32:46 +0100  wip-9675.dumpling
  2015-07-27 21:56:42 +0100  wip-11470.hammer
  2015-09-09 15:45:45 +0100  wip-11786.hammer

Joao Eduardo Luis
  2014-11-17 16:43:53 +0000  wip-mon-osdmap-cleanup
  2014-12-15 16:18:56 +0000  wip-giant-mon-backports
  2014-12-17 17:13:57 +0000  wip-mon-backports.firefly
  2014-12-17 23:15:10 +0000  wip-mon-sync-fix.dumpling
  2015-01-07 23:01:00 +0000  wip-mon-blackhole-mlog-0.87.7
  2015-01-10 02:40:42 +0000  wip-dho-joao
  2015-01-10 02:46:31 +0000  wip-mon-paxos-fix
  2015-01-26 13:00:09 +0000  wip-mon-datahealth-fix
  2015-02-04 22:36:14 +0000  wip-10643
  2015-09-09 15:43:51 +0100  wip-11786.firefly

Joao Eduardo Luis
  2015-05-27 23:48:45 +0100  wip-mon-scrub
  2015-05-29 12:21:43 +0100  wip-11545
  2015-06-05 16:12:57 +0100
Re: MDS stuck in a crash loop
On Wed, Oct 14, 2015 at 9:21 AM, John Spray wrote: > On Mon, Oct 12, 2015 at 3:36 AM, Milosz Tanski wrote: >> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski wrote: >>> On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski wrote: On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski wrote: > On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski wrote: >> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum >> wrote: >>> On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski >>> wrote: About an hour ago my MDSs (primary and follower) started ping-pong crashing with this message. I've spent about 30 minutes looking into it but nothing yet. This is from a 0.94.3 MDS >>> 0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc: In function 'virtual void C_IO_SM_Save::finish(int)' thread 7fd4f52ad700 time 2015-10-11 17:01:23.594089 mds/SessionMap.cc: 120: FAILED assert(r == 0) >>> >>> These "r == 0" asserts pretty much always mean that the MDS did a >>> read or write to RADOS (the OSDs) and got an error of some kind back. >>> (Or in the case of the OSDs, access to the local filesystem returned >>> an error, etc.) I don't think these writes include any safety checks >>> which would let the MDS break it which means that probably the OSD is >>> actually returning an error — odd, but not impossible. >>> >>> Notice that the assert happened in thread 7fd4f52ad700, and look for >>> the stuff in that thread. You should be able to find an OSD op reply >>> (on the SessionMap object) coming in and reporting an error code. >>> -Greg >> >> I only saw two error ops in that whole MDS session. Neither one happened >> on the same thread (7f5ab6000700 in this file). But it looks like the >> only session map error is the -90 "Message too long" one. 
>> >> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v >> 'ondisk = 0' >> -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700 1 -- >> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 >> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0 >> ondisk = -90 ((90) Message too long)) v6 182+0+0 (2955408122 0 0) >> 0x3a55d340 con 0x3d5a3c0 >> -705> 2015-10-11 20:51:11.374132 7f5ab22f4700 1 -- >> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297 >> osd_op_reply(48004 300.e274 [delete] v0'0 uv1349638 ondisk = -2 >> ((2) No such file or directory)) v6 179+0+0 (1182549251 0 0) >> 0x66c5c80 con 0x3d5a7e0 >> >> Any idea what this could be, Greg? > > To follow this up I found this ticket from 9 months ago: > http://tracker.ceph.com/issues/10449 In there Yan says: > > "it's a kernel bug. hang request prevents mds from trimming > completed_requests in sessionmap. there is nothing to do with mds. > (maybe we should add some code to MDS to show warning when this bug > happens)" > > When I was debugging this I saw an OSD (not cephfs client) operation > stuck for a long time along with the MDS error: > > HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow > requests; mds cluster is degraded; mds0: Behind on trimming (709/30) > 1 ops are blocked > 16777.2 sec > 1 ops are blocked > 16777.2 sec on osd.28 > > I did eventually bounce the OSD in question and it hasn't become stuck > since, but the MDS is still eating it every time with the "Message too > long" error on the session map. > > I'm not quite sure where to go from here. First time I had a chance to use the new recover tools. I was able to replay the journal, reset it and then reset the sessionmap. MDS returned back to life and so far everything looks good. Yay. Triggering this bug/issue is a pretty interesting set of steps. >>> >>> Spoke too soon, a missing dir is now causing MDS to restart itself. 
>>> >>> -6> 2015-10-11 22:40:47.300169 7f580c7b9700 5 -- op tracker -- >>> seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request, >>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58 >>> 2015-10-11 21:34:49.224905 RETRY=36) >>> -5> 2015-10-11 22:40:47.300208 7f580c7b9700 5 -- op tracker -- >>> seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request, >>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58 >>> 2015-10-11 21:34:49.224905 RETRY=36) >>> -4> 2015-10-11 22:40:47.300231 7f580c7b9700 5 -- op tracker -- >>> seq: 4, time: 2015-10-11 22:40:47.300231, event: done, op: >>> client_request(client.3597476:21480382 rmdir #100015e0be2/58 >>> 2015-10-11 21:34:49.224905 RETRY=36) >>> -3> 2015-10-11 22:40:47.300284
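For reference, the -90 ("Message too long") replies in the log above are consistent with the OSD rejecting a single writefull that exceeds its maximum write size. The arithmetic is sketched below; the 90 MB default for osd_max_write_size is an assumption about the configuration of this era's clusters, not something stated in the thread:

```python
# The failing op quoted in the log:
#   osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] ... ondisk = -90
write_len = 95168363  # bytes in the single sessionmap writefull

# Assumption: osd_max_write_size defaulted to 90 (MB); the OSD refuses any
# single write larger than this with -EMSGSIZE (-90, "Message too long").
osd_max_write_size_mb = 90
limit = osd_max_write_size_mb << 20  # 94371840 bytes

print(write_len > limit)  # True: the ~90.76 MB sessionmap write is refused
```

This matches the tracker note quoted in the thread: a hung request kept completed_requests from being trimmed, the sessionmap grew past the write limit, and every subsequent save failed until the sessionmap was reset.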
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote: > Hi devs (CC Bruce--here is a use case for vmci sockets transport) > > One of Sage's possible plans for Manila integration would use nfs over the > new Linux vmci sockets transport integration in qemu (below) to access > Cephfs via an nfs-ganesha server running in the host vm. What does "the host vm" mean, and why is this a particularly useful configuration? --b. > > This now experimentally works. > > some notes on running nfs-ganesha over AF_VSOCK: > > 1. need stefan hajnoczi's patches for > * linux kernel (and build w/vhost-vsock support) > * qemu (and build w/vhost-vsock support) > * nfs-utils (in vm guest) > > all linked from https://github.com/stefanha?tab=repositories > > 2. host and vm guest kernels must include vhost-vsock > * host kernel should load vhost-vsock.ko > > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci device, > e.g. > > /opt/qemu-vsock/bin/qemu-system-x86_64 -m 2048 -usb -name vsock1 > --enable-kvm -drive > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive > file=/opt/isos/f22.iso,media=cdrom -net > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 > -parallel none -serial mon:stdio -device > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4 -boot c > > 4. nfs-ganesha (in host) > * need nfs-ganesha and its ntirpc rpc provider with vsock support > https://github.com/linuxbox2/nfs-ganesha (vsock branch) > https://github.com/linuxbox2/ntirpc (vsock branch) > > * configure ganesha w/vsock support > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src > > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block > > 5. mount in guest w/nfs41: > (e.g., in fstab) > 2:// /vsock41 nfs > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 > 0 0 > > If you try this, send feedback. > > Thanks! 
> > Matt > > -- > Matt Benjamin > Red Hat, Inc. > 315 West Huron Street, Suite 140A > Ann Arbor, Michigan 48103 > > http://www.redhat.com/en/technologies/storage > > tel. 734-707-0660 > fax. 734-769-8938 > cel. 734-216-5309 >
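To make the addressing in step 5 concrete: AF_VSOCK endpoints are (CID, port) pairs, which is why the fstab line mounts "2://" (CID 2 is reserved for the host) from a guest started with guest-cid=4. A minimal sketch follows; the NFS port number is the standard one and the helper name is made up for illustration:

```python
import socket

# AF_VSOCK addresses are (CID, port) tuples. CID 2 always names the host,
# and the guest in the example was started with guest-cid=4 -- hence the
# fstab source "2://" and mount option clientaddr=4.
VMADDR_CID_HOST = 2   # socket.VMADDR_CID_HOST where the Python build exposes it
GUEST_CID = 4         # matches -device vhost-vsock-pci,...,guest-cid=4
NFS_PORT = 2049       # standard NFS port

def nfs_server_addr():
    # A guest-side client would connect roughly like this:
    #   s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    #   s.connect((VMADDR_CID_HOST, NFS_PORT))
    return (VMADDR_CID_HOST, NFS_PORT)

print(nfs_server_addr())
```

There is no IP routing involved: the hypervisor delivers vsock traffic directly between guest CIDs and the host, which is the source of the configuration-simplicity and isolation claims made later in the thread.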
Re: MDS stuck in a crash loop
On Wed, Oct 14, 2015 at 12:46 AM, Gregory Farnum wrote: > On Sun, Oct 11, 2015 at 7:36 PM, Milosz Tanski wrote: >> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski wrote: >>> On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski wrote: On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski wrote: > On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski wrote: >> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum >> wrote: >>> On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski >>> wrote: About an hour ago my MDSs (primary and follower) started ping-pong crashing with this message. I've spent about 30 minutes looking into it but nothing yet. This is from a 0.94.3 MDS >>> 0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc: In function 'virtual void C_IO_SM_Save::finish(int)' thread 7fd4f52ad700 time 2015-10-11 17:01:23.594089 mds/SessionMap.cc: 120: FAILED assert(r == 0) >>> >>> These "r == 0" asserts pretty much always mean that the MDS did a >>> read or write to RADOS (the OSDs) and got an error of some kind back. >>> (Or in the case of the OSDs, access to the local filesystem returned >>> an error, etc.) I don't think these writes include any safety checks >>> which would let the MDS break it which means that probably the OSD is >>> actually returning an error — odd, but not impossible. >>> >>> Notice that the assert happened in thread 7fd4f52ad700, and look for >>> the stuff in that thread. You should be able to find an OSD op reply >>> (on the SessionMap object) coming in and reporting an error code. >>> -Greg >> >> I only saw two error ops in that whole MDS session. Neither one happened >> on the same thread (7f5ab6000700 in this file). But it looks like the >> only session map error is the -90 "Message too long" one. 
>> >> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v >> 'ondisk = 0' >> -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700 1 -- >> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 >> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0 >> ondisk = -90 ((90) Message too long)) v6 182+0+0 (2955408122 0 0) >> 0x3a55d340 con 0x3d5a3c0 >> -705> 2015-10-11 20:51:11.374132 7f5ab22f4700 1 -- >> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297 >> osd_op_reply(48004 300.e274 [delete] v0'0 uv1349638 ondisk = -2 >> ((2) No such file or directory)) v6 179+0+0 (1182549251 0 0) >> 0x66c5c80 con 0x3d5a7e0 >> >> Any idea what this could be, Greg? > > To follow this up I found this ticket from 9 months ago: > http://tracker.ceph.com/issues/10449 In there Yan says: > > "it's a kernel bug. hang request prevents mds from trimming > completed_requests in sessionmap. there is nothing to do with mds. > (maybe we should add some code to MDS to show warning when this bug > happens)" > > When I was debugging this I saw an OSD (not cephfs client) operation > stuck for a long time along with the MDS error: > > HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow > requests; mds cluster is degraded; mds0: Behind on trimming (709/30) > 1 ops are blocked > 16777.2 sec > 1 ops are blocked > 16777.2 sec on osd.28 > > I did eventually bounce the OSD in question and it hasn't become stuck > since, but the MDS is still eating it every time with the "Message too > long" error on the session map. > > I'm not quite sure where to go from here. First time I had a chance to use the new recover tools. I was able to replay the journal, reset it and then reset the sessionmap. MDS returned back to life and so far everything looks good. Yay. Triggering this bug/issue is a pretty interesting set of steps. >>> >>> Spoke too soon, a missing dir is now causing MDS to restart itself. 
>>> >>> -6> 2015-10-11 22:40:47.300169 7f580c7b9700 5 -- op tracker -- >>> seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request, >>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58 >>> 2015-10-11 21:34:49.224905 RETRY=36) >>> -5> 2015-10-11 22:40:47.300208 7f580c7b9700 5 -- op tracker -- >>> seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request, >>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58 >>> 2015-10-11 21:34:49.224905 RETRY=36) >>> -4> 2015-10-11 22:40:47.300231 7f580c7b9700 5 -- op tracker -- >>> seq: 4, time: 2015-10-11 22:40:47.300231, event: done, op: >>> client_request(client.3597476:21480382 rmdir #100015e0be2/58 >>> 2015-10-11 21:34:49.224905 RETRY=36) >>> -3> 2015-10-11 22:40:47.300284
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
Hi Bruce, -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-707-0660 fax. 734-769-8938 cel. 734-216-5309 - Original Message - > From: "J. Bruce Fields"> To: "Matt Benjamin" > Cc: "Ceph Development" , "Stefan Hajnoczi" > , "Sage Weil" > > Sent: Monday, October 19, 2015 11:58:45 AM > Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha) > > On Mon, Oct 19, 2015 at 11:49:15AM -0400, Matt Benjamin wrote: > > - Original Message - > > > From: "J. Bruce Fields" > ... > > > > > > On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote: > > > > Hi devs (CC Bruce--here is a use case for vmci sockets transport) > > > > > > > > One of Sage's possible plans for Manilla integration would use nfs over > > > > the > > > > new Linux vmci sockets transport integration in qemu (below) to access > > > > Cephfs via an nfs-ganesha server running in the host vm. > > > > > > What does "the host vm" mean, and why is this a particularly useful > > > configuration? > > > > Sorry, I should say, "the vm host." > > Got it, thanks! > > > I think the claimed utility here is (at least) three-fold: > > > > 1. simplified configuration on host and guests > > 2. some claim to improved security through isolation > > So why is it especially interesting to put Ceph inside the VM and > Ganesha outside? Oh, sorry. Here Ceph (or Gluster, or, whatever underlying FS provider) is conceptually outside the vm complex altogether, Ganesha is re-exporting on the vm host, and guests access the namespace using NFS(v41). Regards, Matt > > > 3. some expectation of improved latency/performance wrt TCP > > > > Stefan sent a link to a set of slides with his original patches. Did you > > get a chance to read through those? > > > > [1] > > http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf > > Yep, thanks.--b. > > > > > Regards, > > > > Matt > > > > > > > > --b. 
> > > > > > > > > > > This now experimentally works. > > > > > > > > some notes on running nfs-ganesha over AF_VSOCK: > > > > > > > > 1. need stefan hajnoczi's patches for > > > > * linux kernel (and build w/vhost-vsock support > > > > * qemu (and build w/vhost-vsock support) > > > > * nfs-utils (in vm guest) > > > > > > > > all linked from https://github.com/stefanha?tab=repositories > > > > > > > > 2. host and vm guest kernels must include vhost-vsock > > > > * host kernel should load vhost-vsock.ko > > > > > > > > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci > > > > device, e.g > > > > > > > > /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1 > > > > --enable-kvm -drive > > > > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive > > > > file=/opt/isos/f22.iso,media=cdrom -net > > > > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 > > > > -parallel none -serial mon:stdio -device > > > > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4 -boot c > > > > > > > > 4. nfs-gansha (in host) > > > > * need nfs-ganesha and its ntirpc rpc provider with vsock support > > > > https://github.com/linuxbox2/nfs-ganesha (vsock branch) > > > > https://github.com/linuxbox2/ntirpc (vsock branch) > > > > > > > > * configure ganesha w/vsock support > > > > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON > > > > -DUSE_VSOCK > > > > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src > > > > > > > > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block > > > > > > > > 5. mount in guest w/nfs41: > > > > (e.g., in fstab) > > > > 2:// /vsock41 nfs > > > > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 > > > > 0 0 > > > > > > > > If you try this, send feedback. > > > > > > > > Thanks! > > > > > > > > Matt > > > > > > > > -- > > > > Matt Benjamin > > > > Red Hat, Inc. 
> > > > 315 West Huron Street, Suite 140A > > > > Ann Arbor, Michigan 48103 > > > > > > > > http://www.redhat.com/en/technologies/storage > > > > > > > > tel. 734-707-0660 > > > > fax. 734-769-8938 > > > > cel. 734-216-5309 > > > >
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Fri, Oct 16, 2015 at 10:08 PM, Matt Benjamin wrote: > Hi devs (CC Bruce--here is a use case for vmci sockets transport) > > One of Sage's possible plans for Manila integration would use nfs over the > new Linux vmci sockets transport integration in qemu (below) to access > Cephfs via an nfs-ganesha server running in the host vm. > > This now experimentally works. Very cool! Thank you for the detailed instructions, I look forward to trying this out soon. John > some notes on running nfs-ganesha over AF_VSOCK: > > 1. need stefan hajnoczi's patches for > * linux kernel (and build w/vhost-vsock support) > * qemu (and build w/vhost-vsock support) > * nfs-utils (in vm guest) > > all linked from https://github.com/stefanha?tab=repositories > > 2. host and vm guest kernels must include vhost-vsock > * host kernel should load vhost-vsock.ko > > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci device, > e.g. > > /opt/qemu-vsock/bin/qemu-system-x86_64 -m 2048 -usb -name vsock1 > --enable-kvm -drive > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive > file=/opt/isos/f22.iso,media=cdrom -net > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 > -parallel none -serial mon:stdio -device > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4 -boot c > > 4. nfs-ganesha (in host) > * need nfs-ganesha and its ntirpc rpc provider with vsock support > https://github.com/linuxbox2/nfs-ganesha (vsock branch) > https://github.com/linuxbox2/ntirpc (vsock branch) > > * configure ganesha w/vsock support > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src > > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block > > 5. mount in guest w/nfs41: > (e.g., in fstab) > 2:// /vsock41 nfs > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 > 0 0 > > If you try this, send feedback. > > Thanks! 
> > Matt > > -- > Matt Benjamin > Red Hat, Inc. > 315 West Huron Street, Suite 140A > Ann Arbor, Michigan 48103 > > http://www.redhat.com/en/technologies/storage > > tel. 734-707-0660 > fax. 734-769-8938 > cel. 734-216-5309 >
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
Hi Bruce, -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-707-0660 fax. 734-769-8938 cel. 734-216-5309 - Original Message - > From: "J. Bruce Fields"> To: "Matt Benjamin" > Cc: "Ceph Development" , "Stefan Hajnoczi" > , "Sage Weil" > > Sent: Monday, October 19, 2015 11:13:52 AM > Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha) > > On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote: > > Hi devs (CC Bruce--here is a use case for vmci sockets transport) > > > > One of Sage's possible plans for Manilla integration would use nfs over the > > new Linux vmci sockets transport integration in qemu (below) to access > > Cephfs via an nfs-ganesha server running in the host vm. > > What does "the host vm" mean, and why is this a particularly useful > configuration? Sorry, I should say, "the vm host." I think the claimed utility here is (at least) three-fold: 1. simplified configuration on host and guests 2. some claim to improved security through isolation 3. some expectation of improved latency/performance wrt TCP Stefan sent a link to a set of slides with his original patches. Did you get a chance to read through those? [1] http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf Regards, Matt > > --b. > > > > > This now experimentally works. > > > > some notes on running nfs-ganesha over AF_VSOCK: > > > > 1. need stefan hajnoczi's patches for > > * linux kernel (and build w/vhost-vsock support > > * qemu (and build w/vhost-vsock support) > > * nfs-utils (in vm guest) > > > > all linked from https://github.com/stefanha?tab=repositories > > > > 2. host and vm guest kernels must include vhost-vsock > > * host kernel should load vhost-vsock.ko > > > > 3. 
start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci > > device, e.g., > > > > /opt/qemu-vsock/bin/qemu-system-x86_64 -m 2048 -usb -name vsock1 > > --enable-kvm -drive > > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive > > file=/opt/isos/f22.iso,media=cdrom -net > > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 > > -parallel none -serial mon:stdio -device > > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4 -boot c > > > > 4. nfs-ganesha (in host) > > * need nfs-ganesha and its ntirpc rpc provider with vsock support > > https://github.com/linuxbox2/nfs-ganesha (vsock branch) > > https://github.com/linuxbox2/ntirpc (vsock branch) > > > > * configure ganesha w/vsock support > > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK > > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src > > > > in ganesha.conf, add "nfsvsock" to the Protocols list in the EXPORT block > > > > 5. mount in guest w/nfs41: > > (e.g., in fstab) > > 2:// /vsock41 nfs > > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 > > 0 0 > > > > If you try this, send feedback. > > > > Thanks! > > > > Matt > > > > -- > > Matt Benjamin > > Red Hat, Inc. > > 315 West Huron Street, Suite 140A > > Ann Arbor, Michigan 48103 > > > > http://www.redhat.com/en/technologies/storage > > > > tel. 734-707-0660 > > fax. 734-769-8938 > > cel. 734-216-5309
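For readers unpacking the mount example: the long fstab option string combines standard NFS options with the vsock-specific ones (`proto=vsock`, and `clientaddr` carrying the guest's vsock context ID, matching the `guest-cid=4` passed to qemu). A small illustrative helper to compose it (hypothetical, not part of nfs-utils):

```python
# Hypothetical helper (NOT part of nfs-utils): composes the option string
# used in the fstab example above for an NFS-over-vsock mount.
def vsock_nfs_options(nfsvers="4.1", client_cid=4, rsize=1048576, wsize=1048576):
    opts = [
        "noauto", "soft",
        "nfsvers=%s" % nfsvers,
        "sec=sys",
        "proto=vsock",                  # transport from the patched nfs-utils
        "clientaddr=%d" % client_cid,   # the guest's vsock context ID (CID)
        "rsize=%d" % rsize,
        "wsize=%d" % wsize,
    ]
    return ",".join(opts)

print(vsock_nfs_options())
```

With the defaults this reproduces the option string in the fstab line above; the `2://` device field addresses the server by vsock CID (CID 2 conventionally refers to the host).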
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Mon, Oct 19, 2015 at 11:49:15AM -0400, Matt Benjamin wrote: > - Original Message - > > From: "J. Bruce Fields"... > > > > On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote: > > > Hi devs (CC Bruce--here is a use case for vmci sockets transport) > > > > > > One of Sage's possible plans for Manilla integration would use nfs over > > > the > > > new Linux vmci sockets transport integration in qemu (below) to access > > > Cephfs via an nfs-ganesha server running in the host vm. > > > > What does "the host vm" mean, and why is this a particularly useful > > configuration? > > Sorry, I should say, "the vm host."

Got it, thanks!

> I think the claimed utility here is (at least) three-fold: > > 1. simplified configuration on host and guests > 2. some claim to improved security through isolation

So why is it especially interesting to put Ceph inside the VM and Ganesha outside?

> 3. some expectation of improved latency/performance wrt TCP > > Stefan sent a link to a set of slides with his original patches. Did you get > a chance to read through those? > > [1] > http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf

Yep, thanks.

--b.
RE: newstore direction
Sage, I fully support that. If we want to saturate SSDs, we need to get rid of this filesystem overhead (which I am in the process of measuring). Also, it will be good if we can eliminate the dependency on the k/v dbs (for storing allocators and all). The reason is the unknown write amplification they cause.

Thanks & Regards
Somnath

-Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 12:49 PM To: ceph-devel@vger.kernel.org Subject: newstore direction

The current design is based on two simple ideas:

1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.)

2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2. A few things:

- We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata, here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal).

- On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...

- ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.

- XFS is (probably) never going to give us data checksums, which we want desperately.

But what's the alternative?
My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.

Wins:

- 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).

- No concern about mtime getting in the way

- Faster reads (no fs lookup)

- Similarly sized metadata for most objects. If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.

Problems:

- We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won't matter. But what happens when we are storing gobs of rgw index data or cephfs metadata? Suddenly we are pulling storage out of a different pool and those aren't currently fungible.

- We have to write and maintain an allocator. I'm still optimistic this can be reasonably simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonably sized). For disk we may need to be moderately clever.

- We'll need a fsck to ensure our internal metadata is consistent. The good news is it'll just need to validate what we have stored in the kv store.

Other thoughts:

- We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.

- Rocksdb can push colder data to a second directory, so we could have a fast ssd primary area (for wal and most metadata) and a second hdd directory for stuff it has to push off. Then have a conservative amount of file space on the hdd. If our block fills up, use the existing file mechanism to put data there too.
(But then we have to maintain both the current kv + file approach and not go all-in on kv + block.) Thoughts? sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored
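The IO accounting in the post above can be made concrete with a toy model; this is a sketch of the argument, not Ceph code, and the counts are taken straight from the text:

```python
# Toy model of the synchronous-IO counts argued in the thread above.
# Illustrative only; the numbers come from the quoted message.

def file_backed_write_ios(kv_commit_ios=1):
    """Newstore on a filesystem: data write + fs journal + kv commit."""
    return 1 + 1 + kv_commit_ios

def raw_block_write_ios(overwrite=False):
    """Proposed raw-block design."""
    if overwrite:
        return 1  # write-ahead log in the kv journal now, overwrite async
    return 2      # data to unused space + one kv transaction commit

print(file_backed_write_ios())                 # "at least 3 IOs"
print(file_backed_write_ios(kv_commit_ios=2))  # kv commit currently 2-3 -> "4+"
print(raw_block_write_ios())                   # 2
```

Halving the synchronous IOs per write is the headline win; the write-amplification of the kv backend itself (Somnath's concern) sits on top of these counts.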
Re: newstore direction
I think there is a lot that can be gained by Ceph managing a raw block device. As I mentioned on ceph-users, I've given this some thought and a lot of optimizations could be done that are conducive to storing objects. I didn't think, however, to bypass VFS altogether by opening the raw device directly, but this would make things simpler as you don't have to program things for VFS that don't make sense.

Some of my thoughts were to employ a hashing algorithm for inode lookup (CRUSH-like). Is there a good use case for listing a directory? We may need to keep a list for deletion, but there may be a better way to handle this. Is there a need to do snapshots at the block layer if operations can be atomic? Is there a real advantage to having an allocation as small as 4K, or does it make sense to use something like 512K?

I'm interested in how this might pan out.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
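Robert's 4K-vs-512K question is partly one of internal fragmentation: space wasted in the last, partially filled block of each object. A quick back-of-the-envelope sketch (the object sizes used are arbitrary assumptions, not measured Ceph data):

```python
# Space lost to internal fragmentation in the last partially-filled block.
def wasted_bytes(object_size, block_size):
    rem = object_size % block_size
    return 0 if rem == 0 else block_size - rem

# e.g. a 4 MiB chunk plus a 100-byte tail (arbitrary illustrative size)
obj = 4 * 1024 * 1024 + 100
print(wasted_bytes(obj, 4096))        # 3996 bytes wasted with 4K blocks
print(wasted_bytes(obj, 512 * 1024))  # 524188 bytes wasted with 512K blocks
```

Larger blocks shrink the allocator's metadata but waste more space per small or oddly sized object, so the right answer depends on the object-size distribution of the workload.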
RE: newstore direction
On Mon, 19 Oct 2015, Somnath Roy wrote: > Sage, > I fully support that. If we want to saturate SSDs , we need to get rid > of this filesystem overhead (which I am in process of measuring). Also, > it will be good if we can eliminate the dependency on the k/v dbs (for > storing allocators and all). The reason is the unknown write amps they > causes.

My hope is to keep behind the KeyValueDB interface (and/or change it as appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).

sage

> > Thanks & Regards > Somnath > > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 12:49 PM > To: ceph-devel@vger.kernel.org > Subject: newstore direction > > The current design is based on two simple ideas: > > 1) a key/value interface is better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, write-ahead > logging, overlay data, etc.) > > 2) a file system is well suited for storage object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb changes > land... the kv commit is currently 2-3). So two people are managing > metadata, here: the fs managing the file metadata (with its own > journal) and the kv backend (with its journal). > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but at > a minimum it is a couple btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard...
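Sage's point about staying behind the KeyValueDB interface can be sketched as follows. This is illustrative Python, not Ceph's actual C++ KeyValueDB API; the point is that swapping rocksdb for, say, a btree-based store touches one line:

```python
# Sketch of a pluggable key/value backend behind one interface, in the
# spirit of the reply above; NOT the real Ceph KeyValueDB class.
class KeyValueDB:
    def get(self, key):
        raise NotImplementedError
    def submit_transaction(self, ops):
        raise NotImplementedError

class MemDB(KeyValueDB):
    """Stand-in for rocksdb, or a btree-based store for high-end flash."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def submit_transaction(self, ops):
        for op in ops:                 # apply the whole batch together
            if op[0] == "set":
                self.data[op[1]] = op[2]
            elif op[0] == "rm":
                self.data.pop(op[1], None)

db = MemDB()                           # the only backend-specific line
db.submit_transaction([("set", "obj1.extent0", (0, 4096)),
                       ("rm", "stale-key")])
print(db.get("obj1.extent0"))
```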
v0.94.4 Hammer released
This Hammer point release fixes several important bugs in Hammer, as well as fixing interoperability issues that are required before an upgrade to Infernalis. That is, all users of earlier versions of Hammer or any version of Firefly will first need to upgrade to hammer v0.94.4 or later before upgrading to Infernalis (or future releases). All v0.94.x Hammer users are strongly encouraged to upgrade.

Changes
-------

* build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong (#12166, Nathan Cutler)
* build/ops: ceph.spec.in: ceph-common needs python-argparse on older distros, but doesn't require it (#12034, Nathan Cutler)
* build/ops: ceph.spec.in: radosgw requires apache for SUSE only -- makes no sense (#12358, Nathan Cutler)
* build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized (#11991, Nathan Cutler)
* build/ops: ceph.spec.in: rpm: not possible to turn off Java (#11992, Owen Synge)
* build/ops: ceph.spec.in: running fdupes unnecessarily (#12301, Nathan Cutler)
* build/ops: ceph.spec.in: snappy-devel for all supported distros (#12361, Nathan Cutler)
* build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel (#11629, Nathan Cutler)
* build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build (#12351, Nathan Cutler)
* build/ops: error in ext_mime_map_init() when /etc/mime.types is missing (#11864, Ken Dreyer)
* build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s) (#11798, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances running (#10927, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances running (#11140, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances running (#11686, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances running (#12407, Sage Weil)
* cli: ceph: cli throws exception on unrecognized errno (#11354, Kefu Chai)
* cli: ceph tell: broken error message / misleading hinting (#11101, Kefu Chai)
* common: arm: all programs that link to librados2 hang forever on startup (#12505, Boris Ranto)
* common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
* common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer objects (#13070, Sage Weil)
* common: do not insert emtpy ptr when rebuild emtpy bufferlist (#12775, Xinze Chi)
* common: [ FAILED ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
* common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
* common: Memory leak in Mutex.cc, pthread_mutexattr_init without pthread_mutexattr_destroy (#11762, Ketor Meng)
* common: object_map_update fails with -EINVAL return code (#12611, Jason Dillaman)
* common: Pipe: Drop connect_seq increase line (#13093, Haomai Wang)
* common: recursive lock of md_config_t (0) (#12614, Josh Durgin)
* crush: ceph osd crush reweight-subtree does not reweight parent node (#11855, Sage Weil)
* doc: update docs to point to download.ceph.com (#13162, Alfredo Deza)
* fs: ceph-fuse 0.94.2-1trusty segfaults / aborts (#12297, Greg Farnum)
* fs: segfault launching ceph-fuse with bad --name (#12417, John Spray)
* librados: Change radosgw pools default crush ruleset (#11640, Yuan Zhou)
* librbd: correct issues discovered via lockdep / helgrind (#12345, Jason Dillaman)
* librbd: Crash during TestInternal.MultipleResize (#12664, Jason Dillaman)
* librbd: deadlock during cooperative exclusive lock transition (#11537, Jason Dillaman)
* librbd: Possible crash while concurrently writing and shrinking an image (#11743, Jason Dillaman)
* mon: add a cache layer over MonitorDBStore (#12638, Kefu Chai)
* mon: fix crush testing for new pools (#13400, Sage Weil)
* mon: get pools health'info have error (#12402, renhwztetecs)
* mon: implicit erasure code crush ruleset is not validated (#11814, Loic Dachary)
* mon: PaxosService: call post_refresh() instead of post_paxos_update() (#11470, Joao Eduardo Luis)
* mon: pgmonitor: wrong 'at/near target max' reporting (#12401, huangjun)
* mon: register_new_pgs() should check ruleno instead of its index (#12210, Xinze Chi)
* mon: Show osd as NONE in ceph osd map output (#11820, Shylesh Kumar)
* mon: the output is wrong when runing ceph osd reweight (#12251, Joao Eduardo Luis)
* osd: allow peek_map_epoch to return an error (#13060, Sage Weil)
* osd: cache agent is idle although one object is left in the cache (#12673, Loic Dachary)
* osd: copy-from doesn't preserve truncate_{seq,size} (#12551, Samuel Just)
* osd: crash creating/deleting pools (#12429, John Spray)
* osd: fix repair when recorded digest is wrong (#12577, Sage Weil)
* osd: include/ceph_features: define HAMMER_0_94_4 feature (#13026, Sage Weil)
* osd: is_new_interval() fixes (#10399, Jason Dillaman)
* osd: is_new_interval() fixes (#11771, Jason Dillaman)
* osd: long standing slow requests: connection->session->waiting_for_map->connection ref cycle (#12338, Samuel Just)
* osd: Mutex Assert from
Re: newstore direction
On 10/19/2015 09:49 PM, Sage Weil wrote:
> [...]
> Other thoughts:
>
> - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.

I've been using bcache for a while now in production and that helped a lot. Intel SSDs with GPT. First few partitions as journals and then one big partition for bcache.

/dev/bcache0  2.8T  264G  2.5T  10%  /var/lib/ceph/osd/ceph-60
/dev/bcache1  2.8T  317G  2.5T  12%  /var/lib/ceph/osd/ceph-61
/dev/bcache2  2.8T  303G  2.5T  11%  /var/lib/ceph/osd/ceph-62
/dev/bcache3  2.8T  316G  2.5T  12%  /var/lib/ceph/osd/ceph-63
/dev/bcache4  2.8T  167G  2.6T   6%  /var/lib/ceph/osd/ceph-64
/dev/bcache5  2.8T  295G  2.5T  11%  /var/lib/ceph/osd/ceph-65

The maintainers of bcache also presented bcachefs: https://lkml.org/lkml/2015/8/21/22 "checksumming, compression: currently only zlib is supported for compression, and for checksumming there's crc32c and a 64 bit checksum." Wouldn't that be something that could be leveraged? Consuming a raw block device seems like re-inventing the wheel to me. I might be wrong though. I have no idea how stable bcachefs is, but it might be worth looking into.
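The checksumming theme in this thread (XFS won't provide it; bcachefs would) is cheap to do at the object-store layer once the store owns the block device. A sketch of per-extent checksums verified on read, using `zlib.crc32` as a stand-in (bcachefs offers crc32c or a 64-bit checksum); illustrative only, not Ceph or bcachefs code:

```python
import zlib

# Sketch: keep a per-extent checksum in the metadata map and verify on read.
# zlib.crc32 stands in for crc32c here.
def write_extent(dev, meta, offset, data):
    dev[offset:offset + len(data)] = data
    meta[offset] = zlib.crc32(data)        # checksum stored with the extent

def read_extent(dev, meta, offset, length):
    data = bytes(dev[offset:offset + length])
    if zlib.crc32(data) != meta[offset]:
        raise IOError("checksum mismatch: extent %#x is corrupt" % offset)
    return data

dev, meta = bytearray(8192), {}            # toy "block device" + metadata
write_extent(dev, meta, 0, b"hello object data")
print(read_extent(dev, meta, 0, 17))
```

Silent corruption on the device then surfaces as an explicit read error instead of bad data handed back to the client.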
newstore direction
The current design is based on two simple ideas:

1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.)

2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2. A few things:

- We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata, here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal).

- On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...

- ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze.

- XFS is (probably) never going to give us data checksums, which we want desperately.

But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata.

Wins:

- 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).

- No concern about mtime getting in the way

- Faster reads (no fs lookup)

- Similarly sized metadata for most objects.
If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.

Problems:

- We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won't matter. But what happens when we are storing gobs of rgw index data or cephfs metadata? Suddenly we are pulling storage out of a different pool and those aren't currently fungible.

- We have to write and maintain an allocator. I'm still optimistic this can be reasonably simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonably sized). For disk we may need to be moderately clever.

- We'll need a fsck to ensure our internal metadata is consistent. The good news is it'll just need to validate what we have stored in the kv store.

Other thoughts:

- We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.

- Rocksdb can push colder data to a second directory, so we could have a fast ssd primary area (for wal and most metadata) and a second hdd directory for stuff it has to push off. Then have a conservative amount of file space on the hdd. If our block fills up, use the existing file mechanism to put data there too. (But then we have to maintain both the current kv + file approach and not go all-in on kv + block.)

Thoughts?
sage
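The "write and maintain an allocator" item is the crux of the proposal. A minimal first-fit extent allocator with coalescing on free, as a sketch of how simple the flash case could plausibly be (not a proposal for the real implementation):

```python
# Minimal first-fit extent allocator sketch: free space is a sorted list of
# (offset, length) extents; adjacent extents are coalesced on release.
class Allocator:
    def __init__(self, size):
        self.free = [(0, size)]

    def alloc(self, length):
        for i, (off, ln) in enumerate(self.free):
            if ln >= length:                     # first fit
                if ln == length:
                    self.free.pop(i)
                else:
                    self.free[i] = (off + length, ln - length)
                return off
        raise MemoryError("no extent large enough")

    def release(self, off, length):
        self.free.append((off, length))
        self.free.sort()
        merged = []
        for o, l in self.free:                   # coalesce adjacent extents
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + l)
            else:
                merged.append((o, l))
        self.free = merged

a = Allocator(1 << 20)                           # 1 MiB toy device
x = a.alloc(4096)
y = a.alloc(4096)
a.release(x, 4096)
a.release(y, 4096)
print(a.free)                                    # back to one extent
```

The free list itself would live in the kv store alongside the rest of the metadata, which is what makes fsck "just validate the kv store" as the post says.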
RE: newstore direction
Hi Sage, If we are managing the raw device, does it make sense to have a key/value store manage the whole space? Keeping the allocator's metadata separately might cause other consistency problems: getting an fsck for that implementation could be tougher, we might have to do strict crc computations on the data, and we would have to manage the sanity of the DB itself. If we can have a common mechanism that keeps data and metadata in the same key/value store, it will improve performance. We have integrated a custom-made key value store which works on the raw device as the backend, and we have observed better bw utilization and iops; reads/writes can be faster and no fs lookup is needed. We have tools like fsck to take care of the consistency of the DB. Couple of comments inline. Thanks, Varada > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Tuesday, October 20, 2015 1:19 AM > To: ceph-devel@vger.kernel.org > Subject: newstore direction > > The current design is based on two simple ideas: > > 1) a key/value interface is better way to manage all of our internal metadata > (object metadata, attrs, layout, collection membership, write-ahead logging, > overlay data, etc.) > > 2) a file system is well suited for storage object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one > for the kv txn to commit (at least once my rocksdb changes land... the kv > commit is currently 2-3). So two people are managing metadata, here: the fs > managing the file metadata (with its own > journal) and the kv backend (with its journal). > > - On read we have to open files by name, which means traversing the fs > namespace.
Newstore tries to keep it as flat and simple as possible, but at a > minimum it is a couple btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard... > > - ...and file systems insist on updating mtime on writes, even when it is a > overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. > > - XFS is (probably) never going going to give us data checksums, which we > want desperately. > > But what's the alternative? My thought is to just bite the bullet and consume > a raw block device directly. Write an allocator, hopefully keep it pretty > simple, and manage it in kv store along with all of our other metadata. > > Wins: > > - 2 IOs for most: one to write the data to unused space in the block device, > one to commit our transaction (vs 4+ before). For overwrites, we'd have one > io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ > before). > > - No concern about mtime getting in the way > > - Faster reads (no fs lookup) > > - Similarly sized metadata for most objects. If we assume most objects are > not fragmented, then the metadata to store the block offsets is about the > same size as the metadata to store the filenames we have now. > > Problems: > > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of rgw > index data or cephfs metadata? Suddenly we are pulling storage out of a > different pool and those aren't currently fungible. [Varada Kari] Ideally if we can manage the raw device as key value store indirection to manage metadata and data both, we can benefit with faster lookups and writes (if the KVStore supports a batch atomic transactional write). 
SSDs might suffer more write amplification if we put only the metadata on them; if this part (the KV store dealing with the raw device) can also handle the small writes, we can avoid that write amplification and get better throughput from the device. > - We have to write and maintain an allocator. I'm still optimistic this can be > reasonbly simple, especially for the flash case (where fragmentation isn't > such an issue as long as our blocks are reasonbly sized). For disk we may > beed to be moderately clever. > [Varada Kari] Yes. If the writes are aligned to the flash programmable page size, that will not cause any issues. But writes smaller than the programmable page size will cause internal fragmentation. Repeated overwrites to the same page will cause more write amplification. > - We'll need a fsck to ensure our internal metadata is consistent. The good > news is it'll just need to validate what we have stored in the kv store. > > Other thoughts: > > - We might want to consider whether dm-thin or bcache or other block > layers
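Varada's point about sub-page writes can be made concrete with a little arithmetic: flash is programmed a page at a time, so an overwrite smaller than the programmable page size still rewrites a whole page. The 16KB page size below is an assumed, illustrative number, not a measurement of any particular device:

```python
# Back-of-the-envelope write amplification for sub-page overwrites.
# Numbers are hypothetical; real devices also add GC overhead on top.

PAGE = 16 * 1024                      # assumed programmable page size

def write_amplification(write_size, page_size=PAGE):
    pages_touched = -(-write_size // page_size)     # ceil division
    return (pages_touched * page_size) / write_size

for size in (4096, 8192, 16384, 65536):
    print(f"{size:>6}-byte write -> WA ~ {write_amplification(size):.1f}x")
# A 4KB overwrite on a 16KB page costs ~4x; page-aligned writes cost 1x.
```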
RE: newstore direction
Hi Sage and Somnath, In my humble opinion, there is a more aggressive solution than a raw-block-device-based key/value store as the backend for the objectstore: a key/value SSD with transaction support would be ideal to solve these issues. First, it is a raw SSD device. Second, it provides a key/value interface directly from the SSD. Third, it can provide transaction support, so consistency is guaranteed by the hardware. It pretty much satisfies all of the objectstore's needs without any extra overhead, since there is no extra layer between the device and the objectstore. Either way, I strongly support Ceph having its own data format instead of relying on a filesystem. Regards, James -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 1:55 PM To: Somnath Roy Cc: ceph-devel@vger.kernel.org Subject: RE: newstore direction On Mon, 19 Oct 2015, Somnath Roy wrote: > Sage, > I fully support that. If we want to saturate SSDs , we need to get > rid of this filesystem overhead (which I am in process of measuring). > Also, it will be good if we can eliminate the dependency on the k/v > dbs (for storing allocators and all). The reason is the unknown write > amps they causes. My hope is to keep behing the KeyValueDB interface (and/more change it as appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash). sage > > Thanks & Regards > Somnath > > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 12:49 PM > To: ceph-devel@vger.kernel.org > Subject: newstore direction > > The current design is based on two simple ideas: > > 1) a key/value interface is better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, > write-ahead logging, overlay data, etc.)
> > 2) a file system is well suited for storage object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb > changes land... the kv commit is currently 2-3). So two people are > managing metadata, here: the fs managing the file metadata (with its > own > journal) and the kv backend (with its journal). > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but at > a minimum it is a couple btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard... > > - ...and file systems insist on updating mtime on writes, even when it is a > overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. > > - XFS is (probably) never going going to give us data checksums, which we > want desperately. > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep it > pretty simple, and manage it in kv store along with all of our other metadata. > > Wins: > > - 2 IOs for most: one to write the data to unused space in the block device, > one to commit our transaction (vs 4+ before). For overwrites, we'd have one > io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ > before). > > - No concern about mtime getting in the way > > - Faster reads (no fs lookup) > > - Similarly sized metadata for most objects. 
If we assume most objects are > not fragmented, then the metadata to store the block offsets is about the > same size as the metadata to store the filenames we have now. > > Problems: > > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put > metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of rgw > index data or cephfs metadata? Suddenly we are pulling storage out of a > different pool and those aren't currently fungible. > > - We have to write and maintain an allocator. I'm still optimistic this can > be reasonbly simple, especially for the flash case (where fragmentation isn't > such an issue as long as our blocks are reasonbly sized). For disk we may > beed to be moderately clever. > > - We'll need a fsck to ensure our internal metadata is consistent. The good > news is it'll just need to validate what we have stored in the kv store. > > Other thoughts: > > - We might want to consider whether
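The atomic, transactional write that both James's KV-SSD proposal and Sage's "keep behind the KeyValueDB interface" comment assume can be sketched as a minimal swappable interface. This only mimics the *shape* of such an API; the names (`Transaction`, `submit`, `MemKVBackend`) are invented for illustration, not Ceph's actual KeyValueDB methods, and the in-memory backend stands in for rocksdb, a btree store, or a transactional KV SSD:

```python
# Minimal transactional key/value interface: all mutations are batched
# and applied as one unit, so any conforming backend can be swapped in.

class Transaction:
    def __init__(self):
        self.ops = []                      # ordered (op, key[, value]) list

    def set(self, key, value):
        self.ops.append(("set", key, value))

    def rm(self, key):
        self.ops.append(("rm", key))

class MemKVBackend:
    """Trivial in-memory backend standing in for rocksdb / a KV SSD."""
    def __init__(self):
        self.data = {}

    def submit(self, txn):
        # A real backend must make this atomic and durable (WAL, or the
        # device's transaction support); here we just apply ops in order.
        for op in txn.ops:
            if op[0] == "set":
                self.data[op[1]] = op[2]
            else:
                self.data.pop(op[1], None)

db = MemKVBackend()
t = Transaction()
t.set(b"object/foo/extents", b"[(0, 8192)]")
t.set(b"object/foo/attrs", b"{}")
db.submit(t)          # extent map and attrs land together or not at all
```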
Re: newstore direction
On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil wrote: > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > a different pool and those aren't currently fungible. This is the concerning bit for me -- the other parts one "just" has to get the code right, but this problem could linger and be something we have to keep explaining to users indefinitely. It reminds me of cases in other systems where users had to make an educated guess about inode size up front, depending on whether you're expecting to efficiently store a lot of xattrs. In practice it's rare for users to make these kinds of decisions well up-front: it really needs to be adjustable later, ideally automatically. That could be pretty straightforward if the KV part was stored directly on block storage, instead of having XFS in the mix. I'm not quite up with the state of the art in this area: are there any reasonable alternatives for the KV part that would consume some defined range of a block device from userspace, instead of sitting on top of a filesystem? John
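John's "needs to be adjustable later" requirement amounts to a movable boundary between the kv region and the data region on a single raw device, rather than a fixed partition chosen up front. A hypothetical sketch of the bookkeeping (a real store would also have to migrate any extents sitting in the reassigned range):

```python
# Elastic split of one raw device between a kv area and a data area.
# Entirely illustrative; names and policy are made up for this sketch.

class ElasticDevice:
    def __init__(self, size, kv_size):
        self.size = size          # total raw device size in bytes
        self.kv_size = kv_size    # bytes currently assigned to the kv area
        self.data_used = 0        # bytes allocated in the data area

    def grow_kv(self, amount):
        # Only reassign space the data area isn't actually using; in a
        # real implementation this step would also relocate extents.
        if self.kv_size + amount + self.data_used > self.size:
            raise RuntimeError("no free space to reassign to the kv area")
        self.kv_size += amount
        return self.kv_size

dev = ElasticDevice(size=100 << 30, kv_size=10 << 30)  # 100 GiB, 10 GiB kv
dev.grow_kv(10 << 30)   # e.g. rgw index data grew: kv area is now 20 GiB
```

The point of the sketch is John's: because both regions share one device, the split is a number in metadata, not a partition table entry, so it can move automatically as rgw index or cephfs metadata grows.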
RE: newstore direction
+1. Nowadays K-V DBs care most about very small key-value pairs, say several bytes to a few KB, but in the SSD case we only care about 4KB or 8KB values. In that sense NVMKV is a good design, and it seems some SSD vendors are also trying to build this kind of interface; we have an NVM-L library, but it is still under development. > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > Sent: Tuesday, October 20, 2015 6:21 AM > To: Sage Weil; Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > Hi Sage and Somnath, > In my humble opinion, There is another more aggressive solution than raw > block device base keyvalue store as backend for objectstore. The new key > value SSD device with transaction support would be ideal to solve the > issues. > First of all, it is raw SSD device. Secondly , It provides key value interface > directly from SSD. Thirdly, it can provide transaction support, consistency > will > be guaranteed by hardware device. It pretty much satisfied all of objectstore > needs without any extra overhead since there is not any extra layer in > between device and objectstore. >Either way, I strongly support to have CEPH own data format instead of > relying on filesystem. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 1:55 PM > To: Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > Sage, > > I fully support that. If we want to saturate SSDs , we need to get > > rid of this filesystem overhead (which I am in process of measuring). > > Also, it will be good if we can eliminate the dependency on the k/v > > dbs (for storing allocators and all). The reason is the unknown write > > amps they causes.
> > My hope is to keep behing the KeyValueDB interface (and/more change it as > appropriate) so that other backends can be easily swapped in (e.g. a btree- > based one for high-end flash). > > sage > > > > > > Thanks & Regards > > Somnath > > > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 12:49 PM > > To: ceph-devel@vger.kernel.org > > Subject: newstore direction > > > > The current design is based on two simple ideas: > > > > 1) a key/value interface is better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storage object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb > > changes land... the kv commit is currently 2-3). So two people are > > managing metadata, here: the fs managing the file metadata (with its > > own > > journal) and the kv backend (with its journal). > > > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but at a > minimum it is a couple btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard... > > > > - ...and file systems insist on updating mtime on writes, even when it is a > overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. 
> > > > - XFS is (probably) never going going to give us data checksums, which we > want desperately. > > > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep it > pretty simple, and manage it in kv store along with all of our other metadata. > > > > Wins: > > > > - 2 IOs for most: one to write the data to unused space in the block > > device, > one to commit our transaction (vs 4+ before). For overwrites, we'd have one > io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ > before). > > > > - No concern about mtime getting in the way > > > > - Faster reads (no fs lookup) > > > > - Similarly sized metadata for most objects. If we assume most objects are > not fragmented, then the metadata to store the block offsets is about the > same size as the metadata to store the filenames we have now. > > > > Problems: > > > > - We have to size the kv backend storage (probably still an XFS > > partition) vs the block storage. Maybe
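The "we only care about 4KB or 8KB" model Xiaoxi describes boils down to chunking object data into fixed-size blocks and storing each block as one key/value pair, so reads and writes map one-to-one onto SSD-sized IOs. The key layout below is invented for illustration; it is not NVMKV's or Ceph's actual scheme:

```python
# Object data as fixed-size KV pairs: one key per 4KB block, keyed by
# (object name, block index). A plain dict stands in for the KV device.

BLOCK = 4096

def block_key(obj, index):
    # made-up key layout: "data/<object>/<zero-padded block index>"
    return f"data/{obj}/{index:08d}".encode()

def write_object(kv, obj, data):
    for i in range(0, len(data), BLOCK):
        kv[block_key(obj, i // BLOCK)] = data[i:i + BLOCK]

def read_object(kv, obj, length):
    nblocks = -(-length // BLOCK)          # ceil division
    return b"".join(kv[block_key(obj, i)] for i in range(nblocks))[:length]

kv = {}                                    # stand-in for the KV backend
write_object(kv, "rbd_data.1", b"x" * 10000)   # stored as 3 block values
```

With this layout a partial overwrite only touches the affected block keys, which is exactly why value sizes in the 4KB/8KB range, not tiny pairs, are what matter here.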
RE: newstore direction
There is something like http://pmem.io/nvml/libpmemobj/ to adapt NVMe to transactional object storage, but it definitely needs some more work. > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Varada Kari > Sent: Tuesday, October 20, 2015 10:33 AM > To: James (Fei) Liu-SSI; Sage Weil; Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > Hi James, > > Are you mentioning SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family) ? > If SCSI OSD is what you are mentioning, drive has to support all osd > functionality mentioned by T10. > If not, we have to implement the same functionality in kernel or have a > wrapper in user space to convert them to read/write calls. This seems more > effort. > > Varada > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > > Sent: Tuesday, October 20, 2015 3:51 AM > > To: Sage Weil; Somnath Roy > > > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: newstore direction > > > > Hi Sage and Somnath, > > In my humble opinion, There is another more aggressive solution > > than raw block device base keyvalue store as backend for objectstore. > > The new key value SSD device with transaction support would be ideal > > to solve the issues. First of all, it is raw SSD device. Secondly , It > > provides key value interface directly from SSD. Thirdly, it can > > provide transaction support, consistency will be guaranteed by > > hardware device. It pretty much satisfied all of objectstore needs > > without any extra overhead since there is not any extra layer in between > device and objectstore. > >Either way, I strongly support to have CEPH own data format instead > > of relying on filesystem.
> > > > Regards, > > James > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > > ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 1:55 PM > > To: Somnath Roy > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: newstore direction > > > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > > Sage, > > > I fully support that. If we want to saturate SSDs , we need to get > > > rid of this filesystem overhead (which I am in process of measuring). > > > Also, it will be good if we can eliminate the dependency on the k/v > > > dbs (for storing allocators and all). The reason is the unknown > > > write amps they causes. > > > > My hope is to keep behing the KeyValueDB interface (and/more change it > > as > > appropriate) so that other backends can be easily swapped in (e.g. a > > btree- based one for high-end flash). > > > > sage > > > > > > > > > > Thanks & Regards > > > Somnath > > > > > > > > > -Original Message- > > > From: ceph-devel-ow...@vger.kernel.org > > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > > Sent: Monday, October 19, 2015 12:49 PM > > > To: ceph-devel@vger.kernel.org > > > Subject: newstore direction > > > > > > The current design is based on two simple ideas: > > > > > > 1) a key/value interface is better way to manage all of our > > > internal metadata (object metadata, attrs, layout, collection > > > membership, write-ahead logging, overlay data, etc.) > > > > > > 2) a file system is well suited for storage object data (as files). > > > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. > > > A few > > > things: > > > > > > - We currently write the data to the file, fsync, then commit the > > > kv transaction. That's at least 3 IOs: one for the data, one for > > > the fs journal, one for the kv txn to commit (at least once my > > > rocksdb changes land... the kv commit is currently 2-3). 
So two > > > people are managing metadata, here: the fs managing the file > > > metadata (with its own > > > journal) and the kv backend (with its journal). > > > > > > - On read we have to open files by name, which means traversing the > > > fs > > namespace. Newstore tries to keep it as flat and simple as possible, > > but at a minimum it is a couple btree lookups. We'd love to use open > > by handle (which would reduce this to 1 btree traversal), but running > > the daemon as ceph and not root makes that hard... > > > > > > - ...and file systems insist on updating mtime on writes, even when > > > it is a > > overwrite with no allocation changes. (We don't care about mtime.) > > O_NOCMTIME patches exist but it is hard to get these past the kernel > > brainfreeze. > > > > > > - XFS is (probably) never going going to give us data checksums, > > > which we > > want desperately. > > > > > > But what's the alternative? My thought is to just bite the bullet > > > and > > consume a raw block device directly. Write an allocator, hopefully > > keep it pretty simple, and manage it in kv store along with
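The transactional-object-storage idea behind libraries like libpmemobj (and the KV-SSD proposals earlier in the thread) has a common shape: never update committed state in place; mutate a shadow copy, then publish it with a single atomic switch, so a crash mid-transaction leaves the old state intact. The Python below is only a schematic of that pattern, not the libpmemobj C API (which exposes C transactions over persistent memory):

```python
# Copy-on-write transaction schematic: mutate a shadow, then commit it
# with one atomic root swap. Names here are invented for illustration.

import copy

class ObjectStore:
    def __init__(self):
        self.root = {}                      # committed, "durable" state

    def transact(self, mutate):
        shadow = copy.deepcopy(self.root)   # copy-on-write shadow
        mutate(shadow)                      # all updates go to the copy
        # On real hardware: persist the shadow, fence, then perform one
        # atomic pointer swap. Crashing before the swap leaves the old
        # root intact, so the store is always consistent.
        self.root = shadow

store = ObjectStore()
store.transact(lambda s: s.update({"foo": b"v1"}))

def crashy(s):
    s["bar"] = b"half-done"
    raise RuntimeError("crash mid-transaction")

try:
    store.transact(crashy)
except RuntimeError:
    pass    # the shadow is dropped; committed state is untouched
```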
RE: newstore direction
Hi James, Are you referring to SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family)? If so, the drive has to support all the osd functionality specified by T10. If not, we have to implement the same functionality in the kernel or have a wrapper in user space to convert the calls to reads/writes, which seems like more effort. Varada > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > Sent: Tuesday, October 20, 2015 3:51 AM > To: Sage Weil; Somnath Roy > > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > Hi Sage and Somnath, > In my humble opinion, There is another more aggressive solution than raw > block device base keyvalue store as backend for objectstore. The new key > value SSD device with transaction support would be ideal to solve the > issues. First of all, it is raw SSD device. Secondly , It provides key value > interface directly from SSD. Thirdly, it can provide transaction support, > consistency will be guaranteed by hardware device. It pretty much satisfied > all of objectstore needs without any extra overhead since there is not any > extra layer in between device and objectstore. >Either way, I strongly support to have CEPH own data format instead of > relying on filesystem. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 1:55 PM > To: Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > Sage, > > I fully support that. If we want to saturate SSDs , we need to get > > rid of this filesystem overhead (which I am in process of measuring). > > Also, it will be good if we can eliminate the dependency on the k/v > > dbs (for storing allocators and all). The reason is the unknown write > > amps they causes.
> > My hope is to keep behing the KeyValueDB interface (and/more change it as > appropriate) so that other backends can be easily swapped in (e.g. a btree- > based one for high-end flash). > > sage > > > > > > Thanks & Regards > > Somnath > > > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 12:49 PM > > To: ceph-devel@vger.kernel.org > > Subject: newstore direction > > > > The current design is based on two simple ideas: > > > > 1) a key/value interface is better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storage object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb > > changes land... the kv commit is currently 2-3). So two people are > > managing metadata, here: the fs managing the file metadata (with its > > own > > journal) and the kv backend (with its journal). > > > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but at a > minimum it is a couple btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard... > > > > - ...and file systems insist on updating mtime on writes, even when it is a > overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. 
> > > > - XFS is (probably) never going going to give us data checksums, which we > want desperately. > > > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep it > pretty simple, and manage it in kv store along with all of our other metadata. > > > > Wins: > > > > - 2 IOs for most: one to write the data to unused space in the block > > device, > one to commit our transaction (vs 4+ before). For overwrites, we'd have one > io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ > before). > > > > - No concern about mtime getting in the way > > > > - Faster reads (no fs lookup) > > > > - Similarly sized metadata for most objects. If we assume most objects are > not fragmented, then the metadata to store the block offsets is about the > same size as the metadata to store the filenames we have now. > > > > Problems: > > > > - We have to size the kv backend
Re: newstore direction
On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil wrote: > The current design is based on two simple ideas: > > 1) a key/value interface is better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, > write-ahead logging, overlay data, etc.) > > 2) a file system is well suited for storage object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb changes > land... the kv commit is currently 2-3). So two people are managing > metadata, here: the fs managing the file metadata (with its own > journal) and the kv backend (with its journal). > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but > at a minimum it is a couple btree lookups. We'd love to use open by > handle (which would reduce this to 1 btree traversal), but running > the daemon as ceph and not root makes that hard... > > - ...and file systems insist on updating mtime on writes, even when it is > a overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. > > - XFS is (probably) never going going to give us data checksums, which we > want desperately. > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep > it pretty simple, and manage it in kv store along with all of our other > metadata. This is really a tough decision, although the idea of a block-device-based objectstore has never left my mind these past two years.
We would be much more concerned about the efficiency of space utilization compared to a local fs, about bugs, and about the time it would take to build even a tiny local filesystem. I'm a little afraid of what we would get stuck into. > > Wins: > > - 2 IOs for most: one to write the data to unused space in the block > device, one to commit our transaction (vs 4+ before). For overwrites, > we'd have one io to do our write-ahead log (kv journal), then do > the overwrite async (vs 4+ before). Compared to FileJournal, it seems keyvaluedb doesn't play well in the WAL area, from my perf results. > > - No concern about mtime getting in the way > > - Faster reads (no fs lookup) > > - Similarly sized metadata for most objects. If we assume most objects > are not fragmented, then the metadata to store the block offsets is about > the same size as the metadata to store the filenames we have now. > > Problems: > > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > a different pool and those aren't currently fungible. > > - We have to write and maintain an allocator. I'm still optimistic this > can be reasonbly simple, especially for the flash case (where > fragmentation isn't such an issue as long as our blocks are reasonbly > sized). For disk we may beed to be moderately clever. > > - We'll need a fsck to ensure our internal metadata is consistent. The > good news is it'll just need to validate what we have stored in the kv > store. > > Other thoughts: > > - We might want to consider whether dm-thin or bcache or other block > layers might help us with elasticity of file vs block areas. > > - Rocksdb can push colder data to a second directory, so we could have a > fast ssd primary area (for wal and most metadata) and a second hdd > directory for stuff it has to push off.
Then have a conservative amount > of file space on the hdd. If our block fills up, use the existing file > mechanism to put data there too. (But then we have to maintain both the > current kv + file approach and not go all-in on kv + block.) A complex way... Actually I would like to pursue a FileStore2 implementation, which means we still use FileJournal (or the like), but we employ more memory to keep metadata/xattrs and use aio+dio to flush to disk. A userspace pagecache would need to be implemented. Then we can skip the journal for full writes; because the osd has pg isolation, we could make a barrier for a single pg when skipping the journal. @Sage: are there other concerns about filestore skipping the journal? In a word, I like the model that filestore has, but the existing implementation needs a big refactor. Sorry to disturb the thought. > > Thoughts? > sage -- Best Regards, Wheat --
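Haomai's "skip journal if full write" idea is a per-write policy decision: journal small overwrites through the WAL so they commit atomically, but let full-object (or fully block-aligned) writes go straight to newly allocated space with a single metadata commit, avoiding the double write. A hedged sketch of how such a dispatcher might classify writes; the block size and the exact rules are invented for illustration:

```python
# Write-path classifier: "direct" writes go to fresh extents and are
# published by one kv commit; "wal" writes are journaled first because
# they are partial, unaligned read-modify-writes. Illustrative only.

BLOCK = 4096

def choose_write_path(offset, length, object_size):
    full_rewrite = offset == 0 and length >= object_size
    aligned = offset % BLOCK == 0 and length % BLOCK == 0
    if full_rewrite or aligned:
        # Write data to fresh extents, then one kv commit flips the
        # extent map; the old data stays valid until the commit lands,
        # which is what makes skipping the journal safe.
        return "direct"
    # Partial, unaligned overwrite: needs the kv write-ahead log.
    return "wal"

print(choose_write_path(0, 1 << 20, 1 << 20))   # full rewrite
print(choose_write_path(100, 512, 1 << 20))     # small unaligned overwrite
```

The pg barrier Haomai mentions would sit above this: within one pg, a "direct" write must not be reordered past journaled writes it depends on.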
Re: [PATCH v3] net: ceph: messenger: Use local variable cursor instead of &msg->cursor
On Mon, Oct 19, 2015 at 6:29 PM, Shraddha Barke wrote: > Use local variable cursor in place of &msg->cursor in > read_partial_msg_data() and write_partial_msg_data() > > Signed-off-by: Shraddha Barke > --- > Changes in v3- > Replace &msg->cursor with cursor in write_partial_msg_data() too > > net/ceph/messenger.c | 12 ++-- > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c > index b9b0e3b..01b567b 100644 > --- a/net/ceph/messenger.c > +++ b/net/ceph/messenger.c > @@ -1552,8 +1552,8 @@ static int write_partial_message_data(struct > ceph_connection *con) > bool need_crc; > int ret; > > - page = ceph_msg_data_next(&msg->cursor, &page_offset, &length, > - &last_piece); > + page = ceph_msg_data_next(cursor, &page_offset, &length, > + &last_piece); > ret = ceph_tcp_sendpage(con->sock, page, page_offset, > length, !last_piece); > if (ret <= 0) { > @@ -1564,7 +1564,7 @@ static int write_partial_message_data(struct > ceph_connection *con) > } > if (do_datacrc && cursor->need_crc) > crc = ceph_crc32c_page(crc, page, page_offset, > length); > - need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret); > + need_crc = ceph_msg_data_advance(cursor, (size_t)ret); > } > > dout("%s %p msg %p done\n", __func__, con, msg); > @@ -2246,8 +2246,8 @@ static int read_partial_msg_data(struct ceph_connection > *con) > if (do_datacrc) > crc = con->in_data_crc; > while (cursor->resid) { > - page = ceph_msg_data_next(&msg->cursor, &page_offset, &length, > - NULL); > + page = ceph_msg_data_next(cursor, &page_offset, &length, > + NULL); > ret = ceph_tcp_recvpage(con->sock, page, page_offset, length); > if (ret <= 0) { > if (do_datacrc) > @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection > *con) > > if (do_datacrc) > crc = ceph_crc32c_page(crc, page, page_offset, ret); > - (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret); > + (void) ceph_msg_data_advance(cursor, (size_t)ret); > } > if (do_datacrc) > con->in_data_crc = crc; Applied, see
https://github.com/ceph/ceph-client/commit/621a56fd69751d263795f4f35e65eff7daa3a470. Generally, we prefix net/ceph commits with "libceph". Different subsystems have different conventions, use "git log --oneline " to get a sense of what is preferred. Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd crash when deep-scrubbing
Jiaying Ren <gmail.com> writes:
>
> Hi, cephers:
>
> I've encountered a problem where a pg is stuck in inconsistent status:
>
> $ ceph -s
>     cluster 27d39faa-48ae-4356-a8e3-19d5b81e179e
>      health HEALTH_ERR 1 pgs inconsistent; 34 near full osd(s); 1
>             scrub errors; noout flag(s) set
>      monmap e4: 3 mons at
>             {server-61.0..x.in=10.8.0.61:6789/0,server-62.0..x.in=10.8.0.62:6789/0,server-63.0..x.in=10.8.0.63:6789/0},
>             election epoch 6706, quorum 0,1,2
>             server-61.0..x.in,server-62.0..x.in,server-63.0..x.in
>      osdmap e87808: 180 osds: 180 up, 180 in
>             flags noout
>       pgmap v29322850: 35026 pgs, 15 pools, 27768 GB data, 1905 kobjects
>             83575 GB used, 114 TB / 196 TB avail
>                35025 active+clean
>                    1 active+clean+inconsistent
>   client io 120 kB/s rd, 216 MB/s wr, 6398 op/s
>
> The `pg repair` cmd doesn't work, so I manually repaired an inconsistent
> object (pool size is 3; I removed the copy that differed from the other
> two). After that, the pg is still in inconsistent status:
>
> $ ceph pg dump | grep active+clean+inconsistent
> dumped all in format plain
> 3.d70 290 0 0 0 4600869888 30503050
> stale+active+clean+inconsistent 2015-10-18 13:05:43.320451
> 87798'7631234 87798:10758311 [131,119,132] 131
> [131,119,132] 131 85161'7599152 2015-10-16 14:34:21.283303
> 85161'7599152 2015-10-16 14:34:21.283303
>
> And after restarting osd.131, the primary osd (osd.131) crashes with
> this backtrace:
>
>  1: /usr/bin/ceph-osd() [0x9c6de1]
>  2: (()+0xf790) [0x7f384b6b8790]
>  3: (gsignal()+0x35) [0x7f384a58a625]
>  4: (abort()+0x175) [0x7f384a58be05]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f384ae44a5d]
>  6: (()+0xbcbe6) [0x7f384ae42be6]
>  7: (()+0xbcc13) [0x7f384ae42c13]
>  8: (()+0xbcd0e) [0x7f384ae42d0e]
>  9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x9cd0de]
>  10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x81) [0x7dfaf1]
>  11: (PG::_scan_snaps(ScrubMap&)+0x394) [0x84b8c4]
>  12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
>      ThreadPool::TPHandle&)+0x27b) [0x84cdab]
>  13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x5c4) [0x85c1b4]
>  14: (PG::scrub(ThreadPool::TPHandle&)+0x181) [0x85d691]
>  15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x1c) [0x6737cc]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x53d) [0x9e05dd]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0x9e1760]
>  18: (()+0x7a51) [0x7f384b6b0a51]
>  19: (clone()+0x6d) [0x7f384a6409ad]
>
> The ceph version is v0.80.9. Manually running `ceph pg deep-scrub 3.d70`
> also causes the osd to crash.
>
> Any ideas? Or did I miss some logs necessary for further investigation?
>
> Thx.
>
> --
> Best Regards!
> Jiaying Ren (mikulely)

I have hit the same problem when running the `ceph pg deep-scrub`
command: it also causes an osd crash. In the end I found that some
sectors of the disk had gone bad, so please check the dmesg output for
disk errors.
[PATCH v3] net: ceph: messenger: Use local variable cursor instead of &msg->cursor
Use local variable cursor in place of &msg->cursor in
read_partial_msg_data() and write_partial_msg_data()

Signed-off-by: Shraddha Barke
---
Changes in v3-
Replace &msg->cursor with cursor in write_partial_msg_data() too

 net/ceph/messenger.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index b9b0e3b..01b567b 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1552,8 +1552,8 @@ static int write_partial_message_data(struct ceph_connection *con)
 		bool need_crc;
 		int ret;
 
-		page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
-					  &last_piece);
+		page = ceph_msg_data_next(cursor, &page_offset, &length,
+					  &last_piece);
 		ret = ceph_tcp_sendpage(con->sock, page, page_offset,
 					length, !last_piece);
 		if (ret <= 0) {
@@ -1564,7 +1564,7 @@ static int write_partial_message_data(struct ceph_connection *con)
 		}
 		if (do_datacrc && cursor->need_crc)
 			crc = ceph_crc32c_page(crc, page, page_offset, length);
-		need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
+		need_crc = ceph_msg_data_advance(cursor, (size_t)ret);
 	}
 
 	dout("%s %p msg %p done\n", __func__, con, msg);
@@ -2246,8 +2246,8 @@ static int read_partial_msg_data(struct ceph_connection *con)
 	if (do_datacrc)
 		crc = con->in_data_crc;
 	while (cursor->resid) {
-		page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
-					  NULL);
+		page = ceph_msg_data_next(cursor, &page_offset, &length,
+					  NULL);
 		ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
 		if (ret <= 0) {
 			if (do_datacrc)
@@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection *con)
 
 		if (do_datacrc)
 			crc = ceph_crc32c_page(crc, page, page_offset, ret);
-		(void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
+		(void) ceph_msg_data_advance(cursor, (size_t)ret);
 	}
 	if (do_datacrc)
 		con->in_data_crc = crc;
-- 
2.1.4
[PATCH] rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
Returning pool id (i.e. >= 0) from a sysfs ->store() callback makes
userspace think it needs to retry the write. Fix it - it's a leftover
from the times when the equivalent of rbd_dev_create() was the first
action in rbd_add().

Signed-off-by: Ilya Dryomov
---
 drivers/block/rbd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 07f666f4ca18..df795deffe77 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5394,7 +5394,7 @@ static ssize_t do_rbd_add(struct bus_type *bus,
 	struct rbd_spec *spec = NULL;
 	struct rbd_client *rbdc;
 	bool read_only;
-	int rc = -ENOMEM;
+	int rc;
 
 	if (!try_module_get(THIS_MODULE))
 		return -ENODEV;
@@ -5429,8 +5429,10 @@ static ssize_t do_rbd_add(struct bus_type *bus,
 	}
 
 	rbd_dev = rbd_dev_create(rbdc, spec, rbd_opts);
-	if (!rbd_dev)
+	if (!rbd_dev) {
+		rc = -ENOMEM;
 		goto err_out_client;
+	}
 	rbdc = NULL;		/* rbd_dev now owns this */
 	spec = NULL;		/* rbd_dev now owns this */
 	rbd_opts = NULL;	/* rbd_dev now owns this */
-- 
2.4.3
[PATCH] rbd: set device_type::release instead of device::release
No point in providing an empty device_type::release callback and then
setting device::release for each rbd_dev dynamically.

Signed-off-by: Ilya Dryomov
---
 drivers/block/rbd.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4917de726bdb..fa6767e9ed2a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3986,14 +3986,12 @@ static const struct attribute_group *rbd_attr_groups[] = {
 	NULL
 };
 
-static void rbd_sysfs_dev_release(struct device *dev)
-{
-}
+static void rbd_dev_release(struct device *dev);
 
 static struct device_type rbd_device_type = {
 	.name		= "rbd",
 	.groups		= rbd_attr_groups,
-	.release	= rbd_sysfs_dev_release,
+	.release	= rbd_dev_release,
 };
 
 static struct rbd_spec *rbd_spec_get(struct rbd_spec *spec)
@@ -4074,7 +4072,6 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 	rbd_dev->dev.bus = &rbd_bus_type;
 	rbd_dev->dev.type = &rbd_device_type;
 	rbd_dev->dev.parent = &rbd_root_dev;
-	rbd_dev->dev.release = rbd_dev_release;
 	device_initialize(&rbd_dev->dev);
 
 	rbd_dev->rbd_client = rbdc;
-- 
2.4.3
[PATCH] rbd: don't free rbd_dev outside of the release callback
struct rbd_device has struct device embedded in it, which means it's
part of kobject universe and has an unpredictable life cycle. Freeing
its memory outside of the release callback is flawed, yet commits
200a6a8be5db ("rbd: don't destroy rbd_dev in device release function")
and 8ad42cd0c002 ("rbd: don't have device release destroy rbd_dev")
moved rbd_dev_destroy() out to rbd_dev_image_release().

This commit reverts most of that, the key points are:

- rbd_dev->dev is initialized in rbd_dev_create(), making it possible
  to use rbd_dev_destroy() - which is just a put_device() - both before
  we register with device core and after.

- rbd_dev_release() (the release callback) is the only place we
  kfree(rbd_dev). It's also where we do module_put(), keeping the
  module unload race window as small as possible.

- We pin the module in rbd_dev_create(), but only for mapping
  rbd_dev-s.

Moving image related stuff out of struct rbd_device into another struct
which isn't tied with sysfs and device core is long overdue, but until
that happens, this will keep rbd module refcount (which users can
observe with lsmod) sane.
Fixes: http://tracker.ceph.com/issues/12697
Cc: Alex Elder
Signed-off-by: Ilya Dryomov
---
 drivers/block/rbd.c | 89 ++++++++++++++++++++++++++---------------------------
 1 file changed, 47 insertions(+), 42 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index df795deffe77..4917de726bdb 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -418,8 +418,6 @@ MODULE_PARM_DESC(single_major, "Use a single major number for all rbd devices (d
 
 static int rbd_img_request_submit(struct rbd_img_request *img_request);
 
-static void rbd_dev_device_release(struct device *dev);
-
 static ssize_t rbd_add(struct bus_type *bus, const char *buf,
 		       size_t count);
 static ssize_t rbd_remove(struct bus_type *bus, const char *buf,
@@ -4038,6 +4036,25 @@ static void rbd_spec_free(struct kref *kref)
 	kfree(spec);
 }
 
+static void rbd_dev_release(struct device *dev)
+{
+	struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
+	bool need_put = !!rbd_dev->opts;
+
+	rbd_put_client(rbd_dev->rbd_client);
+	rbd_spec_put(rbd_dev->spec);
+	kfree(rbd_dev->opts);
+	kfree(rbd_dev);
+
+	/*
+	 * This is racy, but way better than putting module outside of
+	 * the release callback.  The race window is pretty small, so
+	 * doing something similar to dm (dm-builtin.c) is overkill.
+	 */
+	if (need_put)
+		module_put(THIS_MODULE);
+}
+
 static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 					 struct rbd_spec *spec,
 					 struct rbd_options *opts)
@@ -4054,6 +4071,12 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 	INIT_LIST_HEAD(&rbd_dev->node);
 	init_rwsem(&rbd_dev->header_rwsem);
 
+	rbd_dev->dev.bus = &rbd_bus_type;
+	rbd_dev->dev.type = &rbd_device_type;
+	rbd_dev->dev.parent = &rbd_root_dev;
+	rbd_dev->dev.release = rbd_dev_release;
+	device_initialize(&rbd_dev->dev);
+
 	rbd_dev->rbd_client = rbdc;
 	rbd_dev->spec = spec;
 	rbd_dev->opts = opts;
@@ -4065,15 +4088,21 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 	rbd_dev->layout.fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
 	rbd_dev->layout.fl_pg_pool = cpu_to_le32((u32) spec->pool_id);
 
+	/*
+	 * If this is a mapping rbd_dev (as opposed to a parent one),
+	 * pin our module.  We have a ref from do_rbd_add(), so use
+	 * __module_get().
+	 */
+	if (rbd_dev->opts)
+		__module_get(THIS_MODULE);
+
 	return rbd_dev;
 }
 
 static void rbd_dev_destroy(struct rbd_device *rbd_dev)
 {
-	rbd_put_client(rbd_dev->rbd_client);
-	rbd_spec_put(rbd_dev->spec);
-	kfree(rbd_dev->opts);
-	kfree(rbd_dev);
+	if (rbd_dev)
+		put_device(&rbd_dev->dev);
 }
 
 /*
@@ -4699,27 +4728,6 @@ static int rbd_dev_header_info(struct rbd_device *rbd_dev)
 	return rbd_dev_v2_header_info(rbd_dev);
 }
 
-static int rbd_bus_add_dev(struct rbd_device *rbd_dev)
-{
-	struct device *dev;
-	int ret;
-
-	dev = &rbd_dev->dev;
-	dev->bus = &rbd_bus_type;
-	dev->type = &rbd_device_type;
-	dev->parent = &rbd_root_dev;
-	dev->release = rbd_dev_device_release;
-	dev_set_name(dev, "%d", rbd_dev->dev_id);
-	ret = device_register(dev);
-
-	return ret;
-}
-
-static void rbd_bus_del_dev(struct rbd_device *rbd_dev)
-{
-	device_unregister(&rbd_dev->dev);
-}
-
 /*
  * Get a unique rbd identifier for the given new rbd_dev, and add
  * the rbd_dev to the global list.
@@ -5222,7 +5230,8 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev)
 	set_capacity(rbd_dev->disk, rbd_dev->mapping.size /