Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread Stefan Hajnoczi
On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> One of Sage's possible plans for Manilla integration would use nfs over the 
> new Linux  vmci sockets transport integration in qemu (below) to access 
> Cephfs via an nfs-ganesha server running in the host vm.

Excellent job!  Nice to see you were able to add AF_VSOCK support to
nfs-ganesha so quickly.

I'm currently working on kernel nfsd support and will send the patches
to linux-nfs and CC you.

Stefan


RE: chooseleaf may cause some unnecessary pg migrations

2015-10-19 Thread Xusangdi
I just realized the measurement I mentioned last time is not precise. It should
be 'number of changed mappings' rather than 'number of remapped PGs'.
For example, [2,1,7] -> [0,7,4] should be regarded differently from [2,1,7] ->
[1,7,4]: the first change brings two new OSDs into the set instead of one, so it
causes twice the data transfer.
Could this be the reason for your test results?
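
To make the distinction concrete, here is a small, self-contained C++ sketch. It
is an illustration only (not crushtool code, all names are invented, and it
ignores the positional effects that matter for erasure-coded pools); it counts
how many OSDs in each new mapping did not already hold the PG, i.e. how many
fresh copies must be built, alongside the coarser remapped-PG count:

// changed_mappings.cc: illustration only, not part of crushtool.
// Compares the per-PG "changed mappings" metric (OSDs that newly appear in
// a PG's acting set and therefore need a full copy) with the coarser
// "remapped PGs" count.
#include <algorithm>
#include <iostream>
#include <vector>

using Mapping = std::vector<int>;   // acting set of one PG, e.g. {2,1,7}

static int changed_osds(const Mapping& oldm, const Mapping& newm) {
  int changed = 0;
  for (int osd : newm)
    if (std::find(oldm.begin(), oldm.end(), osd) == oldm.end())
      ++changed;                    // this OSD must receive a full copy
  return changed;
}

int main() {
  std::vector<Mapping> before = {{2, 1, 7}, {2, 1, 7}};
  std::vector<Mapping> after  = {{0, 7, 4}, {1, 7, 4}};

  int remapped_pgs = 0, changed_mappings = 0;
  for (size_t pg = 0; pg < before.size(); ++pg) {
    if (before[pg] != after[pg])
      ++remapped_pgs;
    changed_mappings += changed_osds(before[pg], after[pg]);
  }
  std::cout << "remapped PGs: " << remapped_pgs
            << ", changed mappings: " << changed_mappings << "\n";
  return 0;
}

For the two transitions above this prints 2 remapped PGs but 3 changed mappings
(2 for the first PG, only 1 for the second), which is exactly the doubled data
transfer described.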

> -Original Message-
> From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com]
> Sent: Monday, October 19, 2015 3:34 PM
> To: Sage Weil; xusangdi 11976 (RD)
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: chooseleaf may cause some unnecessary pg migrations
>
> Thanks sage.
>
> Retesting by using --test --weight ${rand} 0 , still in 40 OSDs, 10 per host 
> case:
>
> New code average on 204.31 while old code average on 202.31
>
>
> > -Original Message-
> > From: Sage Weil [mailto:s...@newdream.net]
> > Sent: Monday, October 19, 2015 10:18 AM
> > To: Xusangdi
> > Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
> > Subject: RE: chooseleaf may cause some unnecessary pg migrations
> >
> > On Mon, 19 Oct 2015, Xusangdi wrote:
> > >
> > > > -Original Message-
> > > > From: ceph-devel-ow...@vger.kernel.org
> > > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Chen,
> > > > Xiaoxi
> > > > Sent: Monday, October 19, 2015 9:11 AM
> > > > To: xusangdi 11976 (RD)
> > > > Cc: ceph-devel@vger.kernel.org
> > > > Subject: RE: chooseleaf may cause some unnecessary pg migrations
> > > >
> > > > Sorry but not following...
> > > >
> > > > > then shut down one or more osds (please don't touch the
> > > > > crushmap, just stop the osd service or kill
> > > > its process).
> > > >
> > > > In this case, OSD is only down but not out, but will be marked out
> > > > after
> > 300s.
> > > >
> > > > So in what case your patch is helping?
> > > >
> > > >   If you said your patch helps on "down and out" , then my
> > > > experiment is exactly the case,
> > > >
> > >
> > > I am afraid it is probably not. Could you tell me how did you
> > > simulate the osd "down and out" situation using crushtool? If it was
> > > done by arguments such as '--remove-item' or 'reweight-item', it
> > > modified the crushmap and is not what I'm aiming for.
> >
> > There is a --weight argument (noted in usage near --test, which is the
> > only piece that uses it).  The crush map is not modified--only the
> > weight vector that is passed in when a mapping is calculated (which is
> > the equivalent of the in/out state in Ceph's OSDMap).  This should let you 
> > simulate this case.
> >
> > When I'm debugging/understanding these issues I usually change the
> > dprintk #define at the top of crush/mapper.c and use crushtool or
> > osdmaptool to calculate a single mapping, comparing the log before and
> > after a particular change.
> >
> > sage

RE: chooseleaf may cause some unnecessary pg migrations

2015-10-19 Thread Chen, Xiaoxi
Thanks, Sage.

Retesting using --test --weight ${rand} 0, still in the 40 OSDs, 10 per host
case:

The new code averages 204.31 while the old code averages 202.31.


> -Original Message-
> From: Sage Weil [mailto:s...@newdream.net]
> Sent: Monday, October 19, 2015 10:18 AM
> To: Xusangdi
> Cc: Chen, Xiaoxi; ceph-devel@vger.kernel.org
> Subject: RE: chooseleaf may cause some unnecessary pg migrations
> 
> On Mon, 19 Oct 2015, Xusangdi wrote:
> >
> > > -Original Message-
> > > From: ceph-devel-ow...@vger.kernel.org
> > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Chen, Xiaoxi
> > > Sent: Monday, October 19, 2015 9:11 AM
> > > To: xusangdi 11976 (RD)
> > > Cc: ceph-devel@vger.kernel.org
> > > Subject: RE: chooseleaf may cause some unnecessary pg migrations
> > >
> > > Sorry but not following...
> > >
> > > > then shut down one or more osds (please don't touch the crushmap,
> > > > just stop the osd service or kill
> > > its process).
> > >
> > > In this case, OSD is only down but not out, but will be marked out after
> 300s.
> > >
> > > So in what case your patch is helping?
> > >
> > >   If you said your patch helps on "down and out" , then my
> > > experiment is exactly the case,
> > >
> >
> > I am afraid it is probably not. Could you tell me how did you simulate
> > the osd "down and out" situation using crushtool? If it was done by
> > arguments such as '--remove-item' or 'reweight-item', it modified the
> > crushmap and is not what I'm aiming for.
> 
> There is a --weight argument (noted in usage near --test, which is the only
> piece that uses it).  The crush map is not modified--only the weight vector
> that is passed in when a mapping is calculated (which is the equivalent of the
> in/out state in Ceph's OSDMap).  This should let you simulate this case.
> 
> When I'm debugging/understanding these issues I usually change the dprintk
> #define at the top of crush/mapper.c and use crushtool or osdmaptool to
> calculate a single mapping, comparing the log before and after a particular
> change.
> 
> sage


Re: [PATCH v2] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()

2015-10-19 Thread Alex Elder
On 10/18/2015 09:49 PM, Shraddha Barke wrote:
> Use local variable cursor in place of &msg->cursor in
>  read_partial_msg_data()
> 
> Signed-off-by: Shraddha Barke 

This is a pretty minor comment, but the "Net" in your
subject line is probably better *not* capitalized.

-Alex

> ---
> Changes in v2-
>  Drop incorrect use of cursor
> 
>  net/ceph/messenger.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..b087edd 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2246,7 +2246,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
>   if (do_datacrc)
>   crc = con->in_data_crc;
>   while (cursor->resid) {
> - page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> + page = ceph_msg_data_next(cursor, &page_offset, &length,
>   NULL);
>   ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
>   if (ret <= 0) {
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
>  
>   if (do_datacrc)
>   crc = ceph_crc32c_page(crc, page, page_offset, ret);
> - (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> + (void) ceph_msg_data_advance(cursor, (size_t)ret);
>   }
>   if (do_datacrc)
>   con->in_data_crc = crc;
> 



Re: [PATCH v2] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()

2015-10-19 Thread Ilya Dryomov
On Mon, Oct 19, 2015 at 4:49 AM, Shraddha Barke  wrote:
> Use local variable cursor in place of &msg->cursor in
>  read_partial_msg_data()
>
> Signed-off-by: Shraddha Barke 
> ---
> Changes in v2-
>  Drop incorrect use of cursor
>
>  net/ceph/messenger.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..b087edd 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2246,7 +2246,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
> if (do_datacrc)
> crc = con->in_data_crc;
> while (cursor->resid) {
> -   page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> +   page = ceph_msg_data_next(cursor, &page_offset, &length,
> NULL);
> ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
> if (ret <= 0) {
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
>
> if (do_datacrc)
> crc = ceph_crc32c_page(crc, page, page_offset, ret);
> -   (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +   (void) ceph_msg_data_advance(cursor, (size_t)ret);
> }
> if (do_datacrc)
> con->in_data_crc = crc;

This looks good.  I'd like you to update this patch to do the same for
write_partial_message_data().

Thanks,

Ilya


ceph branch status

2015-10-19 Thread ceph branch robot
-- All Branches --

Adam C. Emerson 
2015-10-16 13:49:09 -0400   wip-cxx11time
2015-10-17 13:20:15 -0400   wip-cxx11concurrency

Adam Crume 
2014-12-01 20:45:58 -0800   wip-doc-rbd-replay

Alfredo Deza 
2015-03-23 16:39:48 -0400   wip-11212

Alfredo Deza 
2014-07-08 13:58:35 -0400   wip-8679
2014-09-04 13:58:14 -0400   wip-8366
2014-10-13 11:10:10 -0400   wip-9730

Ali Maredia 
2015-10-12 14:28:30 -0400   wip-10587-split-servers
2015-10-13 13:40:41 -0400   wip-infernalis-cmake
2015-10-15 18:39:23 -0400   wip-cmake

Barbora Ančincová 
2015-10-14 12:36:24 +0200   wip-doc-RGW

Boris Ranto 
2015-09-04 15:19:11 +0200   wip-bash-completion

Casey Bodley 
2015-09-28 17:09:11 -0400   wip-cxx14-test
2015-09-29 15:18:17 -0400   wip-fio-objectstore

Dan Mick 
2013-07-16 23:00:06 -0700   wip-5634

Danny Al-Gaaf 
2015-04-23 16:32:00 +0200   wip-da-SCA-20150421
2015-04-23 17:18:57 +0200   wip-nosetests
2015-04-23 18:20:16 +0200   wip-unify-num_objects_degraded
2015-09-28 16:05:12 +0200   wip-da-SCA-20150910

David Zafman 
2014-08-29 10:41:23 -0700   wip-libcommon-rebase
2015-04-24 13:14:23 -0700   wip-cot-giant
2015-08-04 07:39:00 -0700   wip-12577-hammer
2015-09-28 11:33:11 -0700   wip-12983
2015-10-15 12:30:37 -0700   wip-13016
2015-10-18 14:48:01 -0700   wip-zafman-testing

Dongmao Zhang 
2014-11-14 19:14:34 +0800   thesues-master

Greg Farnum 
2015-04-29 21:44:11 -0700   wip-init-names
2015-07-16 09:28:24 -0700   hammer-12297
2015-10-02 13:00:59 -0700   greg-infernalis-lock-testing
2015-10-02 13:09:05 -0700   greg-infernalis-lock-testing-cacher
2015-10-07 00:45:24 -0700   greg-infernalis-fs
2015-10-13 21:41:35 -0700   greg-fs-testing

Greg Farnum 
2014-10-23 13:33:44 -0700   wip-forward-scrub

Guang G Yang 
2015-06-26 20:31:44 +   wip-ec-readall
2015-07-23 16:13:19 +   wip-12316

Guang Yang 
2014-08-08 10:41:12 +   wip-guangyy-pg-splitting
2014-09-25 00:47:46 +   wip-9008
2014-09-30 10:36:39 +   guangyy-wip-9614

Haomai Wang 
2014-07-27 13:37:49 +0800   wip-flush-set
2015-04-20 00:47:59 +0800   update-organization
2015-07-21 19:33:56 +0800   fio-objectstore
2015-08-26 09:57:27 +0800   wip-recovery-attr

Ilya Dryomov 
2014-09-05 16:15:10 +0400   wip-rbd-notify-errors

Ivo Jimenez 
2015-08-24 23:12:45 -0700   hammer-with-new-workunit-for-wip-12551

Jason Dillaman 
2015-07-31 13:55:23 -0400   wip-12383-next
2015-08-31 23:17:53 -0400   wip-12698
2015-09-01 10:17:02 -0400   wip-11287
2015-10-15 00:15:54 -0400   wip-13274-infernalis

Jenkins 
2015-09-30 12:59:03 -0700   rhcs-v0.94.3-ubuntu

Jenkins 
2014-07-29 05:24:39 -0700   wip-nhm-hang
2015-02-02 10:35:28 -0800   wip-sam-v0.92
2015-08-21 12:46:32 -0700   last
2015-08-21 12:46:32 -0700   loic-v9.0.3
2015-09-15 10:23:18 -0700   rhcs-v0.80.8
2015-09-21 16:48:32 -0700   rhcs-v0.94.1-ubuntu

Joao Eduardo Luis 
2014-09-10 09:39:23 +0100   wip-leveldb-get.dumpling

Joao Eduardo Luis 
2014-07-22 15:41:42 +0100   wip-leveldb-misc

Joao Eduardo Luis 
2014-09-02 17:19:52 +0100   wip-leveldb-get
2014-10-17 16:20:11 +0100   wip-paxos-fix
2014-10-21 21:32:46 +0100   wip-9675.dumpling
2015-07-27 21:56:42 +0100   wip-11470.hammer
2015-09-09 15:45:45 +0100   wip-11786.hammer

Joao Eduardo Luis 
2014-11-17 16:43:53 +   wip-mon-osdmap-cleanup
2014-12-15 16:18:56 +   wip-giant-mon-backports
2014-12-17 17:13:57 +   wip-mon-backports.firefly
2014-12-17 23:15:10 +   wip-mon-sync-fix.dumpling
2015-01-07 23:01:00 +   wip-mon-blackhole-mlog-0.87.7
2015-01-10 02:40:42 +   wip-dho-joao
2015-01-10 02:46:31 +   wip-mon-paxos-fix
2015-01-26 13:00:09 +   wip-mon-datahealth-fix
2015-02-04 22:36:14 +   wip-10643
2015-09-09 15:43:51 +0100   wip-11786.firefly

Joao Eduardo Luis 
2015-05-27 23:48:45 +0100   wip-mon-scrub
2015-05-29 12:21:43 +0100   wip-11545
2015-06-05 16:12:57 +0100   

Re: MDS stuck in a crash loop

2015-10-19 Thread Milosz Tanski
On Wed, Oct 14, 2015 at 9:21 AM, John Spray  wrote:
> On Mon, Oct 12, 2015 at 3:36 AM, Milosz Tanski  wrote:
>> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski  wrote:
>>> On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski  wrote:
 On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski  wrote:
> On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski  wrote:
>> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum  
>> wrote:
>>> On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski  
>>> wrote:
 About an hour ago my MDSs (primary and follower) started ping-pong
 crashing with this message. I've spent about 30 minutes looking into
 it but nothing yet.

 This is from a 0.94.3 MDS

>>>
  0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc:
 In function 'virtual void C_IO_SM_Save::finish(int)' thread
 7fd4f52ad700 time 2015-10-11 17:01:23.594089
 mds/SessionMap.cc: 120: FAILED assert(r == 0)
>>>
>>> These "r == 0" asserts pretty much always mean that the MDS did did a
>>> read or write to RADOS (the OSDs) and got an error of some kind back.
>>> (Or in the case of the OSDs, access to the local filesystem returned
>>> an error, etc.) I don't think these writes include any safety checks
>>> which would let the MDS break it which means that probably the OSD is
>>> actually returning an error — odd, but not impossible.
>>>
>>> Notice that the assert happened in thread 7fd4f52ad700, and look for
>>> the stuff in that thread. You should be able to find an OSD op reply
>>> (on the SessionMap object) coming in and reporting an error code.
>>> -Greg
>>
>> I only two error ops in that whole MDS session. Neither one happened
>> on the same thread (7f5ab6000700 in this file). But it looks like the
>> only session map is the -90 "Message too long" one.
>>
>> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v
>> 'ondisk = 0'
>>  -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700  1 --
>> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 
>> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0
>> ondisk = -90 ((90) Message too long)) v6  182+0+0 (2955408122 0 0)
>> 0x3a55d340 con 0x3d5a3c0
>>   -705> 2015-10-11 20:51:11.374132 7f5ab22f4700  1 --
>> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297 
>> osd_op_reply(48004 300.e274 [delete] v0'0 uv1349638 ondisk = -2
>> ((2) No such file or directory)) v6  179+0+0 (1182549251 0 0)
>> 0x66c5c80 con 0x3d5a7e0
>>
>> Any idea what this could be Greg?
>
> To follow this up I found this ticket from 9 months ago:
> http://tracker.ceph.com/issues/10449 In there Yan says:
>
> "it's a kernel bug. hang request prevents mds from trimming
> completed_requests in sessionmap. there is nothing to do with mds.
> (maybe we should add some code to MDS to show warning when this bug
> happens)"
>
> When I was debugging this I saw an OSD (not cephfs client) operation
> stuck for a long time along with the MDS error:
>
> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow
> requests; mds cluster is degraded; mds0: Behind on trimming (709/30)
> 1 ops are blocked > 16777.2 sec
> 1 ops are blocked > 16777.2 sec on osd.28
>
> I did eventually bounce the OSD in question and it hasn't become stuck
> since, but the MDS is still eating it every time with the "Message too
> long" error on the session map.
>
> I'm not quite sure where to go from here.

 First time I had a chance to use the new recover tools. I was able to
 reply the journal, reset it and then reset the sessionmap. MDS
 returned back to life and so far everything looks good. Yay.

 Triggering this a bug/issue is a pretty interesting set of steps.
>>>
>>> Spoke too soon, a missing dir is now causing MDS to restart it self.
>>>
>>> -6> 2015-10-11 22:40:47.300169 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request,
>>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>> -5> 2015-10-11 22:40:47.300208 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request,
>>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>> -4> 2015-10-11 22:40:47.300231 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300231, event: done, op:
>>> client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>> -3> 2015-10-11 22:40:47.300284 

Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread J. Bruce Fields
On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> Hi devs (CC Bruce--here is a use case for vmci sockets transport)
> 
> One of Sage's possible plans for Manilla integration would use nfs over the 
> new Linux  vmci sockets transport integration in qemu (below) to access 
> Cephfs via an nfs-ganesha server running in the host vm.

What does "the host vm" mean, and why is this a particularly useful
configuration?

--b.

> 
> This now experimentally works.
> 
> some notes on running nfs-ganesha over AF_VSOCK:
> 
> 1. need stefan hajnoczi's patches for
> * linux kernel (and build w/vhost-vsock support
> * qemu (and build w/vhost-vsock support)
> * nfs-utils (in vm guest)
> 
> all linked from https://github.com/stefanha?tab=repositories
> 
> 2. host and vm guest kernels must include vhost-vsock
> * host kernel should load vhost-vsock.ko
> 
> 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci device, 
> e.g
> 
> /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1 
> --enable-kvm -drive 
> file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive 
> file=/opt/isos/f22.iso,media=cdrom -net 
> nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 
> -parallel none -serial mon:stdio -device 
> vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4  -boot c
> 
> 4. nfs-gansha (in host)
> * need nfs-ganesha and its ntirpc rpc provider with vsock support
> https://github.com/linuxbox2/nfs-ganesha (vsock branch)
> https://github.com/linuxbox2/ntirpc (vsock branch)
> 
> * configure ganesha w/vsock support
> cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK 
> -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
> 
> in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block
> 
> 5. mount in guest w/nfs41:
> (e.g., in fstab)
> 2:// /vsock41 nfs 
> noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576
>  0 0
> 
> If you try this, send feedback.
> 
> Thanks!
> 
> Matt
> 
> -- 
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
> 
> http://www.redhat.com/en/technologies/storage
> 
> tel.  734-707-0660
> fax.  734-769-8938
> cel.  734-216-5309
> 
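
For readers puzzled by the vsock addressing in the notes above (the "2://"
server address in fstab and guest-cid=4 on the qemu command line), here is a
minimal C++ sketch of an AF_VSOCK client connection using the Linux vm_sockets
API. It is an illustration only, not ganesha or ntirpc code, and the port
number is an arbitrary assumption:

// vsock_connect.cc: minimal illustration of AF_VSOCK addressing.
// CID 2 (VMADDR_CID_HOST) is the hypervisor side, which is why the guest
// mounts "2://"; the guest itself was assigned guest-cid=4 on the qemu
// command line.  The port below is arbitrary, chosen only for the example.
#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

int main() {
  int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
  if (fd < 0) {
    std::perror("socket(AF_VSOCK)");
    return 1;
  }

  struct sockaddr_vm addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.svm_family = AF_VSOCK;
  addr.svm_cid = VMADDR_CID_HOST;   // CID 2: the vm host running ganesha
  addr.svm_port = 2049;             // illustrative port only

  if (connect(fd, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) < 0) {
    std::perror("connect");
    close(fd);
    return 1;
  }
  std::printf("connected to host CID %u\n", addr.svm_cid);
  close(fd);
  return 0;
}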


Re: MDS stuck in a crash loop

2015-10-19 Thread Milosz Tanski
On Wed, Oct 14, 2015 at 12:46 AM, Gregory Farnum  wrote:
> On Sun, Oct 11, 2015 at 7:36 PM, Milosz Tanski  wrote:
>> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski  wrote:
>>> On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski  wrote:
 On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski  wrote:
> On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski  wrote:
>> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum  
>> wrote:
>>> On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski  
>>> wrote:
 About an hour ago my MDSs (primary and follower) started ping-pong
 crashing with this message. I've spent about 30 minutes looking into
 it but nothing yet.

 This is from a 0.94.3 MDS

>>>
  0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc:
 In function 'virtual void C_IO_SM_Save::finish(int)' thread
 7fd4f52ad700 time 2015-10-11 17:01:23.594089
 mds/SessionMap.cc: 120: FAILED assert(r == 0)
>>>
>>> These "r == 0" asserts pretty much always mean that the MDS did did a
>>> read or write to RADOS (the OSDs) and got an error of some kind back.
>>> (Or in the case of the OSDs, access to the local filesystem returned
>>> an error, etc.) I don't think these writes include any safety checks
>>> which would let the MDS break it which means that probably the OSD is
>>> actually returning an error — odd, but not impossible.
>>>
>>> Notice that the assert happened in thread 7fd4f52ad700, and look for
>>> the stuff in that thread. You should be able to find an OSD op reply
>>> (on the SessionMap object) coming in and reporting an error code.
>>> -Greg
>>
>> I only two error ops in that whole MDS session. Neither one happened
>> on the same thread (7f5ab6000700 in this file). But it looks like the
>> only session map is the -90 "Message too long" one.
>>
>> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v
>> 'ondisk = 0'
>>  -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700  1 --
>> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 
>> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0
>> ondisk = -90 ((90) Message too long)) v6  182+0+0 (2955408122 0 0)
>> 0x3a55d340 con 0x3d5a3c0
>>   -705> 2015-10-11 20:51:11.374132 7f5ab22f4700  1 --
>> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297 
>> osd_op_reply(48004 300.e274 [delete] v0'0 uv1349638 ondisk = -2
>> ((2) No such file or directory)) v6  179+0+0 (1182549251 0 0)
>> 0x66c5c80 con 0x3d5a7e0
>>
>> Any idea what this could be Greg?
>
> To follow this up I found this ticket from 9 months ago:
> http://tracker.ceph.com/issues/10449 In there Yan says:
>
> "it's a kernel bug. hang request prevents mds from trimming
> completed_requests in sessionmap. there is nothing to do with mds.
> (maybe we should add some code to MDS to show warning when this bug
> happens)"
>
> When I was debugging this I saw an OSD (not cephfs client) operation
> stuck for a long time along with the MDS error:
>
> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow
> requests; mds cluster is degraded; mds0: Behind on trimming (709/30)
> 1 ops are blocked > 16777.2 sec
> 1 ops are blocked > 16777.2 sec on osd.28
>
> I did eventually bounce the OSD in question and it hasn't become stuck
> since, but the MDS is still eating it every time with the "Message too
> long" error on the session map.
>
> I'm not quite sure where to go from here.

 First time I had a chance to use the new recover tools. I was able to
 reply the journal, reset it and then reset the sessionmap. MDS
 returned back to life and so far everything looks good. Yay.

 Triggering this a bug/issue is a pretty interesting set of steps.
>>>
>>> Spoke too soon, a missing dir is now causing MDS to restart it self.
>>>
>>> -6> 2015-10-11 22:40:47.300169 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request,
>>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>> -5> 2015-10-11 22:40:47.300208 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request,
>>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>> -4> 2015-10-11 22:40:47.300231 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300231, event: done, op:
>>> client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>> -3> 2015-10-11 22:40:47.300284 

Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread Matt Benjamin
Hi Bruce,

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
> From: "J. Bruce Fields" 
> To: "Matt Benjamin" 
> Cc: "Ceph Development" , "Stefan Hajnoczi" 
> , "Sage Weil"
> 
> Sent: Monday, October 19, 2015 11:58:45 AM
> Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha)
> 
> On Mon, Oct 19, 2015 at 11:49:15AM -0400, Matt Benjamin wrote:
> > - Original Message -
> > > From: "J. Bruce Fields" 
> ...
> > > 
> > > On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> > > > Hi devs (CC Bruce--here is a use case for vmci sockets transport)
> > > > 
> > > > One of Sage's possible plans for Manilla integration would use nfs over
> > > > the
> > > > new Linux  vmci sockets transport integration in qemu (below) to access
> > > > Cephfs via an nfs-ganesha server running in the host vm.
> > > 
> > > What does "the host vm" mean, and why is this a particularly useful
> > > configuration?
> > 
> > Sorry, I should say, "the vm host."
> 
> Got it, thanks!
> 
> > I think the claimed utility here is (at least) three-fold:
> > 
> > 1. simplified configuration on host and guests
> > 2. some claim to improved security through isolation
> 
> So why is it especially interesting to put Ceph inside the VM and
> Ganesha outside?

Oh, sorry.  Here Ceph (or Gluster, or whatever underlying FS provider) is
conceptually outside the vm complex altogether, Ganesha is re-exporting on the
vm host, and guests access the namespace using NFS (v4.1).

Regards,

Matt

> 
> > 3. some expectation of improved latency/performance wrt TCP
> > 
> > Stefan sent a link to a set of slides with his original patches.  Did you
> > get a chance to read through those?
> > 
> > [1]
> > http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf
> 
> Yep, thanks.--b.
> 
> > 
> > Regards,
> > 
> > Matt
> > 
> > > 
> > > --b.
> > > 
> > > > 
> > > > This now experimentally works.
> > > > 
> > > > some notes on running nfs-ganesha over AF_VSOCK:
> > > > 
> > > > 1. need stefan hajnoczi's patches for
> > > > * linux kernel (and build w/vhost-vsock support
> > > > * qemu (and build w/vhost-vsock support)
> > > > * nfs-utils (in vm guest)
> > > > 
> > > > all linked from https://github.com/stefanha?tab=repositories
> > > > 
> > > > 2. host and vm guest kernels must include vhost-vsock
> > > > * host kernel should load vhost-vsock.ko
> > > > 
> > > > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci
> > > > device, e.g
> > > > 
> > > > /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1
> > > > --enable-kvm -drive
> > > > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive
> > > > file=/opt/isos/f22.iso,media=cdrom -net
> > > > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0
> > > > -parallel none -serial mon:stdio -device
> > > > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4  -boot c
> > > > 
> > > > 4. nfs-gansha (in host)
> > > > * need nfs-ganesha and its ntirpc rpc provider with vsock support
> > > > https://github.com/linuxbox2/nfs-ganesha (vsock branch)
> > > > https://github.com/linuxbox2/ntirpc (vsock branch)
> > > > 
> > > > * configure ganesha w/vsock support
> > > > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON
> > > > -DUSE_VSOCK
> > > > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
> > > > 
> > > > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block
> > > > 
> > > > 5. mount in guest w/nfs41:
> > > > (e.g., in fstab)
> > > > 2:// /vsock41 nfs
> > > > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576
> > > > 0 0
> > > > 
> > > > If you try this, send feedback.
> > > > 
> > > > Thanks!
> > > > 
> > > > Matt
> > > > 
> > > > --
> > > > Matt Benjamin
> > > > Red Hat, Inc.
> > > > 315 West Huron Street, Suite 140A
> > > > Ann Arbor, Michigan 48103
> > > > 
> > > > http://www.redhat.com/en/technologies/storage
> > > > 
> > > > tel.  734-707-0660
> > > > fax.  734-769-8938
> > > > cel.  734-216-5309
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread John Spray
On Fri, Oct 16, 2015 at 10:08 PM, Matt Benjamin  wrote:
> Hi devs (CC Bruce--here is a use case for vmci sockets transport)
>
> One of Sage's possible plans for Manilla integration would use nfs over the 
> new Linux  vmci sockets transport integration in qemu (below) to access 
> Cephfs via an nfs-ganesha server running in the host vm.
>
> This now experimentally works.

Very cool!  Thank you for the detailed instructions, I look forward to
trying this out soon.

John

> some notes on running nfs-ganesha over AF_VSOCK:
>
> 1. need stefan hajnoczi's patches for
> * linux kernel (and build w/vhost-vsock support
> * qemu (and build w/vhost-vsock support)
> * nfs-utils (in vm guest)
>
> all linked from https://github.com/stefanha?tab=repositories
>
> 2. host and vm guest kernels must include vhost-vsock
> * host kernel should load vhost-vsock.ko
>
> 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci device, 
> e.g
>
> /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1 
> --enable-kvm -drive 
> file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive 
> file=/opt/isos/f22.iso,media=cdrom -net 
> nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0 
> -parallel none -serial mon:stdio -device 
> vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4  -boot c
>
> 4. nfs-gansha (in host)
> * need nfs-ganesha and its ntirpc rpc provider with vsock support
> https://github.com/linuxbox2/nfs-ganesha (vsock branch)
> https://github.com/linuxbox2/ntirpc (vsock branch)
>
> * configure ganesha w/vsock support
> cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK 
> -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
>
> in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block
>
> 5. mount in guest w/nfs41:
> (e.g., in fstab)
> 2:// /vsock41 nfs 
> noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576
>  0 0
>
> If you try this, send feedback.
>
> Thanks!
>
> Matt
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-707-0660
> fax.  734-769-8938
> cel.  734-216-5309
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread Matt Benjamin
Hi Bruce,

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309

- Original Message -
> From: "J. Bruce Fields" 
> To: "Matt Benjamin" 
> Cc: "Ceph Development" , "Stefan Hajnoczi" 
> , "Sage Weil"
> 
> Sent: Monday, October 19, 2015 11:13:52 AM
> Subject: Re: nfsv41 over AF_VSOCK (nfs-ganesha)
> 
> On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> > Hi devs (CC Bruce--here is a use case for vmci sockets transport)
> > 
> > One of Sage's possible plans for Manilla integration would use nfs over the
> > new Linux  vmci sockets transport integration in qemu (below) to access
> > Cephfs via an nfs-ganesha server running in the host vm.
> 
> What does "the host vm" mean, and why is this a particularly useful
> configuration?

Sorry, I should say, "the vm host."

I think the claimed utility here is (at least) three-fold:

1. simplified configuration on host and guests
2. some claim to improved security through isolation
3. some expectation of improved latency/performance wrt TCP

Stefan sent a link to a set of slides with his original patches.  Did you get a 
chance to read through those?

[1] 
http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf

Regards,

Matt

> 
> --b.
> 
> > 
> > This now experimentally works.
> > 
> > some notes on running nfs-ganesha over AF_VSOCK:
> > 
> > 1. need stefan hajnoczi's patches for
> > * linux kernel (and build w/vhost-vsock support
> > * qemu (and build w/vhost-vsock support)
> > * nfs-utils (in vm guest)
> > 
> > all linked from https://github.com/stefanha?tab=repositories
> > 
> > 2. host and vm guest kernels must include vhost-vsock
> > * host kernel should load vhost-vsock.ko
> > 
> > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci
> > device, e.g
> > 
> > /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1
> > --enable-kvm -drive
> > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive
> > file=/opt/isos/f22.iso,media=cdrom -net
> > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0
> > -parallel none -serial mon:stdio -device
> > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4  -boot c
> > 
> > 4. nfs-gansha (in host)
> > * need nfs-ganesha and its ntirpc rpc provider with vsock support
> > https://github.com/linuxbox2/nfs-ganesha (vsock branch)
> > https://github.com/linuxbox2/ntirpc (vsock branch)
> > 
> > * configure ganesha w/vsock support
> > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON -DUSE_VSOCK
> > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
> > 
> > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block
> > 
> > 5. mount in guest w/nfs41:
> > (e.g., in fstab)
> > 2:// /vsock41 nfs
> > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576
> > 0 0
> > 
> > If you try this, send feedback.
> > 
> > Thanks!
> > 
> > Matt
> > 
> > --
> > Matt Benjamin
> > Red Hat, Inc.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> > 
> > http://www.redhat.com/en/technologies/storage
> > 
> > tel.  734-707-0660
> > fax.  734-769-8938
> > cel.  734-216-5309
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


Re: nfsv41 over AF_VSOCK (nfs-ganesha)

2015-10-19 Thread J. Bruce Fields
On Mon, Oct 19, 2015 at 11:49:15AM -0400, Matt Benjamin wrote:
> - Original Message -
> > From: "J. Bruce Fields" 
...
> > 
> > On Fri, Oct 16, 2015 at 05:08:17PM -0400, Matt Benjamin wrote:
> > > Hi devs (CC Bruce--here is a use case for vmci sockets transport)
> > > 
> > > One of Sage's possible plans for Manilla integration would use nfs over 
> > > the
> > > new Linux  vmci sockets transport integration in qemu (below) to access
> > > Cephfs via an nfs-ganesha server running in the host vm.
> > 
> > What does "the host vm" mean, and why is this a particularly useful
> > configuration?
> 
> Sorry, I should say, "the vm host."

Got it, thanks!

> I think the claimed utility here is (at least) three-fold:
> 
> 1. simplified configuration on host and guests
> 2. some claim to improved security through isolation

So why is it especially interesting to put Ceph inside the VM and
Ganesha outside?

> 3. some expectation of improved latency/performance wrt TCP
> 
> Stefan sent a link to a set of slides with his original patches.  Did you get 
> a chance to read through those?
> 
> [1] 
> http://events.linuxfoundation.org/sites/events/files/slides/stefanha-kvm-forum-2015.pdf

Yep, thanks.--b.

> 
> Regards,
> 
> Matt
> 
> > 
> > --b.
> > 
> > > 
> > > This now experimentally works.
> > > 
> > > some notes on running nfs-ganesha over AF_VSOCK:
> > > 
> > > 1. need stefan hajnoczi's patches for
> > > * linux kernel (and build w/vhost-vsock support
> > > * qemu (and build w/vhost-vsock support)
> > > * nfs-utils (in vm guest)
> > > 
> > > all linked from https://github.com/stefanha?tab=repositories
> > > 
> > > 2. host and vm guest kernels must include vhost-vsock
> > > * host kernel should load vhost-vsock.ko
> > > 
> > > 3. start a qemu(-kvm) guest (w/patched kernel) with a vhost-vsock-pci
> > > device, e.g
> > > 
> > > /opt/qemu-vsock/bin/qemu-system-x86_64-m 2048 -usb -name vsock1
> > > --enable-kvm -drive
> > > file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2 -drive
> > > file=/opt/isos/f22.iso,media=cdrom -net
> > > nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0
> > > -parallel none -serial mon:stdio -device
> > > vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4  -boot c
> > > 
> > > 4. nfs-gansha (in host)
> > > * need nfs-ganesha and its ntirpc rpc provider with vsock support
> > > https://github.com/linuxbox2/nfs-ganesha (vsock branch)
> > > https://github.com/linuxbox2/ntirpc (vsock branch)
> > > 
> > > * configure ganesha w/vsock support
> > > cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON 
> > > -DUSE_VSOCK
> > > -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
> > > 
> > > in ganesha.conf, add "nfsvsock" to Protocols list in EXPORT block
> > > 
> > > 5. mount in guest w/nfs41:
> > > (e.g., in fstab)
> > > 2:// /vsock41 nfs
> > > noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576
> > > 0 0
> > > 
> > > If you try this, send feedback.
> > > 
> > > Thanks!
> > > 
> > > Matt
> > > 
> > > --
> > > Matt Benjamin
> > > Red Hat, Inc.
> > > 315 West Huron Street, Suite 140A
> > > Ann Arbor, Michigan 48103
> > > 
> > > http://www.redhat.com/en/technologies/storage
> > > 
> > > tel.  734-707-0660
> > > fax.  734-769-8938
> > > cel.  734-216-5309
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 


RE: newstore direction

2015-10-19 Thread Somnath Roy
Sage,
I fully support that.  If we want to saturate SSDs, we need to get rid of this
filesystem overhead (which I am in the process of measuring).
Also, it would be good if we could eliminate the dependency on the k/v dbs (for
storing the allocator state and so on). The reason is the unknown write
amplification they cause.

Thanks & Regards
Somnath


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 12:49 PM
To: ceph-devel@vger.kernel.org
Subject: newstore direction

The current design is based on two simple ideas:

 1) a key/value interface is a better way to manage all of our internal metadata 
(object metadata, attrs, layout, collection membership, write-ahead logging, 
overlay data, etc.)

 2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
things:

 - We currently write the data to the file, fsync, then commit the kv 
transaction.  That's at least 3 IOs: one for the data, one for the fs journal, 
one for the kv txn to commit (at least once my rocksdb changes land... the kv 
commit is currently 2-3).  So two people are managing metadata, here: the fs 
managing the file metadata (with its own
journal) and the kv backend (with its journal).

 - On read we have to open files by name, which means traversing the fs 
namespace.  Newstore tries to keep it as flat and simple as possible, but at a 
minimum it is a couple btree lookups.  We'd love to use open by handle (which 
would reduce this to 1 btree traversal), but running the daemon as ceph and not 
root makes that hard...

 - ...and file systems insist on updating mtime on writes, even when it is an 
overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME 
patches exist but it is hard to get these past the kernel brainfreeze.

 - XFS is (probably) never going to give us data checksums, which we want 
desperately.

But what's the alternative?  My thought is to just bite the bullet and consume 
a raw block device directly.  Write an allocator, hopefully keep it pretty 
simple, and manage it in kv store along with all of our other metadata.

Wins:

 - 2 IOs for most: one to write the data to unused space in the block device, 
one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io 
to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ 
before).

 - No concern about mtime getting in the way

 - Faster reads (no fs lookup)

 - Similarly sized metadata for most objects.  If we assume most objects are 
not fragmented, then the metadata to store the block offsets is about the same 
size as the metadata to store the filenames we have now.

Problems:

 - We have to size the kv backend storage (probably still an XFS
partition) vs the block storage.  Maybe we do this anyway (put metadata on
SSD!) so it won't matter.  But what happens when we are storing gobs of rgw 
index data or cephfs metadata?  Suddenly we are pulling storage out of a 
different pool and those aren't currently fungible.

 - We have to write and maintain an allocator (a toy sketch follows below).  I'm 
still optimistic this can be reasonably simple, especially for the flash case 
(where fragmentation isn't such an issue as long as our blocks are reasonably 
sized).  For disk we may need to be moderately clever.

 - We'll need a fsck to ensure our internal metadata is consistent.  The good 
news is it'll just need to validate what we have stored in the kv store.

Other thoughts:

 - We might want to consider whether dm-thin or bcache or other block layers 
might help us with elasticity of file vs block areas.

 - Rocksdb can push colder data to a second directory, so we could have a fast 
ssd primary area (for wal and most metadata) and a second hdd directory for 
stuff it has to push off.  Then have a conservative amount of file space on the 
hdd.  If our block fills up, use the existing file mechanism to put data there 
too.  (But then we have to maintain both the current kv + file approach and not 
go all-in on kv + block.)

Thoughts?
sage
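
As a strawman for the "write and maintain an allocator" point above, a toy C++
sketch of a first-fit extent allocator follows. It is purely illustrative: the
class and names are invented here, and in a real backend the free list would be
persisted through the kv store so that allocations commit atomically with the
rest of the metadata.

// extent_alloc.cc: toy first-fit extent allocator, illustration only.
// The free list is an offset-sorted map of (offset -> length); adjacent
// free extents are merged on release.
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>

class ExtentAllocator {
 public:
  explicit ExtentAllocator(uint64_t device_size) {
    free_[0] = device_size;                        // one big free extent
  }

  // First-fit allocate; returns true and sets *offset on success.
  bool allocate(uint64_t len, uint64_t* offset) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second < len)
        continue;
      *offset = it->first;
      uint64_t tail_off = it->first + len;
      uint64_t tail_len = it->second - len;
      free_.erase(it);
      if (tail_len)
        free_[tail_off] = tail_len;                // keep the unused tail
      return true;
    }
    return false;                                  // no extent large enough
  }

  // Return an extent to the free list, merging with its neighbours.
  void release(uint64_t offset, uint64_t len) {
    auto next = free_.lower_bound(offset);
    if (next != free_.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == offset) {  // merge with previous
        offset = prev->first;
        len += prev->second;
        free_.erase(prev);
      }
    }
    if (next != free_.end() && offset + len == next->first) {
      len += next->second;                         // merge with next
      free_.erase(next);
    }
    free_[offset] = len;
  }

 private:
  std::map<uint64_t, uint64_t> free_;              // offset -> length
};

int main() {
  ExtentAllocator alloc(1ull << 30);               // pretend 1 GiB device
  uint64_t a = 0, b = 0;
  alloc.allocate(4 << 20, &a);                     // 4 MiB of object data
  alloc.allocate(64 << 10, &b);                    // 64 KiB wal-ish blob
  std::cout << "a=" << a << " b=" << b << "\n";
  alloc.release(a, 4 << 20);                       // object deleted
  return 0;
}

This ignores alignment, per-object extent tracking and crash consistency; the
point is only that the extent-map-plus-merge core really can stay small.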




Re: newstore direction

2015-10-19 Thread Robert LeBlanc

I think there is a lot that can be gained by Ceph managing a raw block
device. As I mentioned on ceph-users, I've given this some thought, and
a lot of optimizations could be done that are conducive to storing
objects. I hadn't thought, however, of bypassing VFS altogether by opening
the raw device directly, but this would make things simpler, as you
don't have to program things for VFS that don't make sense.

Some of my thoughts were to employ a hashing algorithm for inode
lookup (CRUSH-like). Is there a good use case for listing a directory?
We may need to keep a list for deletion, but there may be a better way
to handle this. Is there a need to do snapshots at the block layer if
operations can be atomic? Is there a real advantage to an allocation
unit as small as 4K, or does it make sense to use something like
512K?

I'm interested in how this might pan out.
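
To illustrate the "hashing algorithm for inode lookup (CRUSH-like)" idea
combined with a fixed allocation unit such as 512K, here is a hedged C++ sketch
that computes an object's slot on the raw device directly from its name, so no
per-object directory entry is needed for lookup. Everything in it (the hash,
the slot size, the linear probing) is an invented placeholder, not a proposal:

// hashed_lookup.cc: illustration of computing an object's location on a
// raw device from its name instead of consulting a directory.
// std::hash is not stable across builds, so a real design would use
// rjenkins or a similarly fixed hash; collisions would need probing or a
// small kv-backed overflow table.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

static const uint64_t kSlotSize  = 512 * 1024;   // 512K allocation unit
static const uint64_t kSlotCount = 1u << 20;     // ~512 GiB worth of slots

uint64_t slot_offset(const std::string& object, uint64_t attempt = 0) {
  uint64_t h = std::hash<std::string>{}(object) + attempt * 0x9e3779b97f4a7c15ull;
  return (h % kSlotCount) * kSlotSize;
}

int main() {
  std::cout << "rbd_data.1234.0000 -> offset "
            << slot_offset("rbd_data.1234.0000") << "\n";
  std::cout << "same object, probe 1 -> offset "
            << slot_offset("rbd_data.1234.0000", 1) << "\n";
  return 0;
}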

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Oct 19, 2015 at 1:49 PM, Sage Weil  wrote:
> The current design is based on two simple ideas:
>
>  1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>  2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two people are managing
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
>
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
>
>  - ...and file systems insist on updating mtime on writes, even when it is
> a overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.
>
> Wins:
>
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).
>
>  - No concern about mtime getting in the way
>
>  - Faster reads (no fs lookup)
>
>  - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>  - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonbly simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonbly
> sized).  For disk we may beed to be moderately clever.
>
>  - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate 

RE: newstore direction

2015-10-19 Thread Sage Weil
On Mon, 19 Oct 2015, Somnath Roy wrote:
> Sage,
> I fully support that.  If we want to saturate SSDs , we need to get rid 
> of this filesystem overhead (which I am in process of measuring). Also, 
> it will be good if we can eliminate the dependency on the k/v dbs (for 
> storing allocators and all). The reason is the unknown write amps they 
> causes.

My hope is to keep behind the KeyValueDB interface (and/or change it as 
appropriate) so that other backends can be easily swapped in (e.g. a 
btree-based one for high-end flash).

sage
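
To show what staying behind a key/value abstraction buys, here is a
deliberately simplified C++ sketch of such an interface with a toy in-memory
backend. This is not Ceph's actual KeyValueDB class (whose API differs); it
only illustrates the shape that lets a rocksdb-backed or btree-based
implementation be swapped in underneath the object store logic:

// kv_interface.cc: simplified sketch, not Ceph's real KeyValueDB API.
// The object store code talks only to KVStore, so the backend (rocksdb,
// a btree for high-end flash, ...) can be swapped without touching it.
#include <cerrno>
#include <map>
#include <memory>
#include <string>
#include <vector>

class KVStore {
 public:
  virtual ~KVStore() {}

  // A transaction batches updates so they commit atomically.
  class Transaction {
   public:
    virtual ~Transaction() {}
    virtual void set(const std::string& prefix, const std::string& key,
                     const std::string& value) = 0;
    virtual void rmkey(const std::string& prefix, const std::string& key) = 0;
  };

  virtual std::unique_ptr<Transaction> new_transaction() = 0;
  virtual int submit(Transaction& t) = 0;                 // 0 on success
  virtual int get(const std::string& prefix, const std::string& key,
                  std::string* value) = 0;                // 0 or -ENOENT
};

// Trivial in-memory backend, standing in for a real implementation.
class MemStore : public KVStore {
  struct MemTxn : Transaction {
    std::map<std::string, std::string> sets;
    std::vector<std::string> rms;
    void set(const std::string& p, const std::string& k,
             const std::string& v) override { sets[p + "/" + k] = v; }
    void rmkey(const std::string& p, const std::string& k) override {
      rms.push_back(p + "/" + k);
    }
  };
  std::map<std::string, std::string> data_;

 public:
  std::unique_ptr<Transaction> new_transaction() override {
    return std::unique_ptr<Transaction>(new MemTxn);
  }
  int submit(Transaction& t) override {
    MemTxn& m = static_cast<MemTxn&>(t);
    for (auto& kv : m.sets) data_[kv.first] = kv.second;  // apply writes
    for (auto& k : m.rms) data_.erase(k);                 // apply deletes
    return 0;
  }
  int get(const std::string& prefix, const std::string& key,
          std::string* value) override {
    auto it = data_.find(prefix + "/" + key);
    if (it == data_.end()) return -ENOENT;
    *value = it->second;
    return 0;
  }
};

int main() {
  MemStore db;
  auto t = db.new_transaction();
  t->set("alloc", "0x0000", "len=4194304");
  db.submit(*t);
  std::string v;
  return db.get("alloc", "0x0000", &v);  // 0 if the key round-tripped
}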


> 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 12:49 PM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, write-ahead 
> logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb changes 
> land... the kv commit is currently 2-3).  So two people are managing 
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but at 
> a minimum it is a couple btree lookups.  We'd love to use open by handle 
> (which would reduce this to 1 btree traversal), but running the daemon as 
> ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a 
> overwrite with no allocation changes.  (We don't care about mtime.) 
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we 
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep it 
> pretty simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device, 
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one 
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ 
> before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are 
> not fragmented, then the metadata to store the block offsets is about the 
> same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw 
> index data or cephfs metadata?  Suddenly we are pulling storage out of a 
> different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this can 
> be reasonbly simple, especially for the flash case (where fragmentation isn't 
> such an issue as long as our blocks are reasonbly sized).  For disk we may 
> beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The good 
> news is it'll just need to validate what we have stored in the kv store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block layers 
> might help us with elasticity of file vs block areas.
> 
>  - Rocksdb can push colder data to a second directory, so we could have a 
> fast ssd primary area (for wal and most metadata) and a second hdd directory 
> for stuff it has to push off.  Then have a conservative amount of file space 
> on the hdd.  If our block fills up, use the existing file mechanism to put 
> data there too.  (But then we have to maintain both the current kv + file 
> approach and not go all-in on kv + block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
> body of a message to majord...@vger.kernel.org More majordomo info at  
> http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this 

v0.94.4 Hammer released

2015-10-19 Thread Sage Weil
This Hammer point release fixes several important bugs in Hammer, as well as
fixing interoperability issues that must be addressed before an upgrade to
Infernalis. That is, all users of earlier versions of Hammer or any
version of Firefly will first need to upgrade to Hammer v0.94.4 or
later before upgrading to Infernalis (or future releases).

All v0.94.x Hammer users are strongly encouraged to upgrade.

Changes
---

* build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong (#12166, Nathan 
Cutler)
* build/ops: ceph.spec.in: ceph-common needs python-argparse on older distros, 
but doesn't require it (#12034, Nathan Cutler)
* build/ops: ceph.spec.in: radosgw requires apache for SUSE only -- makes no 
sense (#12358, Nathan Cutler)
* build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized (#11991, 
Nathan Cutler)
* build/ops: ceph.spec.in: rpm: not possible to turn off Java (#11992, Owen 
Synge)
* build/ops: ceph.spec.in: running fdupes unnecessarily (#12301, Nathan Cutler)
* build/ops: ceph.spec.in: snappy-devel for all supported distros (#12361, 
Nathan Cutler)
* build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel (#11629, 
Nathan Cutler)
* build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build (#12351, 
Nathan Cutler)
* build/ops: error in ext_mime_map_init() when /etc/mime.types is missing 
(#11864, Ken Dreyer)
* build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s) 
(#11798, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances 
running (#10927, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances 
running (#11140, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances 
running (#11686, Sage Weil)
* build/ops: With root as default user, unable to have multiple RGW instances 
running (#12407, Sage Weil)
* cli: ceph: cli throws exception on unrecognized errno (#11354, Kefu Chai)
* cli: ceph tell: broken error message / misleading hinting (#11101, Kefu Chai)
* common: arm: all programs that link to librados2 hang forever on startup 
(#12505, Boris Ranto)
* common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
* common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer objects (#13070, 
Sage Weil)
* common: do not insert empty ptr when rebuilding an empty bufferlist (#12775, Xinze 
Chi)
* common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
* common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
* common: Memory leak in Mutex.cc, pthread_mutexattr_init without 
pthread_mutexattr_destroy (#11762, Ketor Meng)
* common: object_map_update fails with -EINVAL return code (#12611, Jason 
Dillaman)
* common: Pipe: Drop connect_seq increase line (#13093, Haomai Wang)
* common: recursive lock of md_config_t (0) (#12614, Josh Durgin)
* crush: ceph osd crush reweight-subtree does not reweight parent node (#11855, 
Sage Weil)
* doc: update docs to point to download.ceph.com (#13162, Alfredo Deza)
* fs: ceph-fuse 0.94.2-1trusty segfaults / aborts (#12297, Greg Farnum)
* fs: segfault launching ceph-fuse with bad --name (#12417, John Spray)
* librados: Change radosgw pools default crush ruleset (#11640, Yuan Zhou)
* librbd: correct issues discovered via lockdep / helgrind (#12345, Jason 
Dillaman)
* librbd: Crash during TestInternal.MultipleResize (#12664, Jason Dillaman)
* librbd: deadlock during cooperative exclusive lock transition (#11537, Jason 
Dillaman)
* librbd: Possible crash while concurrently writing and shrinking an image 
(#11743, Jason Dillaman)
* mon: add a cache layer over MonitorDBStore (#12638, Kefu Chai)
* mon: fix crush testing for new pools (#13400, Sage Weil)
* mon: get pools health'info have error (#12402, renhwztetecs)
* mon: implicit erasure code crush ruleset is not validated (#11814, Loic 
Dachary)
* mon: PaxosService: call post_refresh() instead of post_paxos_update() 
(#11470, Joao Eduardo Luis)
* mon: pgmonitor: wrong 'at/near target max' reporting (#12401, huangjun)
* mon: register_new_pgs() should check ruleno instead of its index (#12210, 
Xinze Chi)
* mon: Show osd as NONE in ceph osd map output (#11820, 
Shylesh Kumar)
* mon: the output is wrong when running ceph osd reweight (#12251, Joao Eduardo 
Luis)
* osd: allow peek_map_epoch to return an error (#13060, Sage Weil)
* osd: cache agent is idle although one object is left in the cache (#12673, 
Loic Dachary)
* osd: copy-from doesn't preserve truncate_{seq,size} (#12551, Samuel Just)
* osd: crash creating/deleting pools (#12429, John Spray)
* osd: fix repair when recorded digest is wrong (#12577, Sage Weil)
* osd: include/ceph_features: define HAMMER_0_94_4 feature (#13026, Sage Weil)
* osd: is_new_interval() fixes (#10399, Jason Dillaman)
* osd: is_new_interval() fixes (#11771, Jason Dillaman)
* osd: long standing slow requests: 
connection->session->waiting_for_map->connection ref cycle (#12338, Samuel Just)
* osd: Mutex Assert from 

Re: newstore direction

2015-10-19 Thread Wido den Hollander
On 10/19/2015 09:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, 
> write-ahead logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few 
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb changes 
> land... the kv commit is currently 2-3).  So two people are managing 
> metadata, here: the fs managing the file metadata (with its own 
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but 
> at a minimum it is a couple btree lookups.  We'd love to use open by 
> handle (which would reduce this to 1 btree traversal), but running 
> the daemon as ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is 
> a overwrite with no allocation changes.  (We don't care about mtime.)  
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we 
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep 
> it pretty simple, and manage it in kv store along with all of our other 
> metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block 
> device, one to commit our transaction (vs 4+ before).  For overwrites, 
> we'd have one io to do our write-ahead log (kv journal), then do 
> the overwrite async (vs 4+ before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects 
> are not fragmented, then the metadata to store the block offsets is about 
> the same size as the metadata to store the filenames we have now. 
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS 
> partition) vs the block storage.  Maybe we do this anyway (put metadata on 
> SSD!) so it won't matter.  But what happens when we are storing gobs of 
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of 
> a different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this 
> can be reasonbly simple, especially for the flash case (where 
> fragmentation isn't such an issue as long as our blocks are reasonbly 
> sized).  For disk we may beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The 
> good news is it'll just need to validate what we have stored in the kv 
> store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block 
> layers might help us with elasticity of file vs block areas.
> 

I've been using bcache for a while now in production and that helped a lot.

Intel SSDs with GPT. First few partitions as Journals and then one big
partition for bcache.

/dev/bcache0   2.8T  264G  2.5T  10% /var/lib/ceph/osd/ceph-60
/dev/bcache1   2.8T  317G  2.5T  12% /var/lib/ceph/osd/ceph-61
/dev/bcache2   2.8T  303G  2.5T  11% /var/lib/ceph/osd/ceph-62
/dev/bcache3   2.8T  316G  2.5T  12% /var/lib/ceph/osd/ceph-63
/dev/bcache4   2.8T  167G  2.6T   6% /var/lib/ceph/osd/ceph-64
/dev/bcache5   2.8T  295G  2.5T  11% /var/lib/ceph/osd/ceph-65

The maintainers from bcache also presented bcachefs:
https://lkml.org/lkml/2015/8/21/22

"checksumming, compression: currently only zlib is supported for
compression, and for checksumming there's crc32c and a 64 bit checksum."

Wouldn't that be something that could be leveraged? Consuming a raw
block device directly seems like re-inventing the wheel to me. I might be
wrong though.

I have no idea how stable bcachefs is, but it might be worth looking in to.

>  - Rocksdb can push colder data to a second directory, so we could have a 
> fast ssd primary area (for wal and most metadata) and a second hdd 
> directory for stuff it has to push off.  Then have a conservative amount 
> of file space on the hdd.  If our block fills up, use the existing file 
> mechanism to put data there too.  (But then we have to maintain both the 
> current kv + file approach and not go all-in on kv + block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  

newstore direction

2015-10-19 Thread Sage Weil
The current design is based on two simple ideas:

 1) a key/value interface is a better way to manage all of our internal 
metadata (object metadata, attrs, layout, collection membership, 
write-ahead logging, overlay data, etc.)

 2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few 
things:

 - We currently write the data to the file, fsync, then commit the kv 
transaction.  That's at least 3 IOs: one for the data, one for the fs 
journal, one for the kv txn to commit (at least once my rocksdb changes 
land... the kv commit is currently 2-3).  So two people are managing 
metadata, here: the fs managing the file metadata (with its own 
journal) and the kv backend (with its journal).

 - On read we have to open files by name, which means traversing the fs 
namespace.  Newstore tries to keep it as flat and simple as possible, but 
at a minimum it is a couple btree lookups.  We'd love to use open by 
handle (which would reduce this to 1 btree traversal), but running 
the daemon as ceph and not root makes that hard...

 - ...and file systems insist on updating mtime on writes, even when it is 
an overwrite with no allocation changes.  (We don't care about mtime.)  
O_NOCMTIME patches exist but it is hard to get these past the kernel 
brainfreeze.

 - XFS is (probably) never going to give us data checksums, which we 
want desperately.

But what's the alternative?  My thought is to just bite the bullet and 
consume a raw block device directly.  Write an allocator, hopefully keep 
it pretty simple, and manage it in kv store along with all of our other 
metadata.

Wins:

 - 2 IOs for most: one to write the data to unused space in the block 
device, one to commit our transaction (vs 4+ before).  For overwrites, 
we'd have one io to do our write-ahead log (kv journal), then do 
the overwrite async (vs 4+ before).

 - No concern about mtime getting in the way

 - Faster reads (no fs lookup)

 - Similarly sized metadata for most objects.  If we assume most objects 
are not fragmented, then the metadata to store the block offsets is about 
the same size as the metadata to store the filenames we have now. 

Problems:

 - We have to size the kv backend storage (probably still an XFS 
partition) vs the block storage.  Maybe we do this anyway (put metadata on 
SSD!) so it won't matter.  But what happens when we are storing gobs of 
rgw index data or cephfs metadata?  Suddenly we are pulling storage out of 
a different pool and those aren't currently fungible.

 - We have to write and maintain an allocator.  I'm still optimistic this 
can be reasonably simple, especially for the flash case (where 
fragmentation isn't such an issue as long as our blocks are reasonably 
sized).  For disk we may need to be moderately clever.  (A minimal sketch 
follows after this list.)

 - We'll need a fsck to ensure our internal metadata is consistent.  The 
good news is it'll just need to validate what we have stored in the kv 
store.
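
To illustrate the allocator point above, a first-fit free-extent allocator 
really can be small.  The sketch below is purely illustrative (the class and 
names are made up, persistence and locking are ignored):

  // Minimal first-fit extent allocator sketch; free space is tracked as
  // offset -> length.  Illustrative only, not an actual newstore allocator.
  #include <cstdint>
  #include <iterator>
  #include <map>

  class ExtentAllocator {
    std::map<uint64_t, uint64_t> free_;   // offset -> length of free extent

  public:
    explicit ExtentAllocator(uint64_t device_size) { free_[0] = device_size; }

    // First-fit allocation of len bytes; returns the offset, or -1 if full.
    int64_t allocate(uint64_t len) {
      for (auto it = free_.begin(); it != free_.end(); ++it) {
        if (it->second < len)
          continue;
        uint64_t off = it->first;
        uint64_t left = it->second - len;
        free_.erase(it);
        if (left)
          free_[off + len] = left;        // keep the tail as a free extent
        return (int64_t)off;
      }
      return -1;
    }

    // Return an extent to the free map, merging with its neighbours.
    void release(uint64_t off, uint64_t len) {
      auto next = free_.lower_bound(off);
      if (next != free_.begin()) {
        auto prev = std::prev(next);
        if (prev->first + prev->second == off) {   // merge with predecessor
          off = prev->first;
          len += prev->second;
          free_.erase(prev);
        }
      }
      if (next != free_.end() && off + len == next->first) {  // merge successor
        len += next->second;
        free_.erase(next);
      }
      free_[off] = len;
    }
  };

In the real thing the free map itself would live in the kv store and be 
updated in the same transaction as the object metadata, which is also what 
the fsck would cross-check.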

Other thoughts:

 - We might want to consider whether dm-thin or bcache or other block 
layers might help us with elasticity of file vs block areas.

 - Rocksdb can push colder data to a second directory, so we could have a 
fast ssd primary area (for wal and most metadata) and a second hdd 
directory for stuff it has to push off.  Then have a conservative amount 
of file space on the hdd.  If our block fills up, use the existing file 
mechanism to put data there too.  (But then we have to maintain both the 
current kv + file approach and not go all-in on kv + block.)
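
As a rough sketch of that cold-data idea (the paths and sizes are examples 
only, and this assumes rocksdb's db_paths option behaves as documented):

  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  // Rough sketch: primary data on SSD, colder/larger SST files spill over
  // to an HDD directory.  Paths and target sizes are made-up examples.
  rocksdb::DB *open_tiered_db() {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    opts.db_paths.emplace_back("/ssd/newstore/db", 64ull << 30);      // hot tier
    opts.db_paths.emplace_back("/hdd/newstore/db.slow", 4ull << 40);  // cold tier
    opts.wal_dir = "/ssd/newstore/db.wal";                            // WAL on flash

    rocksdb::DB *db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/ssd/newstore/db", &db);
    return s.ok() ? db : nullptr;
  }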

Thoughts?
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: newstore direction

2015-10-19 Thread Varada Kari
Hi Sage,

If we are managing the raw device, does it make sense to have a key/value store 
manage the whole space?
Keeping the allocator's metadata separately might cause other consistency 
problems. Writing an fsck for that implementation can be tougher; we might have 
to do strict crc computation on the data, and we also have to keep the DB that 
manages it sane.
If we can have a common mechanism that keeps data and metadata in the same 
key/value store, it will improve performance.
We have integrated a custom-made key/value store that works on the raw device 
as the key/value store backend, and we have observed better bandwidth 
utilization and IOPS. Reads and writes can be faster and no fs lookup is 
needed. We have tools like fsck to take care of the consistency of the DB. 
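
To make that concrete, the kind of single keyspace I mean could look roughly 
like this (an illustrative key layout only, not our actual scheme):

  // Object metadata and fixed-size data chunks share one keyspace,
  // distinguished by prefix.  The chunk size is an assumed example value.
  #include <cstdint>
  #include <cstdio>
  #include <string>

  static std::string onode_key(const std::string &oid) {
    return "O/" + oid;                        // object metadata (onode)
  }

  static std::string data_key(const std::string &oid, uint64_t offset) {
    const uint64_t chunk = 64 * 1024;         // assumed 64 KB chunks
    char idx[32];
    std::snprintf(idx, sizeof(idx), "%016llx",
                  (unsigned long long)(offset / chunk));
    return "D/" + oid + "/" + idx;            // one key per chunk
  }

A single atomic batch can then update the onode key and the affected data 
keys together.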

Couple of comments inline.

Thanks,
Varada

> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, October 20, 2015 1:19 AM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal metadata
> (object metadata, attrs, layout, collection membership, write-ahead logging,
> overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one
> for the kv txn to commit (at least once my rocksdb changes land... the kv
> commit is currently 2-3).  So two people are managing metadata, here: the fs
> managing the file metadata (with its own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and consume
> a raw block device directly.  Write an allocator, hopefully keep it pretty
> simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
> before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are
> not fragmented, then the metadata to store the block offsets is about the
> same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw
> index data or cephfs metadata?  Suddenly we are pulling storage out of a
> different pool and those aren't currently fungible.

[Varada Kari]  Ideally, if we can manage the raw device as a key/value store 
indirection that handles both metadata and data, we benefit from faster 
lookups and writes (provided the KV store supports batched atomic 
transactional writes). SSDs might suffer more write amplification if we put 
only the metadata there; if the KV store also deals with the raw device 
directly (handling small writes), we can avoid that write amplification and 
get better throughput from the device.

>  - We have to write and maintain an allocator.  I'm still optimistic this can 
> be
> reasonbly simple, especially for the flash case (where fragmentation isn't
> such an issue as long as our blocks are reasonbly sized).  For disk we may
> beed to be moderately clever.
> 
[Varada Kari] Yes. If the writes are aligned to the flash programmable page 
size, that will not cause any issues. But writes smaller than the programmable 
page size will cause internal fragmentation, and repeated overwrites of the 
same page will cause more write amplification.
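
A back-of-the-envelope illustration (the page and write sizes below are 
assumed example values, not a claim about any particular device):

  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint64_t flash_page = 16 * 1024;  // assumed programmable page size
    const uint64_t user_write = 4 * 1024;   // a small sub-page update

    // The device has to program a whole page even for a partial update,
    // so the device-level write amplification for this single write is:
    double wa = static_cast<double>(flash_page) /
                static_cast<double>(user_write);
    std::printf("user write %llu B -> device programs %llu B (WA = %.1fx)\n",
                (unsigned long long)user_write,
                (unsigned long long)flash_page, wa);
    return 0;
  }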

>  - We'll need a fsck to ensure our internal metadata is consistent.  The good
> news is it'll just need to validate what we have stored in the kv store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block
> layers 

RE: newstore direction

2015-10-19 Thread James (Fei) Liu-SSI
Hi Sage and Somnath,
  In my humble opinion, there is a more aggressive solution than a key/value 
store on a raw block device as the objectstore backend: a new key/value SSD 
device with transaction support would be ideal to solve these issues. First, 
it is a raw SSD device. Second, it provides a key/value interface directly 
from the SSD. Third, it can provide transaction support, so consistency is 
guaranteed by the hardware. It pretty much satisfies all of the objectstore's 
needs without any extra overhead, since there is no extra layer between the 
device and the objectstore. 
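
To make the idea concrete, the interface such a device could expose might 
look roughly like the following (entirely hypothetical names; no real vendor 
API is implied):

  #include <cstdint>
  #include <string>
  #include <vector>

  // Hypothetical sketch of a transactional key/value SSD interface.
  struct KvSsdOp {
    enum class Type { PUT, DELETE_KEY } type;
    std::string key;
    std::vector<uint8_t> value;    // empty for DELETE_KEY
  };

  class KvSsd {
  public:
    virtual ~KvSsd() = default;
    // All ops in the batch become durable atomically, or none do;
    // atomicity is assumed to be provided by the device itself.
    virtual int submit_transaction(const std::vector<KvSsdOp> &batch) = 0;
    virtual int get(const std::string &key, std::vector<uint8_t> *value) = 0;
  };
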
   Either way, I strongly support Ceph having its own data format instead of 
relying on a filesystem.  

  Regards,
  James

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 1:55 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, Somnath Roy wrote:
> Sage,
> I fully support that.  If we want to saturate SSDs , we need to get 
> rid of this filesystem overhead (which I am in process of measuring). 
> Also, it will be good if we can eliminate the dependency on the k/v 
> dbs (for storing allocators and all). The reason is the unknown write 
> amps they causes.

My hope is to keep behind the KeyValueDB interface (and/or change it as
appropriate) so that other backends can be easily swapped in (e.g. a 
btree-based one for high-end flash).
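
Roughly, the shape of that pluggable interface (a simplified illustration 
only, not the actual KeyValueDB header) is a transactional set/remove API 
that any backend can implement:

  #include <memory>
  #include <string>

  struct KVTransaction {
    virtual ~KVTransaction() = default;
    virtual void set(const std::string &prefix, const std::string &key,
                     const std::string &value) = 0;
    virtual void rmkey(const std::string &prefix, const std::string &key) = 0;
  };

  struct KVBackend {
    virtual ~KVBackend() = default;
    virtual std::shared_ptr<KVTransaction> get_transaction() = 0;
    virtual int submit_transaction_sync(std::shared_ptr<KVTransaction> t) = 0;
    virtual int get(const std::string &prefix, const std::string &key,
                    std::string *out) = 0;
  };

  // A rocksdb-backed or btree-backed store would derive from KVBackend;
  // callers only ever see the abstract interface.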

sage


> 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 12:49 PM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, 
> write-ahead logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A 
> few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb 
> changes land... the kv commit is currently 2-3).  So two people are 
> managing metadata, here: the fs managing the file metadata (with its 
> own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but at 
> a minimum it is a couple btree lookups.  We'd love to use open by handle 
> (which would reduce this to 1 btree traversal), but running the daemon as 
> ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a 
> overwrite with no allocation changes.  (We don't care about mtime.) 
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we 
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep it 
> pretty simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device, 
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one 
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ 
> before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are 
> not fragmented, then the metadata to store the block offsets is about the 
> same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put 
> metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw 
> index data or cephfs metadata?  Suddenly we are pulling storage out of a 
> different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this can 
> be reasonbly simple, especially for the flash case (where fragmentation isn't 
> such an issue as long as our blocks are reasonbly sized).  For disk we may 
> beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The good 
> news is it'll just need to validate what we have stored in the kv store.
> 
> Other thoughts:
> 
>  - We might want to consider whether 

Re: newstore direction

2015-10-19 Thread John Spray
On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  wrote:
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.

This is the concerning bit for me -- the other parts one "just" has to
get the code right, but this problem could linger and be something we
have to keep explaining to users indefinitely.  It reminds me of cases
in other systems where users had to make an educated guess about inode
size up front, depending on whether you're expecting to efficiently
store a lot of xattrs.

In practice it's rare for users to make these kinds of decisions well
up-front: it really needs to be adjustable later, ideally
automatically.  That could be pretty straightforward if the KV part
was stored directly on block storage, instead of having XFS in the
mix.  I'm not quite up with the state of the art in this area: are
there any reasonable alternatives for the KV part that would consume
some defined range of a block device from userspace, instead of
sitting on top of a filesystem?

John
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: newstore direction

2015-10-19 Thread Chen, Xiaoxi
+1.  Nowadays K-V DBs care mostly about very small key-value pairs, say several 
bytes to a few KB, but in the SSD case we only care about 4 KB or 8 KB values. 
In this sense NVMKV is a good design, and it seems some SSD vendors are also 
trying to build this kind of interface; we have an NVM-L library, but it is 
still under development.
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> Sent: Tuesday, October 20, 2015 6:21 AM
> To: Sage Weil; Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive  solution than raw
> block device base keyvalue store as backend for objectstore. The new key
> value  SSD device with transaction support would be  ideal to solve the 
> issues.
> First of all, it is raw SSD device. Secondly , It provides key value interface
> directly from SSD. Thirdly, it can provide transaction support, consistency 
> will
> be guaranteed by hardware device. It pretty much satisfied all of objectstore
> needs without any extra overhead since there is not any extra layer in
> between device and objectstore.
>Either way, I strongly support to have CEPH own data format instead of
> relying on filesystem.
> 
>   Regards,
>   James
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get
> > rid of this filesystem overhead (which I am in process of measuring).
> > Also, it will be good if we can eliminate the dependency on the k/v
> > dbs (for storing allocators and all). The reason is the unknown write
> > amps they causes.
> 
> My hope is to keep behing the KeyValueDB interface (and/more change it as
> appropriate) so that other backends can be easily swapped in (e.g. a btree-
> based one for high-end flash).
> 
> sage
> 
> 
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> >
> > The current design is based on two simple ideas:
> >
> >  1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >  2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > few
> > things:
> >
> >  - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb
> > changes land... the kv commit is currently 2-3).  So two people are
> > managing metadata, here: the fs managing the file metadata (with its
> > own
> > journal) and the kv backend (with its journal).
> >
> >  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> >
> >  - ...and file systems insist on updating mtime on writes, even when it is a
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> >
> >  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep it
> pretty simple, and manage it in kv store along with all of our other metadata.
> >
> > Wins:
> >
> >  - 2 IOs for most: one to write the data to unused space in the block 
> > device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
> before).
> >
> >  - No concern about mtime getting in the way
> >
> >  - Faster reads (no fs lookup)
> >
> >  - Similarly sized metadata for most objects.  If we assume most objects are
> not fragmented, then the metadata to store the block offsets is about the
> same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe 

RE: newstore direction

2015-10-19 Thread Chen, Xiaoxi
There is something like http://pmem.io/nvml/libpmemobj/ to adapt NVMe to 
transactional object storage.

But it definitely needs some more work.
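
For reference, an atomic update through the libpmemobj C API looks roughly 
like this (a sketch only; the layout struct, sizes and function here are 
made up for illustration, and it assumes len fits in the root object):

  #include <libpmemobj.h>
  #include <cstring>

  struct my_root {
    uint64_t len;
    char     data[4096];
  };

  static int write_atomically(PMEMobjpool *pop, const char *buf, size_t len)
  {
    PMEMoid root_oid = pmemobj_root(pop, sizeof(struct my_root));
    struct my_root *root =
        static_cast<struct my_root *>(pmemobj_direct(root_oid));
    int ret = 0;

    TX_BEGIN(pop) {
      // Snapshot the range we modify; it is rolled back automatically
      // if the transaction aborts.
      pmemobj_tx_add_range(root_oid, 0, sizeof(struct my_root));
      std::memcpy(root->data, buf, len);
      root->len = len;
    } TX_ONABORT {
      ret = -1;
    } TX_END

    return ret;
  }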

> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Varada Kari
> Sent: Tuesday, October 20, 2015 10:33 AM
> To: James (Fei) Liu-SSI; Sage Weil; Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi James,
> 
> Are you mentioning SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family) ?
> If SCSI OSD is what you are mentioning, drive has to support all osd
> functionality mentioned by T10.
> If not, we have to implement the same functionality in kernel or have a
> wrapper in user space to convert them to read/write calls.  This seems more
> effort.
> 
> Varada
> 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> > Sent: Tuesday, October 20, 2015 3:51 AM
> > To: Sage Weil ; Somnath Roy
> > 
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> >
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive  solution
> > than raw block device base keyvalue store as backend for objectstore.
> > The new key value  SSD device with transaction support would be  ideal
> > to solve the issues. First of all, it is raw SSD device. Secondly , It
> > provides key value interface directly from SSD. Thirdly, it can
> > provide transaction support, consistency will be guaranteed by
> > hardware device. It pretty much satisfied all of objectstore needs
> > without any extra overhead since there is not any extra layer in between
> device and objectstore.
> >Either way, I strongly support to have CEPH own data format instead
> > of relying on filesystem.
> >
> >   Regards,
> >   James
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> > ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> >
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get
> > > rid of this filesystem overhead (which I am in process of measuring).
> > > Also, it will be good if we can eliminate the dependency on the k/v
> > > dbs (for storing allocators and all). The reason is the unknown
> > > write amps they causes.
> >
> > My hope is to keep behing the KeyValueDB interface (and/more change it
> > as
> > appropriate) so that other backends can be easily swapped in (e.g. a
> > btree- based one for high-end flash).
> >
> > sage
> >
> >
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > >
> > > -Original Message-
> > > From: ceph-devel-ow...@vger.kernel.org
> > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > >
> > > The current design is based on two simple ideas:
> > >
> > >  1) a key/value interface is better way to manage all of our
> > > internal metadata (object metadata, attrs, layout, collection
> > > membership, write-ahead logging, overlay data, etc.)
> > >
> > >  2) a file system is well suited for storage object data (as files).
> > >
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.
> > > A few
> > > things:
> > >
> > >  - We currently write the data to the file, fsync, then commit the
> > > kv transaction.  That's at least 3 IOs: one for the data, one for
> > > the fs journal, one for the kv txn to commit (at least once my
> > > rocksdb changes land... the kv commit is currently 2-3).  So two
> > > people are managing metadata, here: the fs managing the file
> > > metadata (with its own
> > > journal) and the kv backend (with its journal).
> > >
> > >  - On read we have to open files by name, which means traversing the
> > > fs
> > namespace.  Newstore tries to keep it as flat and simple as possible,
> > but at a minimum it is a couple btree lookups.  We'd love to use open
> > by handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> > >
> > >  - ...and file systems insist on updating mtime on writes, even when
> > > it is a
> > overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> > >
> > >  - XFS is (probably) never going going to give us data checksums,
> > > which we
> > want desperately.
> > >
> > > But what's the alternative?  My thought is to just bite the bullet
> > > and
> > consume a raw block device directly.  Write an allocator, hopefully
> > keep it pretty simple, and manage it in kv store along with 

RE: newstore direction

2015-10-19 Thread Varada Kari
Hi James,

Are you mentioning SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family) ? If 
SCSI OSD is what you are mentioning, drive has to support all osd functionality 
mentioned by T10.
If not, we have to implement the same functionality in kernel or have a wrapper 
in user space to convert them to read/write calls.  This seems more effort.

Varada

> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> Sent: Tuesday, October 20, 2015 3:51 AM
> To: Sage Weil ; Somnath Roy
> 
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive  solution than raw
> block device base keyvalue store as backend for objectstore. The new key
> value  SSD device with transaction support would be  ideal to solve the
> issues. First of all, it is raw SSD device. Secondly , It provides key value
> interface directly from SSD. Thirdly, it can provide transaction support,
> consistency will be guaranteed by hardware device. It pretty much satisfied
> all of objectstore needs without any extra overhead since there is not any
> extra layer in between device and objectstore.
>Either way, I strongly support to have CEPH own data format instead of
> relying on filesystem.
> 
>   Regards,
>   James
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get
> > rid of this filesystem overhead (which I am in process of measuring).
> > Also, it will be good if we can eliminate the dependency on the k/v
> > dbs (for storing allocators and all). The reason is the unknown write
> > amps they causes.
> 
> My hope is to keep behing the KeyValueDB interface (and/more change it as
> appropriate) so that other backends can be easily swapped in (e.g. a btree-
> based one for high-end flash).
> 
> sage
> 
> 
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> >
> > The current design is based on two simple ideas:
> >
> >  1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >  2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > few
> > things:
> >
> >  - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb
> > changes land... the kv commit is currently 2-3).  So two people are
> > managing metadata, here: the fs managing the file metadata (with its
> > own
> > journal) and the kv backend (with its journal).
> >
> >  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> >
> >  - ...and file systems insist on updating mtime on writes, even when it is a
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> >
> >  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep it
> pretty simple, and manage it in kv store along with all of our other metadata.
> >
> > Wins:
> >
> >  - 2 IOs for most: one to write the data to unused space in the block 
> > device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
> before).
> >
> >  - No concern about mtime getting in the way
> >
> >  - Faster reads (no fs lookup)
> >
> >  - Similarly sized metadata for most objects.  If we assume most objects are
> not fragmented, then the metadata to store the block offsets is about the
> same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >  - We have to size the kv backend 

Re: newstore direction

2015-10-19 Thread Haomai Wang
On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil  wrote:
> The current design is based on two simple ideas:
>
>  1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>  2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two people are managing
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
>
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
>
>  - ...and file systems insist on updating mtime on writes, even when it is
> a overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.

This is really a tough decision, although the idea of a block-device-based
objectstore has never left my mind these past two years.

My main concerns would be the efficiency of space utilization compared to a
local fs, the potential for bugs, and the time it takes to build even a tiny
local filesystem. I'm a little afraid we would get stuck there.

>
> Wins:
>
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).

Compared to FileJournal, the key/value DB doesn't seem to play well in the
WAL area, judging from my perf results.

>
>  - No concern about mtime getting in the way
>
>  - Faster reads (no fs lookup)
>
>  - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>  - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonbly simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonbly
> sized).  For disk we may beed to be moderately clever.
>
>  - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate what we have stored in the kv
> store.
>
> Other thoughts:
>
>  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
>  - Rocksdb can push colder data to a second directory, so we could have a
> fast ssd primary area (for wal and most metadata) and a second hdd
> directory for stuff it has to push off.  Then have a conservative amount
> of file space on the hdd.  If our block fills up, use the existing file
> mechanism to put data there too.  (But then we have to maintain both the
> current kv + file approach and not go all-in on kv + block.)

A complex way...

Actually I would like to pursue a FileStore2 implementation, which means we
still use FileJournal (or something like it), but keep more metadata/xattrs
in memory and use aio+dio to flush to disk. A userspace page cache would
need to be implemented. Then we could skip the journal for full-object
writes; since OSDs have PG isolation, we could put a barrier on a single PG
when skipping the journal. @Sage, are there any other concerns about
FileStore skipping the journal?

In a word, I like the model FileStore has, but we would need a big refactor
of the existing implementation.

Sorry to interrupt the train of thought.

>
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--

Re: [PATCH v3] net: ceph: messenger: Use local variable cursor instead of &msg->cursor

2015-10-19 Thread Ilya Dryomov
On Mon, Oct 19, 2015 at 6:29 PM, Shraddha Barke  wrote:
> Use local variable cursor in place of &msg->cursor in
> read_partial_msg_data() and write_partial_msg_data()
>
> Signed-off-by: Shraddha Barke 
> ---
> Changes in v3-
>  Replace &msg->cursor with cursor in write_partial_msg_data() too
>
>  net/ceph/messenger.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..01b567b 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1552,8 +1552,8 @@ static int write_partial_message_data(struct 
> ceph_connection *con)
> bool need_crc;
> int ret;
>
> -   page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> -   &last_piece);
> +   page = ceph_msg_data_next(cursor, &page_offset, &length,
> + &last_piece);
> ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> length, !last_piece);
> if (ret <= 0) {
> @@ -1564,7 +1564,7 @@ static int write_partial_message_data(struct 
> ceph_connection *con)
> }
> if (do_datacrc && cursor->need_crc)
> crc = ceph_crc32c_page(crc, page, page_offset, 
> length);
> -   need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +   need_crc = ceph_msg_data_advance(cursor, (size_t)ret);
> }
>
> dout("%s %p msg %p done\n", __func__, con, msg);
> @@ -2246,8 +2246,8 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
> if (do_datacrc)
> crc = con->in_data_crc;
> while (cursor->resid) {
> -   page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> -   NULL);
> +   page = ceph_msg_data_next(cursor, &page_offset, &length,
> + NULL);
> ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
> if (ret <= 0) {
> if (do_datacrc)
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
>
> if (do_datacrc)
> crc = ceph_crc32c_page(crc, page, page_offset, ret);
> -   (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> +   (void) ceph_msg_data_advance(cursor, (size_t)ret);
> }
> if (do_datacrc)
> con->in_data_crc = crc;

Applied, see
https://github.com/ceph/ceph-client/commit/621a56fd69751d263795f4f35e65eff7daa3a470.

Generally, we prefix net/ceph commits with "libceph".  Different
subsystems have different conventions, use "git log --oneline "
to get a sense of what is preferred.

Thanks,

Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash when deep-scrubbing

2015-10-19 Thread changtao381
Jiaying Ren  gmail.com> writes:

> 
> Hi, cephers:
> 
> I've encountered a problem that a pg stuck in inconsistent status:
> 
> $ ceph -s
> cluster 27d39faa-48ae-4356-a8e3-19d5b81e179e
>  health HEALTH_ERR 1 pgs inconsistent; 34 near full osd(s); 1
> scrub errors; noout flag(s) set
>  monmap e4: 3 mons at
>
{server-61.0..x.in=10.8.0.61:6789/0,server-62.0..x.i
n=10.8.0.62:6789/0,server-63.0..x.in=10.8.0.63:6789/0},
> election epoch 6706, quorum 0,1,2
>
server-61.0..x.in,server-62.0..x.in,server-63.0.
.x.in
>  osdmap e87808: 180 osds: 180 up, 180 in
> flags noout
>   pgmap v29322850: 35026 pgs, 15 pools, 27768 GB data, 1905 kobjects
> 83575 GB used, 114 TB / 196 TB avail
>35025 active+clean
>1 active+clean+inconsistent
>   client io 120 kB/s rd, 216 MB/s wr, 6398 op/s
> 
> `pg repair` cmd doesn't work, so I manually repaired a inconsistent
object(pool
> size is 3,I removed the object different from other two copys).after that
pg
> still in inconsistent status:
> 
> $ ceph pg dump | grep active+clean+inconsistent
> dumped all in format plain
> 3.d70   290 0   0   0   4600869888  30503050
>   stale+active+clean+inconsistent 2015-10-18 13:05:43.320451
>   87798'7631234   87798:10758311[131,119,132]   131
>   [131,119,132]   131 85161'7599152   2015-10-16 14:34:21.283303
>   85161'7599152   2015-10-16 14:34:21.283303
> 
> And after restarted osd.131, the primary osd osd.131 would crash,the
straceback:
> 
>  1: /usr/bin/ceph-osd() [0x9c6de1]
>  2: (()+0xf790) [0x7f384b6b8790]
>  3: (gsignal()+0x35) [0x7f384a58a625]
>  4: (abort()+0x175) [0x7f384a58be05]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f384ae44a5d]
>  6: (()+0xbcbe6) [0x7f384ae42be6]
>  7: (()+0xbcc13) [0x7f384ae42c13]
>  8: (()+0xbcd0e) [0x7f384ae42d0e]
>  9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e)
[0x9cd0de]
>  10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x81)
[0x7dfaf1]
>  11: (PG::_scan_snaps(ScrubMap&)+0x394) [0x84b8c4]
>  12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
> ThreadPool::TPHandle&)+0x27b) [0x84cdab]
>  13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x5c4) [0x85c1b4]
>  14: (PG::scrub(ThreadPool::TPHandle&)+0x181) [0x85d691]
>  15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x1c) [0x6737cc]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x53d) [0x9e05dd]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0x9e1760]
>  18: (()+0x7a51) [0x7f384b6b0a51]
>  19: (clone()+0x6d) [0x7f384a6409ad]
> 
> ceph version is v0.80.9, manually executes `ceph pg deep-scrub 3.d70`
would also
> cause osd crash.
> 
> Any ideas? or did I missed some logs necessary for further investigation?
> 
> Thx.
> 
> --
> Best Regards!
> Jiaying Ren(mikulely)
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo  vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

I have hit a problem when running the 'ceph pg deep-scrub' command that also
causes an osd crash. In the end I found that some sectors of the disk had
become corrupted, so please check the dmesg output for disk errors.


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] net: ceph: messenger: Use local variable cursor instead of &msg->cursor

2015-10-19 Thread Shraddha Barke
Use local variable cursor in place of &msg->cursor in
read_partial_msg_data() and write_partial_msg_data()

Signed-off-by: Shraddha Barke 
---
Changes in v3-
 Replace &msg->cursor with cursor in write_partial_msg_data() too

 net/ceph/messenger.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index b9b0e3b..01b567b 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1552,8 +1552,8 @@ static int write_partial_message_data(struct 
ceph_connection *con)
bool need_crc;
int ret;
 
-   page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
-   &last_piece);
+   page = ceph_msg_data_next(cursor, &page_offset, &length,
+ &last_piece);
ret = ceph_tcp_sendpage(con->sock, page, page_offset,
length, !last_piece);
if (ret <= 0) {
@@ -1564,7 +1564,7 @@ static int write_partial_message_data(struct 
ceph_connection *con)
}
if (do_datacrc && cursor->need_crc)
crc = ceph_crc32c_page(crc, page, page_offset, length);
-   need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
+   need_crc = ceph_msg_data_advance(cursor, (size_t)ret);
}
 
dout("%s %p msg %p done\n", __func__, con, msg);
@@ -2246,8 +2246,8 @@ static int read_partial_msg_data(struct ceph_connection 
*con)
if (do_datacrc)
crc = con->in_data_crc;
while (cursor->resid) {
-   page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
-   NULL);
+   page = ceph_msg_data_next(cursor, &page_offset, &length,
+ NULL);
ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
if (ret <= 0) {
if (do_datacrc)
@@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection 
*con)
 
if (do_datacrc)
crc = ceph_crc32c_page(crc, page, page_offset, ret);
-   (void) ceph_msg_data_advance(&msg->cursor, (size_t)ret);
+   (void) ceph_msg_data_advance(cursor, (size_t)ret);
}
if (do_datacrc)
con->in_data_crc = crc;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails

2015-10-19 Thread Ilya Dryomov
Returning pool id (i.e. >= 0) from a sysfs ->store() callback makes
userspace think it needs to retry the write.  Fix it - it's a leftover
from the times when the equivalent of rbd_dev_create() was the first
action in rbd_add().

Signed-off-by: Ilya Dryomov 
---
 drivers/block/rbd.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 07f666f4ca18..df795deffe77 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5394,7 +5394,7 @@ static ssize_t do_rbd_add(struct bus_type *bus,
struct rbd_spec *spec = NULL;
struct rbd_client *rbdc;
bool read_only;
-   int rc = -ENOMEM;
+   int rc;
 
if (!try_module_get(THIS_MODULE))
return -ENODEV;
@@ -5429,8 +5429,10 @@ static ssize_t do_rbd_add(struct bus_type *bus,
}
 
rbd_dev = rbd_dev_create(rbdc, spec, rbd_opts);
-   if (!rbd_dev)
+   if (!rbd_dev) {
+   rc = -ENOMEM;
goto err_out_client;
+   }
rbdc = NULL;/* rbd_dev now owns this */
spec = NULL;/* rbd_dev now owns this */
rbd_opts = NULL;/* rbd_dev now owns this */
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: set device_type::release instead of device::release

2015-10-19 Thread Ilya Dryomov
No point in providing an empty device_type::release callback and then
setting device::release for each rbd_dev dynamically.

Signed-off-by: Ilya Dryomov 
---
 drivers/block/rbd.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4917de726bdb..fa6767e9ed2a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3986,14 +3986,12 @@ static const struct attribute_group *rbd_attr_groups[] 
= {
NULL
 };
 
-static void rbd_sysfs_dev_release(struct device *dev)
-{
-}
+static void rbd_dev_release(struct device *dev);
 
 static struct device_type rbd_device_type = {
.name   = "rbd",
.groups = rbd_attr_groups,
-   .release= rbd_sysfs_dev_release,
+   .release= rbd_dev_release,
 };
 
 static struct rbd_spec *rbd_spec_get(struct rbd_spec *spec)
@@ -4074,7 +4072,6 @@ static struct rbd_device *rbd_dev_create(struct 
rbd_client *rbdc,
rbd_dev->dev.bus = &rbd_bus_type;
rbd_dev->dev.type = &rbd_device_type;
rbd_dev->dev.parent = &rbd_root_dev;
-   rbd_dev->dev.release = rbd_dev_release;
device_initialize(&rbd_dev->dev);
 
rbd_dev->rbd_client = rbdc;
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] rbd: don't free rbd_dev outside of the release callback

2015-10-19 Thread Ilya Dryomov
struct rbd_device has struct device embedded in it, which means it's
part of kobject universe and has an unpredictable life cycle.  Freeing
its memory outside of the release callback is flawed, yet commits
200a6a8be5db ("rbd: don't destroy rbd_dev in device release function")
and 8ad42cd0c002 ("rbd: don't have device release destroy rbd_dev")
moved rbd_dev_destroy() out to rbd_dev_image_release().

This commit reverts most of that, the key points are:

- rbd_dev->dev is initialized in rbd_dev_create(), making it possible
  to use rbd_dev_destroy() - which is just a put_device() - both before
  we register with device core and after.

- rbd_dev_release() (the release callback) is the only place we
  kfree(rbd_dev).  It's also where we do module_put(), keeping the
  module unload race window as small as possible.

- We pin the module in rbd_dev_create(), but only for mapping
  rbd_dev-s.  Moving image related stuff out of struct rbd_device into
  another struct which isn't tied with sysfs and device core is long
  overdue, but until that happens, this will keep rbd module refcount
  (which users can observe with lsmod) sane.

Fixes: http://tracker.ceph.com/issues/12697

Cc: Alex Elder 
Signed-off-by: Ilya Dryomov 
---
 drivers/block/rbd.c | 89 -
 1 file changed, 47 insertions(+), 42 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index df795deffe77..4917de726bdb 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -418,8 +418,6 @@ MODULE_PARM_DESC(single_major, "Use a single major number 
for all rbd devices (d
 
 static int rbd_img_request_submit(struct rbd_img_request *img_request);
 
-static void rbd_dev_device_release(struct device *dev);
-
 static ssize_t rbd_add(struct bus_type *bus, const char *buf,
   size_t count);
 static ssize_t rbd_remove(struct bus_type *bus, const char *buf,
@@ -4038,6 +4036,25 @@ static void rbd_spec_free(struct kref *kref)
kfree(spec);
 }
 
+static void rbd_dev_release(struct device *dev)
+{
+   struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
+   bool need_put = !!rbd_dev->opts;
+
+   rbd_put_client(rbd_dev->rbd_client);
+   rbd_spec_put(rbd_dev->spec);
+   kfree(rbd_dev->opts);
+   kfree(rbd_dev);
+
+   /*
+* This is racy, but way better than putting module outside of
+* the release callback.  The race window is pretty small, so
+* doing something similar to dm (dm-builtin.c) is overkill.
+*/
+   if (need_put)
+   module_put(THIS_MODULE);
+}
+
 static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
 struct rbd_spec *spec,
 struct rbd_options *opts)
@@ -4054,6 +4071,12 @@ static struct rbd_device *rbd_dev_create(struct 
rbd_client *rbdc,
INIT_LIST_HEAD(&rbd_dev->node);
init_rwsem(&rbd_dev->header_rwsem);
 
+   rbd_dev->dev.bus = &rbd_bus_type;
+   rbd_dev->dev.type = &rbd_device_type;
+   rbd_dev->dev.parent = &rbd_root_dev;
+   rbd_dev->dev.release = rbd_dev_release;
+   device_initialize(&rbd_dev->dev);
+
rbd_dev->rbd_client = rbdc;
rbd_dev->spec = spec;
rbd_dev->opts = opts;
@@ -4065,15 +4088,21 @@ static struct rbd_device *rbd_dev_create(struct 
rbd_client *rbdc,
rbd_dev->layout.fl_object_size = cpu_to_le32(1 << RBD_MAX_OBJ_ORDER);
rbd_dev->layout.fl_pg_pool = cpu_to_le32((u32) spec->pool_id);
 
+   /*
+* If this is a mapping rbd_dev (as opposed to a parent one),
+* pin our module.  We have a ref from do_rbd_add(), so use
+* __module_get().
+*/
+   if (rbd_dev->opts)
+   __module_get(THIS_MODULE);
+
return rbd_dev;
 }
 
 static void rbd_dev_destroy(struct rbd_device *rbd_dev)
 {
-   rbd_put_client(rbd_dev->rbd_client);
-   rbd_spec_put(rbd_dev->spec);
-   kfree(rbd_dev->opts);
-   kfree(rbd_dev);
+   if (rbd_dev)
+   put_device(&rbd_dev->dev);
 }
 
 /*
@@ -4699,27 +4728,6 @@ static int rbd_dev_header_info(struct rbd_device 
*rbd_dev)
return rbd_dev_v2_header_info(rbd_dev);
 }
 
-static int rbd_bus_add_dev(struct rbd_device *rbd_dev)
-{
-   struct device *dev;
-   int ret;
-
-   dev = &rbd_dev->dev;
-   dev->bus = &rbd_bus_type;
-   dev->type = &rbd_device_type;
-   dev->parent = &rbd_root_dev;
-   dev->release = rbd_dev_device_release;
-   dev_set_name(dev, "%d", rbd_dev->dev_id);
-   ret = device_register(dev);
-
-   return ret;
-}
-
-static void rbd_bus_del_dev(struct rbd_device *rbd_dev)
-{
-   device_unregister(&rbd_dev->dev);
-}
-
 /*
  * Get a unique rbd identifier for the given new rbd_dev, and add
  * the rbd_dev to the global list.
@@ -5222,7 +5230,8 @@ static int rbd_dev_device_setup(struct rbd_device 
*rbd_dev)
set_capacity(rbd_dev->disk, rbd_dev->mapping.size /