Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Hajnoczi
On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:
 Am 21.11.2012 07:41, schrieb Stefan Hajnoczi:
 On Tue, Nov 20, 2012 at 8:16 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi Stefan,
 
 Am 20.11.2012 17:29, schrieb Stefan Hajnoczi:
 
 On Tue, Nov 20, 2012 at 01:44:55PM +0100, Stefan Priebe wrote:
 
 rbd / rados quite often returns the length of writes
 or discarded blocks. These values might be bigger than int.
 
 Signed-off-by: Stefan Priebe s.pri...@profihost.ag
 ---
block/rbd.c |4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
 
 
 Looks good, but I want to check whether this fixes a bug you've hit.
 Please indicate details of the bug and how to reproduce it in the commit
 message.
 
 
 You get various I/O errors in the guest, because negative return values indicate
 I/O errors. When a large positive value is returned by librbd, block/rbd tries
 to store it in acb->ret, which is an int. The value then wraps around and becomes
 negative, so block/rbd thinks this is an I/O error and reports it
 to the guest.
 
 It's still not clear whether this is a bug that you can reproduce.
 After all, the ret value would have to be over 2^31, which is a 2+ GB
 request!
 Yes, and that is exactly the case.
 
 Look here:
    if (acb->cmd == RBD_AIO_WRITE ||
        acb->cmd == RBD_AIO_DISCARD) {
        if (r < 0) {
            acb->ret = r;
            acb->error = 1;
        } else if (!acb->error) {
            acb->ret = rcb->size;
        }
 
 It sets acb->ret to rcb->size. But the size from a DISCARD, if you
 discard a whole device, might be 500GB or nowadays even several TB.

We're going in circles here.  I know the types are wrong in the code and
your patch fixes it, that's why I said it looks good in my first reply.

QEMU is currently in hard freeze and only critical patches should go in.
Providing steps to reproduce the bug helps me decide that this patch
should still be merged for QEMU 1.3-rc1.

Anyway, the patch is straightforward, I have applied it to my block tree
and it will be in QEMU 1.3-rc1:
https://github.com/stefanha/qemu/commits/block

Stefan


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Priebe - Profihost AG

Am 21.11.2012 09:26, schrieb Stefan Hajnoczi:

On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:

Am 21.11.2012 07:41, schrieb Stefan Hajnoczi:

We're going in circles here.  I know the types are wrong in the code and
your patch fixes it, that's why I said it looks good in my first reply.


Sorry, I'm not so familiar with processes like these.



QEMU is currently in hard freeze and only critical patches should go in.
Providing steps to reproduce the bug helps me decide that this patch
should still be merged for QEMU 1.3-rc1.

Anyway, the patch is straightforward, I have applied it to my block tree
and it will be in QEMU 1.3-rc1:
https://github.com/stefanha/qemu/commits/block


Thanks!

The steps to reproduce are:
run mkfs.xfs -f on a whole device that is bigger than int in bytes. mkfs.xfs sends a
discard. It is important that you use scsi-hd and set
discard_granularity=512; otherwise rbd disables discard support.
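
For reference, a rough sketch of a setup that triggers it (pool, image and
guest device names are hypothetical, and the exact QEMU options may differ
per version):

rbd -p rbd create big-disk --size 10240     # 10 GB image, well over 2^31 bytes
qemu-system-x86_64 ... \
    -drive file=rbd:rbd/big-disk,format=raw,if=none,id=drive-scsi0,cache=writeback \
    -device virtio-scsi-pci,id=scsi0 \
    -device scsi-hd,bus=scsi0.0,drive=drive-scsi0,discard_granularity=512
# inside the guest, discard the whole device:
mkfs.xfs -f /dev/sda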


Might you have a look at my other rbd fix too? It fixes a race between 
task cancellation and writes. The same race was fixed in iscsi this summer.


Greets,
Stefan


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Hajnoczi
On Wed, Nov 21, 2012 at 09:33:08AM +0100, Stefan Priebe - Profihost AG wrote:
 Am 21.11.2012 09:26, schrieb Stefan Hajnoczi:
 On Wed, Nov 21, 2012 at 08:47:16AM +0100, Stefan Priebe - Profihost AG wrote:
 Am 21.11.2012 07:41, schrieb Stefan Hajnoczi:
 QEMU is currently in hard freeze and only critical patches should go in.
 Providing steps to reproduce the bug helps me decide that this patch
 should still be merged for QEMU 1.3-rc1.
 
 Anyway, the patch is straightforward, I have applied it to my block tree
 and it will be in QEMU 1.3-rc1:
 https://github.com/stefanha/qemu/commits/block
 
 Thanks!
 
 The steps to reproduce are:
 run mkfs.xfs -f on a whole device that is bigger than int in bytes. mkfs.xfs sends
 a discard. It is important that you use scsi-hd and set
 discard_granularity=512; otherwise rbd disables discard support.

Excellent, thanks!  I will add it to the commit description.

 Might you have a look at my other rbd fix too? It fixes a race
 between task cancellation and writes. The same race was fixed in
 iscsi this summer.

Yes.

Stefan


Re: [Openstack] Ceph + Nova

2012-11-21 Thread Sébastien Han
Hi,

I don't think this is the best place to ask your question, since it's not
directly related to OpenStack but more about Ceph; I have CC'd the ceph ML.
Anyway, CephFS is not ready for production yet, but I have heard that some
people use it. People from Inktank (the company behind Ceph) don't recommend
it; AFAIR they expect something more production-ready for Q2 2013. You can
use it (I did, for testing purposes), but it's at your own risk.
Besides this, RBD and RADOS are robust and stable now, so you can go
with the Cinder and Glance integration without any problems.

Cheers!

On Wed, Nov 21, 2012 at 9:37 AM, JuanFra Rodríguez Cardoso
juanfra.rodriguez.card...@gmail.com wrote:
 Hi everyone:

 I'd like to know your opinion as nova experts:

 Would you recommend CephFS as shared storage in /var/lib/nova/instances?
 Another option would be to use GlusterFS or MooseFS for the
 /var/lib/nova/instances directory and Ceph RBD for Glance and Nova volumes,
 don't you think?

 Thanks for your attention.

 Best regards,
 JuanFra



Re: [PATCH] make mkcephfs and init-ceph osd filesystem handling more flexible

2012-11-21 Thread Danny Al-Gaaf
Hi,

No, I have it basically ready, but I have to run some tests first.
You'll have it in the next few days!

Danny

Am 21.11.2012 01:23, schrieb Sage Weil:
 If you haven't gotten to this yet, I'll go ahead and jump on it..
 let me know!
 
 Thanks- sage
 
 
 On Thu, 9 Aug 2012, Danny Kukawka wrote:
 
 Remove btrfs specific keys and replace them by more generic keys
 to be able to replace btrfs with e.g. xfs or ext4 easily.
 
 Add new key to define the osd fs type: 'fstype', which can get 
 defined in the [osd] section for all OSDs.
 
  Replace:
  - 'btrfs devs' -> 'devs'
  - 'btrfs path' -> 'fs path'
  - 'btrfs options' -> 'fs options'
  - mkcephfs: replace --mkbtrfs with --mkfs
  - init-ceph: replace --btrfs with --fsmount, --nobtrfs with --nofsmount,
    --btrfsumount with --fsumount
 
 Update documentation, manpage and example config files.
 
  Signed-off-by: Danny Kukawka danny.kuka...@bisect.de
  ---
   doc/man/8/mkcephfs.rst                      |   17 +++-
   man/mkcephfs.8                              |   15 +++-
   src/ceph.conf.twoosds                       |    7 ++--
   src/init-ceph.in                            |   50 +-
   src/mkcephfs.in                             |   60 +--
   src/sample.ceph.conf                        |   15 ---
   src/test/cli/osdmaptool/ceph.conf.withracks |    3 +-
   7 files changed, 95 insertions(+), 72 deletions(-)
 
  diff --git a/doc/man/8/mkcephfs.rst b/doc/man/8/mkcephfs.rst
  index ddc378a..dd3fbd5 100644
  --- a/doc/man/8/mkcephfs.rst
  +++ b/doc/man/8/mkcephfs.rst
  @@ -70,20 +70,15 @@ Options
      default is ``/etc/ceph/keyring`` (or whatever is specified in the config
      file).
  
  -.. option:: --mkbtrfs
  +.. option:: --mkfs
  
  -   Create and mount the any btrfs file systems specified in the
  -   ceph.conf for OSD data storage using mkfs.btrfs. The "btrfs devs"
  -   and (if it differs from "osd data") "btrfs path" options must be
  -   defined.
  +   Create and mount any file system specified in the ceph.conf for
  +   OSD data storage using mkfs. The "devs" and (if it differs from
  +   "osd data") "fs path" options must be defined.
  
      **NOTE** Btrfs is still considered experimental.  This option
  -   can ease some configuration pain, but is the use of btrfs is not
  -   required when ``osd data`` directories are mounted manually by the
  -   adminstrator.
  -
  -   **NOTE** This option is deprecated and will be removed in a future
  -   release.
  +   can ease some configuration pain, but is not required when
  +   ``osd data`` directories are mounted manually by the adminstrator.
  
   .. option:: --no-copy-conf
  
  diff --git a/man/mkcephfs.8 b/man/mkcephfs.8
  index 8544a01..22a5335 100644
  --- a/man/mkcephfs.8
  +++ b/man/mkcephfs.8
  @@ -32,7 +32,7 @@ level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
   .
   .SH SYNOPSIS
   .nf
  -\fBmkcephfs\fP [ \-c \fIceph.conf\fP ] [ \-\-mkbtrfs ] [ \-a, \-\-all\-hosts [ \-k
  +\fBmkcephfs\fP [ \-c \fIceph.conf\fP ] [ \-\-mkfs ] [ \-a, \-\-all\-hosts [ \-k
   \fI/path/to/admin.keyring\fP ] ]
   .fi
   .sp
  @@ -111,19 +111,16 @@ config file).
   .UNINDENT
   .INDENT 0.0
   .TP
  -.B \-\-mkbtrfs
  -Create and mount the any btrfs file systems specified in the
  -ceph.conf for OSD data storage using mkfs.btrfs. The "btrfs devs"
  -and (if it differs from "osd data") "btrfs path" options must be
  +.B \-\-mkfs
  +Create and mount any file systems specified in the
  +ceph.conf for OSD data storage using mkfs.*. The "devs"
  +and (if it differs from "osd data") "fs path" options must be
   defined.
   .sp
   \fBNOTE\fP Btrfs is still considered experimental.  This option
  -can ease some configuration pain, but is the use of btrfs is not
  +can ease some configuration pain, but the use of this option is not
   required when \fBosd data\fP directories are mounted manually by the
   adminstrator.
  -.sp
  -\fBNOTE\fP This option is deprecated and will be removed in a future
  -release.
   .UNINDENT
   .INDENT 0.0
   .TP
  diff --git a/src/ceph.conf.twoosds b/src/ceph.conf.twoosds
  index c0cfc68..05ca754 100644
  --- a/src/ceph.conf.twoosds
  +++ b/src/ceph.conf.twoosds
  @@ -67,7 +67,8 @@
           debug journal = 20
           log dir = /data/cosd$id
           osd data = /mnt/osd$id
  -        btrfs options = flushoncommit,usertrans
  +        fs options = flushoncommit,usertrans
  +        fstype = btrfs
           ;user = root
  
           ;osd journal = /mnt/osd$id/journal
  @@ -75,8 +76,8 @@
           osd journal = /dev/disk/by-path/pci-:05:02.0-scsi-6:0:0:0
           ;filestore max sync interval = 1
  
  -        btrfs devs = /dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0
  -;       btrfs devs = /dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0 \
  +        devs = /dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0
  +;       devs = /dev/disk/by-path/pci-:05:01.0-scsi-2:0:0:0 \
  ;               /dev/disk/by-path/pci-:05:01.0-scsi-3:0:0:0 \
  ;               /dev/disk/by-path/pci-:05:01.0-scsi-4:0:0:0 \
  ;               /dev/disk/by-path/pci-:05:01.0-scsi-5:0:0:0 \
  diff --git a/src/init-ceph.in b/src/init-ceph.in
  index a8c5a29..32bcc9a 100644
  --- a/src/init-ceph.in
  +++ b/src/init-ceph.in
  @@ -100,8 +100,8 @@
   docrun=
   allhosts=0
   debug=0
   monaddr=

'zombie snapshot' problem

2012-11-21 Thread Andrey Korolyov
Hi,

Somehow I have managed to produce an unkillable snapshot, which cannot be
removed and also blocks removal of its parent image:

$ rbd snap purge dev-rack0/vm2
Removing all snapshots: 100% complete...done.
$ rbd rm dev-rack0/vm2
2012-11-21 16:31:24.184626 7f7e0d172780 -1 librbd: image has snapshots
- not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge'
before the image can be removed.
$ rbd snap ls dev-rack0/vm2
SNAPID NAME   SIZE
   188 vm2.snap-yxf 16384 MB
$ rbd info dev-rack0/vm2
rbd image 'vm2':
size 16384 MB in 4096 objects
order 22 (4096 KB objects)
block_name_prefix: rbd_data.1fa164c960874
format: 2
features: layering
$ rbd snap rm --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to remove snapshot: (2) No such file or directory
$ rbd snap create --snap vm2.snap-yxf dev-rack0/vm2
rbd: failed to create snapshot: (17) File exists
$ rbd snap rollback --snap vm2.snap-yxf dev-rack0/vm2
Rolling back to snapshot: 100% complete...done.
$ rbd snap protect --snap vm2.snap-yxf dev-rack0/vm2
$ rbd snap unprotect --snap vm2.snap-yxf dev-rack0/vm2


Meanwhile, ``rbd ls -l dev-rack0'' segfaults; the log is attached.
Is there any reliable way to kill the problematic snapshot?


log-crash.txt.gz
Description: GNU Zip compressed data


RBD Backup

2012-11-21 Thread Stefan Priebe - Profihost AG

Hello list,

is there a recommended way to back up rbd images / disks?

Or is it just
rbd snap create BACKUP
rbd export BACKUP
rbd snap rm BACKUP

Is the snap needed at all? Or is an export alone safe? Is there a way to make
sure the image is consistent?


Is it possible to use the BACKUP file as a loop device or something else
so that I'm able to mount the partitions from the backup file?


Thanks!

Greets Stefan


Re: RBD Backup

2012-11-21 Thread Wido den Hollander

Hi,

On 11/21/2012 09:37 PM, Stefan Priebe - Profihost AG wrote:

Hello list,

is there a recommended way to back up rbd images / disks?

Or is it just
rbd snap create BACKUP
rbd export BACKUP


You should use:

rbd export --snap BACKUP img dest


rbd snap rm BACKUP

Is the snap needed at all? Or is an export alone safe? Is there a way to make
sure the image is consistent?



While reading rbd.cc it doesn't seem like running export on a running VM
is safe, so you should snapshot first.


The snapshot isn't consistent since it has no way of telling the VM to
flush its buffers.


To make it consistent you have to run sync (in the VM) just prior to
creating the snapshot.
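
A minimal sketch of the full sequence, assuming the pool kvmpool1 and image
vm-101-disk-1 from this thread (the sync has to run inside the guest, not on
the host, and it only flushes filesystem buffers, not application state):

ssh guest-vm sync                                        # hypothetical guest host name
rbd -p kvmpool1 snap create --snap BACKUP vm-101-disk-1
rbd -p kvmpool1 export --snap BACKUP vm-101-disk-1 /backup/vm-101-disk-1.img
rbd -p kvmpool1 snap rm --snap BACKUP vm-101-disk-1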



Is it possible to use the BACKUP file as a loop device or something else
so that i'm able to mount the partitions from the backup file?



You can do something like:

rbd export --snap BACKUP image1 /mnt/backup/image1.img
losetup /mnt/backup/image1.img
kpartx -a /dev/loop0

Now you will have the partitions from the RBD image available in 
/dev/mapper/loop0pX


Wido


Thanks!

Greets Stefan


Re: RBD Backup

2012-11-21 Thread Stefan Priebe - Profihost AG

Hi Wido,

thanks for all your explanations.

This doesn't seem to work:

rbd export --snap BACKUP img dest


rbd -p kvmpool1 export --snap BACKUP vm-101-disk-1 /vm-101-disk-1.img 


rbd: error setting snapshot context: (2) No such file or directory

Or should I still create and delete a snapshot named BACKUP before doing
this?


Greets,
Stefan


how to create snapshots

2012-11-21 Thread Stefan Priebe - Profihost AG

Hello list,

I tried to create a snapshot of my disk vm-113-disk-1:

[: ~]# rbd -p kvmpool1 ls
vm-113-disk-1

[: ~]# rbd -p kvmpool1 snap create BACKUP vm-113-disk-1
rbd: extraneous parameter vm-113-disk-1

[: ~]# rbd -p kvmpool1 snap create vm-113-disk-1 BACKUP
rbd: extraneous parameter BACKUP

What's wrong here?

Stefan


Re: how to create snapshots

2012-11-21 Thread Wido den Hollander

Hi,

On 11/21/2012 10:07 PM, Stefan Priebe - Profihost AG wrote:

Hello list,

I tried to create a snapshot of my disk vm-113-disk-1:

[: ~]# rbd -p kvmpool1 ls
vm-113-disk-1

[: ~]# rbd -p kvmpool1 snap create BACKUP vm-113-disk-1
rbd: extraneous parameter vm-113-disk-1

[: ~]# rbd -p kvmpool1 snap create vm-113-disk-1 BACKUP
rbd: extraneous parameter BACKUP

What's wrong here?


Use:

$ rbd -p kvmpool1 snap create --image vm-113-disk-1 BACKUP

rbd -h also tells:

image-name, snap-name are [pool/]name[@snap], or you may specify
individual pieces of names with -p/--pool, --image, and/or --snap.

Never tried it, but you might be able to use:

$ rbd -p kvmpool1 snap create vm-113-disk-1@BACKUP

I don't have access to a running Ceph cluster now to verify this.

Wido



Stefan


Re: RBD Backup

2012-11-21 Thread Wido den Hollander

Hi,

On 11/21/2012 09:56 PM, Stefan Priebe - Profihost AG wrote:

Hi Wido,

thanks for all your explanations.

This doesn't seem to work:

rbd export --snap BACKUP img dest


rbd -p kvmpool1 export --snap BACKUP vm-101-disk-1 /vm-101-disk-1.img
rbd: error setting snapshot context: (2) No such file or directory

Or should I still create and delete a snapshot named BACKUP before doing
this?



Yes, you should create the snapshot first before exporting it. Export 
does not create the snapshot for you.


Wido


Greets,
Stefan


Re: RBD fio Performance concerns

2012-11-21 Thread Mark Nelson

Hi Guys,

I'm late to this thread but thought I'd chime in.  Crazy that you are 
getting higher performance with random reads/writes vs sequential!  It 
would be interesting to see what kind of throughput smalliobench reports 
(should be packaged in bobtail) and also see if this behavior happens 
with cephfs.  It's still too early in the morning for me right now to 
come up with a reasonable explanation for what's going on.  It might be 
worth running blktrace and seekwatcher to see what the io patterns on 
the underlying disk look like in each case.  Maybe something unexpected 
is going on.
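
For example, something along these lines on one of the OSD nodes, assuming
the OSD's data disk is /dev/sdb (device and output names are hypothetical):

blktrace -d /dev/sdb -o osd-disk -w 60    # trace the OSD data disk for 60s while fio runs
seekwatcher -t osd-disk -o osd-disk.png   # plot seek/IO patterns from the blktrace output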


Mark

On 11/19/2012 02:57 PM, Sébastien Han wrote:

Which iodepth did you use for those benchs?



I really don't understand why I can't get more rand read iops with 4K block ...


Me neither, hope to get some clarification from the Inktank guys. It
doesn't make any sense to me...
--
Kind regards.
Sébastien HAN.


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER aderum...@odiso.com wrote:

@Alexandre: is it the same for you? or do you always get more IOPS with seq?


rand read 4K : 6000 iops
seq read 4K : 3500 iops
seq read 4M : 31 iops (1 gigabit client bandwidth limit)

rand write 4k: 6000 iops  (tmpfs journal)
seq write 4k: 1600 iops
seq write 4M : 31 iops (1 gigabit client bandwidth limit)


I really don't understand why I can't get more rand read iops with 4K block ...

I tried with a high-end CPU for the client; it doesn't change anything.
But the test cluster uses old 8-core E5420s @ 2.50GHz (CPU is around 15% on the
cluster during the read bench).


- Mail original -

From: Sébastien Han han.sebast...@gmail.com
To: Mark Kampe mark.ka...@inktank.com
Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel
ceph-devel@vger.kernel.org
Sent: Monday, 19 November 2012 19:03:40
Subject: Re: RBD fio Performance concerns

@Sage, thanks for the info :)
@Mark:


If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).


The original benchmark has been performed with 4M block size. And as
you can see I still get more IOPS with rand than seq... I just tried
with 4M without direct I/O, still the same. I can print fio results if
it's needed.


We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write aggregation.


I know why I use direct I/O. These are synthetic benchmarks, far away
from a real-life scenario and from how common applications work. I just
want to see the maximum I/O throughput that I can get from my RBD. All
my applications use buffered I/O.

@Alexandre: is it the same for you? or do you always get more IOPS with seq?

Thanks to all of you..


On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe mark.ka...@inktank.com wrote:

Recall:
1. RBD volumes are striped (4M wide) across RADOS objects
2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object. All of
your writes are waiting through a very long line, which is adding
horrendous latency.

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write aggregation.



That's correct for some of the benchmarks. However, even with 4K for
seq, I still get fewer IOPS. See my latest fio run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
fio 1.59
Starting 4 processes
Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta
02m:59s]
seq-read: (groupid=0, jobs=1): err= 0: pid=15096
read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec
slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90
clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63
lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62
bw (KB/s) : min= 0, max=14406, per=31.89%, avg=4258.24,
stdev=6239.06
cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279
 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete 

Re: RBD fio Performance concerns

2012-11-21 Thread Mark Nelson

Responding to my own message. :)

Talked to Sage a bit offline about this.  I think there are two opposing 
forces:


On one hand, random IO may be spreading reads/writes out across more 
OSDs than sequential IO that presumably would be hitting a single OSD 
more regularly.


On the other hand, you'd expect that sequential writes would be getting 
coalesced either at the RBD layer or on the OSD, and that the 
drive/controller/filesystem underneath the OSD would be doing some kind 
of readahead or prefetching.


On the third hand, maybe coalescing/prefetching is in fact happening but 
we are IOP limited by some per-osd limitation.


It could be interesting to do the test with a single OSD and see what 
happens.


Mark

On 11/21/2012 09:52 AM, Mark Nelson wrote:

Hi Guys,

I'm late to this thread but thought I'd chime in.  Crazy that you are
getting higher performance with random reads/writes vs sequential!  It
would be interesting to see what kind of throughput smalliobench reports
(should be packaged in bobtail) and also see if this behavior happens
with cephfs.  It's still too early in the morning for me right now to
come up with a reasonable explanation for what's going on.  It might be
worth running blktrace and seekwatcher to see what the io patterns on
the underlying disk look like in each case.  Maybe something unexpected
is going on.

Mark

On 11/19/2012 02:57 PM, Sébastien Han wrote:

Which iodepth did you use for those benchs?



I really don't understand why I can't get more rand read iops with 4K
block ...


Me neither, hope to get some clarification from the Inktank guys. It
doesn't make any sense to me...
--
Kind regards.
Sébastien HAN.


On Mon, Nov 19, 2012 at 8:11 PM, Alexandre DERUMIER
aderum...@odiso.com wrote:

@Alexandre: is it the same for you? or do you always get more IOPS
with seq?


rand read 4K : 6000 iops
seq read 4K : 3500 iops
seq read 4M : 31 iops (1 gigabit client bandwidth limit)

rand write 4k: 6000 iops  (tmpfs journal)
seq write 4k: 1600 iops
seq write 4M : 31 iops (1 gigabit client bandwidth limit)


I really don't understand why I can't get more rand read iops with 4K
block ...

I tried with a high-end CPU for the client; it doesn't change anything.
But the test cluster uses old 8-core E5420s @ 2.50GHz (CPU is around
15% on the cluster during the read bench).


- Mail original -

From: Sébastien Han han.sebast...@gmail.com
To: Mark Kampe mark.ka...@inktank.com
Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel
ceph-devel@vger.kernel.org
Sent: Monday, 19 November 2012 19:03:40
Subject: Re: RBD fio Performance concerns

@Sage, thanks for the info :)
@Mark:


If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).


The original benchmark has been performed with 4M block size. And as
you can see I still get more IOPS with rand than seq... I just tried
with 4M without direct I/O, still the same. I can print fio results if
it's needed.


We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write
aggregation.


I know why I use direct I/O. These are synthetic benchmarks, far away
from a real-life scenario and from how common applications work. I just
want to see the maximum I/O throughput that I can get from my RBD. All
my applications use buffered I/O.

@Alexandre: is it the same for you? or do you always get more IOPS
with seq?

Thanks to all of you..


On Mon, Nov 19, 2012 at 5:54 PM, Mark Kampe mark.ka...@inktank.com
wrote:

Recall:
1. RBD volumes are striped (4M wide) across RADOS objects
2. distinct writes to a single RADOS object are serialized

Your sequential 4K writes are direct, depth=256, so there are
(at all times) 256 writes queued to the same object. All of
your writes are waiting through a very long line, which is adding
horrendous latency.

If you want to do sequential I/O, you should do it buffered
(so that the writes can be aggregated) or with a 4M block size
(very efficient and avoiding object serialization).

We do direct writes for benchmarking, not because it is a reasonable
way to do I/O, but because it bypasses the buffer cache and enables
us to directly measure cluster I/O throughput (which is what we are
trying to optimize). Applications should usually do buffered I/O,
to get the (very significant) benefits of caching and write
aggregation.



That's correct for some of the benchmarks. However, even with 4K for
seq, I still get fewer IOPS. See my latest fio run below:

# fio rbd-bench.fio
seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256
rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio,
iodepth=256
seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, 

Re: Hadoop and Ceph client/mds view of modification time

2012-11-21 Thread Noah Watkins
(Sorry for the dupe message. vger rejected due to HTML).

Thanks, I'll try this patch this morning.

Client B should perform a single stat after a notification from Client
A. But, won't Sage's patch still be required, since Client A needs the
MDS time to pass to Client B?

On Tue, Nov 20, 2012 at 12:20 PM, Sam Lang sam.l...@inktank.com wrote:
 On 11/20/2012 01:44 PM, Noah Watkins wrote:

 This is a description of the clock synchronization issue we are facing
 in Hadoop:

 Components of Hadoop use mtime as a versioning mechanism. Here is an
 example where Client B tests the expected 'version' of a file created
 by Client A:

   Client A: create file, write data into file.
   Client A: expected_mtime <-- lstat(file)
   Client A: broadcast expected_mtime to client B
   ...
   Client B: mtime <-- lstat(file)
   Client B: test expected_mtime == mtime


 Here's a patch that might work to push the setattr out to the mds every time
 (the same as Sage's patch for getattr).  This isn't quite writeback, as it
 waits for the setattr at the server to complete before returning, but I
 think that's actually what you want in this case.  It needs to be enabled by
 setting client setattr writethru = true in the config.  Also, I haven't
 tested that it sends the setattr, just a basic test of functionality.

 BTW, if it's always client B's first stat of the file, you won't need Sage's
 patch.

 -sam

 diff --git a/src/client/Client.cc b/src/client/Client.cc
 index 8d4a5ac..a7dd8f7 100644
 --- a/src/client/Client.cc
 +++ b/src/client/Client.cc
 @@ -4165,6 +4165,7 @@ int Client::_getattr(Inode *in, int mask, int uid, int gid)
  
  int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
  {
 +  int orig_mask = mask;
    int issued = in->caps_issued();
  
    ldout(cct, 10) << "_setattr mask " << mask << " issued " << ccap_string(issued) << dendl;
 @@ -4219,7 +4220,7 @@ int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
        mask &= ~(CEPH_SETATTR_MTIME|CEPH_SETATTR_ATIME);
      }
    }
 -  if (!mask)
 +  if (!cct->_conf->client_setattr_writethru && !mask)
      return 0;
  
    MetaRequest *req = new MetaRequest(CEPH_MDS_OP_SETATTR);
 @@ -4229,6 +4230,10 @@ int Client::_setattr(Inode *in, struct stat *attr, int mask, int uid, int gid)
    req->set_filepath(path);
    req->inode = in;
  
 +  // reset mask back to original if we're meant to do writethru
 +  if (cct->_conf->client_setattr_writethru)
 +    mask = orig_mask;
 +
    if (mask & CEPH_SETATTR_MODE) {
      req->head.args.setattr.mode = attr->st_mode;
      req->inode_drop |= CEPH_CAP_AUTH_SHARED;
 diff --git a/src/common/config_opts.h b/src/common/config_opts.h
 index cc05095..51a2769 100644
 --- a/src/common/config_opts.h
 +++ b/src/common/config_opts.h
 @@ -178,6 +178,7 @@ OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 100)  // MB * n  (dirty OR tx.
  OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep this smallish)
  OPTION(client_oc_max_dirty_age, OPT_DOUBLE, 5.0)  // max age in cache before writeback
  OPTION(client_oc_max_objects, OPT_INT, 1000)  // max objects in cache
 +OPTION(client_setattr_writethru, OPT_BOOL, false)  // send the attributes to the mds server
  // note: the max amount of in flight dirty data is roughly (max - target)
  OPTION(fuse_use_invalidate_cb, OPT_BOOL, false) // use fuse 2.8+ invalidate callback to keep page cache consistent
  OPTION(fuse_big_writes, OPT_BOOL, true)


 Since mtime may be set in Ceph by both the client and the MDS, an inconsistent
 mtime view is possible when clocks are not adequately synchronized.

 Here is a test that reproduces the problem. In the following output,
 issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time
 set to an hour earlier than the MDS node.

 nwatkins@issdm-22:~$ ssh issdm-18 date  ./test
 Tue Nov 20 11:40:28 PST 2012   // MDS TIME
 local time: Tue Nov 20 10:42:47 2012  // Client TIME
 fstat time: Tue Nov 20 11:40:28 2012  // mtime seen after file
 creation (MDS time)
 lstat time: Tue Nov 20 10:42:47 2012  // mtime seen after file write
 (client time)

 Here is the code used to produce that output.

 #include <errno.h>
 #include <sys/fcntl.h>
 #include <sys/time.h>
 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <dirent.h>
 #include <sys/xattr.h>
 #include <stdio.h>
 #include <string.h>
 #include <assert.h>
 #include <cephfs/libcephfs.h>
 #include <time.h>

 int main(int argc, char **argv)
 {
   struct stat st;
   struct ceph_mount_info *cmount;
   struct timeval tv;

   /* setup */
   ceph_create(&cmount, "admin");
   ceph_conf_read_file(cmount, "/users/nwatkins/Projects/ceph.conf");
   ceph_mount(cmount, "/");

   /* print local time for reference */
   gettimeofday(&tv, NULL);
   printf("local time: %s", ctime(&tv.tv_sec));

   /* create a file */
   char buf[256];
   sprintf(buf, "/somefile.%d", getpid());
   int fd = 

Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Sage Weil
On Tue, 20 Nov 2012, Nick Bartos wrote:
 Since I now have a decent script which can reproduce this, I decided
 to re-test with the same 3.5.7 kernel, but just not applying the
 patches from the wip-3.5 branch.  With the patches, I can only go 2
 builds before I run into a hang.  Without the patches, I have gone 9
 consecutive builds (and still going) without seeing the hang.  So it
 seems like a reasonable assumption that the problem was introduced in
 one of those patches.
 
 We started seeing the problem before applying all the 3.5 patches, so
 it seems like one of these is the culprit:
 
 1-libceph-encapsulate-out-message-data-setup.patch
 2-libceph-dont-mark-footer-complete-before-it-is.patch
 3-libceph-move-init-of-bio_iter.patch
 4-libceph-dont-use-bio_iter-as-a-flag.patch
 5-libceph-resubmit-linger-ops-when-pg-mapping-changes.patch
 6-libceph-re-initialize-bio_iter-on-start-of-message-receive.patch
 7-ceph-close-old-con-before-reopening-on-mds-reconnect.patch
 8-libceph-protect-ceph_con_open-with-mutex.patch
 9-libceph-reset-connection-retry-on-successfully-negotiation.patch
 10-rbd-only-reset-capacity-when-pointing-to-head.patch
 11-rbd-set-image-size-when-header-is-updated.patch
 12-libceph-fix-crypto-key-null-deref-memory-leak.patch
 13-ceph-tolerate-and-warn-on-extraneous-dentry-from-mds.patch
 14-ceph-avoid-divide-by-zero-in-__validate_layout.patch
 15-rbd-drop-dev-reference-on-error-in-rbd_open.patch
 16-ceph-Fix-oops-when-handling-mdsmap-that-decreases-max_mds.patch
 17-libceph-check-for-invalid-mapping.patch
 18-ceph-propagate-layout-error-on-osd-request-creation.patch
 19-rbd-BUG-on-invalid-layout.patch
 20-ceph-return-EIO-on-invalid-layout-on-GET_DATALOC-ioctl.patch
 21-ceph-avoid-32-bit-page-index-overflow.patch
 23-ceph-fix-dentry-reference-leak-in-encode_fh.patch
 
 I'll start doing some other builds to try and narrow down the patch
 introducing the problem more specifically.

Thanks for hunting this down.  I'm very curious what the culprit is...

sage



 
 
 On Tue, Nov 20, 2012 at 1:53 PM, Nick Bartos n...@pistoncloud.com wrote:
  I reproduced the problem and got several sysrq states captured.
  During this run, the monitor running on the host complained a few
  times about the clocks being off, but all messages were for under 0.55
  seconds.
 
  Here are the kernel logs.  Note that there are several traces, I
  thought multiple during the incident may help:
  https://raw.github.com/gist/4121395/a6dda7552ed8a45725ee5d632fe3ba38703f8cfc/gistfile1.txt
 
 
  On Mon, Nov 19, 2012 at 3:34 PM, Gregory Farnum g...@inktank.com wrote:
  Hmm, yep, that param is actually only used for the warning; I guess
  we forgot what it actually covers. :(
 
  Have your monitor clocks been off by more than 5 seconds at any point?
 
  On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos n...@pistoncloud.com wrote:
  Making 'mon clock drift allowed' very small (0.1) does not
  reliably reproduce the hang.  I started looking at the code for 0.48.2
  and it looks like this is only used in Paxos::warn_on_future_time,
  which only handles the warning, nothing else.
 
 
  On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil s...@inktank.com wrote:
  On Fri, 16 Nov 2012, Nick Bartos wrote:
  Should I be lowering the clock drift allowed, or the lease interval to
  help reproduce it?
 
  clock drift allowed.
 
 
 
 
  On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil s...@inktank.com wrote:
   You can safely set the clock drift allowed as high as 500ms.  The real
   limitation is that it needs to be well under the lease interval, 
   which is
   currently 5 seconds by default.
  
   You might be able to reproduce more easily by lowering the 
   threshold...
  
   sage
  
  
   On Fri, 16 Nov 2012, Nick Bartos wrote:
  
   How far off do the clocks need to be before there is a problem?  It
   would seem to be hard to ensure a very large cluster has all of its
   nodes synchronized within 50ms (which seems to be the default for 
   mon
   clock drift allowed).  Does the mon clock drift allowed parameter
   change anything other than the log messages?  Are there any other
   tuning options that may help, assuming that this is the issue and 
   it's
   not feasible to get the clocks more than 500ms in sync between all
   nodes?
  
   I'm trying to get a good way of reproducing this and get a trace on
   the ceph processes to see what they're waiting on.  I'll let you know
   when I have more info.
  
  
   On Fri, Nov 16, 2012 at 11:16 AM, Sage Weil s...@inktank.com wrote:
I just realized I was mixing up this thread with the other deadlock
thread.
   
On Fri, 16 Nov 2012, Nick Bartos wrote:
Turns out we're having the 'rbd map' hang on startup again, after 
we
started using the wip-3.5 patch set.  How critical is the
libceph_protect_ceph_con_open_with_mutex commit?  That's the one I
removed before which seemed to get rid of the problem (although 
I'm
not completely sure if it completely got rid of 

Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Nick Bartos
It's really looking like it's the
libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
 So far I have gone through 4 successful installs with no hang with
only 1-49 applied.  I'm still leaving my test run to make sure it's
not a fluke, but since previously it hangs within the first couple of
builds, it really looks like this is where the problem originated.

1-libceph_eliminate_connection_state_DEAD.patch
2-libceph_kill_bad_proto_ceph_connection_op.patch
3-libceph_rename_socket_callbacks.patch
4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
6-libceph_start_separating_connection_flags_from_state.patch
7-libceph_start_tracking_connection_socket_state.patch
8-libceph_provide_osd_number_when_creating_osd.patch
9-libceph_set_CLOSED_state_bit_in_con_init.patch
10-libceph_embed_ceph_connection_structure_in_mon_client.patch
11-libceph_drop_connection_refcounting_for_mon_client.patch
12-libceph_init_monitor_connection_when_opening.patch
13-libceph_fully_initialize_connection_in_con_init.patch
14-libceph_tweak_ceph_alloc_msg.patch
15-libceph_have_messages_point_to_their_connection.patch
16-libceph_have_messages_take_a_connection_reference.patch
17-libceph_make_ceph_con_revoke_a_msg_operation.patch
18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
19-libceph_fix_overflow_in___decode_pool_names.patch
20-libceph_fix_overflow_in_osdmap_decode.patch
21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
22-libceph_transition_socket_state_prior_to_actual_connect.patch
23-libceph_fix_NULL_dereference_in_reset_connection.patch
24-libceph_use_con_get_put_methods.patch
25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
26-libceph_encapsulate_out_message_data_setup.patch
27-libceph_encapsulate_advancing_msg_page.patch
28-libceph_don_t_mark_footer_complete_before_it_is.patch
29-libceph_move_init_bio__functions_up.patch
30-libceph_move_init_of_bio_iter.patch
31-libceph_don_t_use_bio_iter_as_a_flag.patch
32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
33-libceph_don_t_change_socket_state_on_sock_event.patch
34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
35-libceph_don_t_touch_con_state_in_con_close_socket.patch
36-libceph_clear_CONNECTING_in_ceph_con_close.patch
37-libceph_clear_NEGOTIATING_when_done.patch
38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
39-libceph_separate_banner_and_connect_writes.patch
40-libceph_distinguish_two_phases_of_connect_sequence.patch
41-libceph_small_changes_to_messenger.c.patch
42-libceph_add_some_fine_ASCII_art.patch
43-libceph_set_peer_name_on_con_open_not_init.patch
44-libceph_initialize_mon_client_con_only_once.patch
45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
46-libceph_initialize_msgpool_message_types.patch
47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
48-libceph_report_socket_read_write_error_message.patch
49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch
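
For reference, each incremental build boils down to something like this
(a sketch; the kernel tree and patch directory paths are hypothetical, the
patch files are the ones listed above):

cd linux-3.5.7
for i in $(seq 1 49); do     # 1-49 builds clean; adding patch 50 reproduces the hang
    patch -p1 -i ../wip-3.5/${i}-*.patch || exit 1
done
# then rebuild the kernel, install it on the test nodes and re-run the cluster install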


On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil s...@inktank.com wrote:
 Thanks for hunting this down.  I'm very curious what the culprit is...

 sage


Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Weil

Am 20.11.2012 13:44, schrieb Stefan Priebe:

rbd / rados quite often returns the length of writes
or discarded blocks. These values might be bigger than int.

Signed-off-by: Stefan Priebe s.pri...@profihost.ag
---
  block/rbd.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f57d0c6..6bf9c2e 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -69,7 +69,7 @@ typedef enum {
  typedef struct RBDAIOCB {
  BlockDriverAIOCB common;
  QEMUBH *bh;
-int ret;
+int64_t ret;
  QEMUIOVector *qiov;
  char *bounce;
  RBDAIOCmd cmd;
@@ -87,7 +87,7 @@ typedef struct RADOSCB {
  int done;
  int64_t size;
  char *buf;
-int ret;
+int64_t ret;
  } RADOSCB;
  
  #define RBD_FD_READ 0



Why do you use int64_t instead of off_t?
If the value is related to file sizes, off_t would be a good choice.

Stefan W.




Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Nick Bartos
With 8 successful installs already done, I'm reasonably confident that
it's patch #50.  I'm making another build which applies all patches
from the 3.5 backport branch, excluding that specific one.  I'll let
you know if that turns up any unexpected failures.

What will the potential fallout be of removing that specific patch?


On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos n...@pistoncloud.com wrote:
 It's really looking like it's the
 libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
 patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
  So far I have gone through 4 successful installs with no hang with
 only 1-49 applied.  I'm still leaving my test run to make sure it's
 not a fluke, but since previously it hangs within the first couple of
 builds, it really looks like this is where the problem originated.

 1-libceph_eliminate_connection_state_DEAD.patch
 2-libceph_kill_bad_proto_ceph_connection_op.patch
 3-libceph_rename_socket_callbacks.patch
 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
 6-libceph_start_separating_connection_flags_from_state.patch
 7-libceph_start_tracking_connection_socket_state.patch
 8-libceph_provide_osd_number_when_creating_osd.patch
 9-libceph_set_CLOSED_state_bit_in_con_init.patch
 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
 11-libceph_drop_connection_refcounting_for_mon_client.patch
 12-libceph_init_monitor_connection_when_opening.patch
 13-libceph_fully_initialize_connection_in_con_init.patch
 14-libceph_tweak_ceph_alloc_msg.patch
 15-libceph_have_messages_point_to_their_connection.patch
 16-libceph_have_messages_take_a_connection_reference.patch
 17-libceph_make_ceph_con_revoke_a_msg_operation.patch
 18-libceph_make_ceph_con_revoke_message_a_msg_op.patch
 19-libceph_fix_overflow_in___decode_pool_names.patch
 20-libceph_fix_overflow_in_osdmap_decode.patch
 21-libceph_fix_overflow_in_osdmap_apply_incremental.patch
 22-libceph_transition_socket_state_prior_to_actual_connect.patch
 23-libceph_fix_NULL_dereference_in_reset_connection.patch
 24-libceph_use_con_get_put_methods.patch
 25-libceph_drop_ceph_con_get_put_helpers_and_nref_member.patch
 26-libceph_encapsulate_out_message_data_setup.patch
 27-libceph_encapsulate_advancing_msg_page.patch
 28-libceph_don_t_mark_footer_complete_before_it_is.patch
 29-libceph_move_init_bio__functions_up.patch
 30-libceph_move_init_of_bio_iter.patch
 31-libceph_don_t_use_bio_iter_as_a_flag.patch
 32-libceph_SOCK_CLOSED_is_a_flag_not_a_state.patch
 33-libceph_don_t_change_socket_state_on_sock_event.patch
 34-libceph_just_set_SOCK_CLOSED_when_state_changes.patch
 35-libceph_don_t_touch_con_state_in_con_close_socket.patch
 36-libceph_clear_CONNECTING_in_ceph_con_close.patch
 37-libceph_clear_NEGOTIATING_when_done.patch
 38-libceph_define_and_use_an_explicit_CONNECTED_state.patch
 39-libceph_separate_banner_and_connect_writes.patch
 40-libceph_distinguish_two_phases_of_connect_sequence.patch
 41-libceph_small_changes_to_messenger.c.patch
 42-libceph_add_some_fine_ASCII_art.patch
 43-libceph_set_peer_name_on_con_open_not_init.patch
 44-libceph_initialize_mon_client_con_only_once.patch
 45-libceph_allow_sock_transition_from_CONNECTING_to_CLOSED.patch
 46-libceph_initialize_msgpool_message_types.patch
 47-libceph_prevent_the_race_of_incoming_work_during_teardown.patch
 48-libceph_report_socket_read_write_error_message.patch
 49-libceph_fix_mutex_coverage_for_ceph_con_close.patch
 50-libceph_resubmit_linger_ops_when_pg_mapping_changes.patch


 On Wed, Nov 21, 2012 at 8:50 AM, Sage Weil s...@inktank.com wrote:
 Thanks for hunting this down.  I'm very curious what the culprit is...

 sage


Re: Files lost after mds rebuild

2012-11-21 Thread Gregory Farnum
On Tue, Nov 20, 2012 at 8:28 PM, Drunkard Zhang gongfan...@gmail.com wrote:
 2012/11/21 Gregory Farnum g...@inktank.com:
 No, absolutely not. There is no relationship between different RADOS
 pools. If you've been using the cephfs tool to place some filesystem
 data in different pools then your configuration is a little more
 complicated (have you done that?), but deleting one pool is never
 going to remove data from the others.
 -Greg

 I think that should be a bug. Here's what I did:
 I created a directory 'audit' in the running ceph filesystem, and put
 some data into the directory (about 100GB) before these commands:
 ceph osd pool create audit
 ceph mds add_data_pool 4
 cephfs /mnt/temp/audit/ set_layout -p 4

 log3 ~ # ceph osd dump | grep audit
 pool 4 'audit' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
 8 pgp_num 8 last_change 1558 owner 0

 At this point, all data in audit was still usable. After 'ceph osd pool
 delete data', the disk space was reclaimed (I forgot to test whether the data
 was still usable); only 200MB were in use, according to 'ceph -s'. So here's
 what I'm thinking: data stored before the pool was created doesn't follow the
 new pool, it still follows the default pool 'data'. Is this a bug, or intended
 behavior?

Oh, I see. Data is not moved when you set directory layouts; it only
impacts files created after that point. This is intended behavior —
Ceph would need to copy the data around anyway in order to make it
follow the pool. There's no sense in hiding that from the user,
especially given the complexity involved in doing so safely, and
there are many use cases where you want the files in
different pools.
-Greg
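
For example (a sketch using the old cephfs tool, with the directory and pool
id from this thread; the file names are hypothetical):

cephfs /mnt/temp/audit/ set_layout -p 4          # layout for files created from now on
dd if=/dev/zero of=/mnt/temp/audit/new-file bs=1M count=16
cephfs /mnt/temp/audit/new-file show_layout      # created after set_layout -> pool 4
cephfs /mnt/temp/audit/old-file show_layout      # created before set_layout -> still pool 0 ('data')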


Re: does still not recommended place rbd device on nodes, where osd daemon located?

2012-11-21 Thread Gregory Farnum
On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov ruslan.usi...@gmail.com wrote:
 So, it is not possible to use ceph as a scalable block device without virtualization?

I'm not sure I understand, but if you're trying to take a bunch of
compute nodes and glue their disks together, no, that's not a
supported use case at this time. There are a number of deadlock issues
caused by this sort of loopback; it's the same reason you shouldn't
mount NFS on the server host.
We may in the future manage to release an rbd-fuse client that you can
use to do this with a little less pain, but it's not ready at this
point.
-Greg


Re: does still not recommended place rbd device on nodes, where osd daemon located?

2012-11-21 Thread ruslan usifov
Yes, I mean exactly this. It's a great pity :-( Maybe there is some ceph
equivalent that solves my problem?

2012/11/21 Gregory Farnum g...@inktank.com:
 On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov ruslan.usi...@gmail.com 
 wrote:
  So, it is not possible to use ceph as a scalable block device without virtualization?

 I'm not sure I understand, but if you're trying to take a bunch of
 compute nodes and glue their disks together, no, that's not a
 supported use case at this time. There are a number of deadlock issues
 caused by this sort of loopback; it's the same reason you shouldn't
 mount NFS on the server host.
 We may in the future manage to release an rbd-fuse client that you can
 use to do this with a little less pain, but it's not ready at this
 point.
 -Greg


Re: does still not recommended place rbd device on nodes, where osd daemon located?

2012-11-21 Thread Dan Mick
Still not certain I'm understanding *just* what you mean, but I'll point 
out that you can set up a cluster with rbd images, mount them from a 
separate non-virtualized host with kernel rbd, and expand those images 
and take advantage of the newly-available space on the separate host, 
just as though you were expanding a RAID device.  Maybe that fits your 
use case, Ruslan?
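
As a rough sketch of that pattern (image, pool and mount point names are
hypothetical; with older kernels you may need to unmap and remap the image
before the new size becomes visible):

rbd create bigvol --size 102400          # 100 GB image in the default pool
rbd map bigvol                           # on the separate client host -> /dev/rbd0
mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /mnt/bigvol
# later, when more space is needed:
rbd resize bigvol --size 204800          # grow the image to 200 GB
xfs_growfs /mnt/bigvol                   # grow the filesystem into the new space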


On 11/21/2012 12:05 PM, ruslan usifov wrote:

Yes, I mean exactly this. It's a great pity :-( Maybe there is some ceph
equivalent that solves my problem?

2012/11/21 Gregory Farnum g...@inktank.com:

On Wed, Nov 21, 2012 at 4:33 AM, ruslan usifov ruslan.usi...@gmail.com wrote:

So, it is not possible to use ceph as a scalable block device without virtualization?


I'm not sure I understand, but if you're trying to take a bunch of
compute nodes and glue their disks together, no, that's not a
supported use case at this time. There are a number of deadlock issues
caused by this sort of loopback; it's the same reason you shouldn't
mount NFS on the server host.
We may in the future manage to release an rbd-fuse client that you can
use to do this with a little less pain, but it's not ready at this
point.
-Greg



Re: [Qemu-devel] [PATCH] use int64_t for return values from rbd instead of int

2012-11-21 Thread Stefan Priebe - Profihost AG
Not sure about off_t. What are its minimum and maximum sizes?

Stefan

Am 21.11.2012 um 18:03 schrieb Stefan Weil s...@weilnetz.de:

 Am 20.11.2012 13:44, schrieb Stefan Priebe:
 rbd / rados quite often returns the length of writes
 or discarded blocks. These values might be bigger than int.
 
 Signed-off-by: Stefan Priebe s.pri...@profihost.ag
 ---
  block/rbd.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/block/rbd.c b/block/rbd.c
 index f57d0c6..6bf9c2e 100644
 --- a/block/rbd.c
 +++ b/block/rbd.c
 @@ -69,7 +69,7 @@ typedef enum {
  typedef struct RBDAIOCB {
  BlockDriverAIOCB common;
  QEMUBH *bh;
 -int ret;
 +int64_t ret;
  QEMUIOVector *qiov;
  char *bounce;
  RBDAIOCmd cmd;
 @@ -87,7 +87,7 @@ typedef struct RADOSCB {
  int done;
  int64_t size;
  char *buf;
 -int ret;
 +int64_t ret;
  } RADOSCB;
#define RBD_FD_READ 0
 
 
 Why do you use int64_t instead of off_t?
 If the value is related to file sizes, off_t would be a good choice.
 
 Stefan W.
 
 