Re: page allocation failures on osd nodes

2013-01-28 Thread Sam Lang
On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov and...@xdel.ru wrote:

 Ahem. Once, on an almost empty node, the same trace was produced by a qemu
 process (which was actually pinned to a specific NUMA node), so it seems
 that this is generally some scheduler/mm bug, not directly related to the
 osd processes. In other words, the lower the percentage of memory that is
 actually resident (RSS), the higher the probability of such an allocation
 failure.

This might be a known bug in xen for your kernel?  The xen users list
might be able to help.
-sam


Re: RadosGW performance and disk space usage

2013-01-28 Thread Joao Eduardo Luis

On 01/27/2013 11:10 PM, Cesar Mello wrote:

Hi,

Just tried rest-bench. This little tool is wonderful, thanks!

I still have to learn lots of things. So please don't spend much time
explaining me, but instead please give me any pointers to
documentation or source code that can be useful. As a curiosity, I'm
pasting the results from my laptop. I'll repeat the same tests using
my desktop as the server.

Notice there is an assert being triggered, so I guess I'm running a
build with debugging code ?!. I compiled using ./configure
--with-radosgw --with-rest-bench followed by make.


Asserts are usually used to mark invariants in the code logic, and they are 
always built in, regardless of whether debugging is enabled or disabled.  Given 
you are hitting one, it probably means something is not quite right (it might 
be a bug, or some invariant was broken for some reason).
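To illustrate what "always built" means in practice, here is a minimal
stand-alone sketch (not Ceph's actual assert macro) of an invariant check that
stays compiled in even when NDEBUG would disable the standard assert():

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch only: unlike assert() from <assert.h>, this check is not
     * removed when the code is compiled with -DNDEBUG. */
    #define ALWAYS_ASSERT(expr)                                             \
            do {                                                            \
                    if (!(expr)) {                                          \
                            fprintf(stderr, "FAILED assert(%s) at %s:%d\n", \
                                    #expr, __FILE__, __LINE__);             \
                            abort();                                        \
                    }                                                       \
            } while (0)

    int main(void)
    {
            int threads_remaining = 1;              /* hypothetical state */
            ALWAYS_ASSERT(threads_remaining == 0);  /* aborts, NDEBUG or not */
            return 0;
    }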



common/WorkQueue.cc: In function 'virtual ThreadPool::~ThreadPool()'
thread 7f1211401780 time 2013-01-27 20:51:01.196525
common/WorkQueue.cc: 59: FAILED assert(_threads.empty())
  ceph version 0.56-464-gfa421cf (fa421cf5f52ca16fa1328dbea2f4bda85c56cd3f)
  1: (ThreadPool::~ThreadPool()+0x10c) [0x43bf9c]
  2: (RESTDispatcher::~RESTDispatcher()+0xf1) [0x42a021]
  3: (main()+0x75b) [0x42521b]
  4: (__libc_start_main()+0xed) [0x7f120f37576d]
  5: rest-bench() [0x426079]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.


Looks like http://tracker.newdream.net/issues/3896

I'm not sure who should be made aware of this issue, though. Maybe Yehuda 
(cc'ing him)?


  -Joao



2013-01-27 20:51:01.197253 7f1211401780 -1 common/WorkQueue.cc: In
function 'virtual ThreadPool::~ThreadPool()' thread 7f1211401780 time
2013-01-27 20:51:01.196525
common/WorkQueue.cc: 59: FAILED assert(_threads.empty())

  ceph version 0.56-464-gfa421cf (fa421cf5f52ca16fa1328dbea2f4bda85c56cd3f)
  1: (ThreadPool::~ThreadPool()+0x10c) [0x43bf9c]
  2: (RESTDispatcher::~RESTDispatcher()+0xf1) [0x42a021]
  3: (main()+0x75b) [0x42521b]
  4: (__libc_start_main()+0xed) [0x7f120f37576d]
  5: rest-bench() [0x426079]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- begin dump of recent events ---
-11 2013-01-27 20:49:09.292227 7f1211401780  5 asok(0x29dc270)
register_command perfcounters_dump hook 0x29dc440
-10 2013-01-27 20:49:09.292259 7f1211401780  5 asok(0x29dc270)
register_command 1 hook 0x29dc440
 -9 2013-01-27 20:49:09.292262 7f1211401780  5 asok(0x29dc270)
register_command perf dump hook 0x29dc440
 -8 2013-01-27 20:49:09.292271 7f1211401780  5 asok(0x29dc270)
register_command perfcounters_schema hook 0x29dc440
 -7 2013-01-27 20:49:09.292275 7f1211401780  5 asok(0x29dc270)
register_command 2 hook 0x29dc440
 -6 2013-01-27 20:49:09.292278 7f1211401780  5 asok(0x29dc270)
register_command perf schema hook 0x29dc440
 -5 2013-01-27 20:49:09.292285 7f1211401780  5 asok(0x29dc270)
register_command config show hook 0x29dc440
 -4 2013-01-27 20:49:09.292290 7f1211401780  5 asok(0x29dc270)
register_command config set hook 0x29dc440
 -3 2013-01-27 20:49:09.292292 7f1211401780  5 asok(0x29dc270)
register_command log flush hook 0x29dc440
 -2 2013-01-27 20:49:09.292296 7f1211401780  5 asok(0x29dc270)
register_command log dump hook 0x29dc440
 -1 2013-01-27 20:49:09.292300 7f1211401780  5 asok(0x29dc270)
register_command log reopen hook 0x29dc440
  0 2013-01-27 20:51:01.197253 7f1211401780 -1
common/WorkQueue.cc: In function 'virtual ThreadPool::~ThreadPool()'
thread 7f1211401780 time 2013-01-27 20:51:01.196525
common/WorkQueue.cc: 59: FAILED assert(_threads.empty())

  ceph version 0.56-464-gfa421cf (fa421cf5f52ca16fa1328dbea2f4bda85c56cd3f)
  1: (ThreadPool::~ThreadPool()+0x10c) [0x43bf9c]
  2: (RESTDispatcher::~RESTDispatcher()+0xf1) [0x42a021]
  3: (main()+0x75b) [0x42521b]
  4: (__libc_start_main()+0xed) [0x7f120f37576d]
  5: rest-bench() [0x426079]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
   -2/-2 (syslog threshold)
   99/99 (stderr threshold)
   max_recent10
   max_new 1000
   log_file
--- end dump of recent events ---
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
  in thread 

[PATCH 1/3] configure: fix check for fuse_getgroups()

2013-01-28 Thread Danny Al-Gaaf
Check for fuse_getgroups() only if we have already found libfuse.
Move the check into the --with-fuse check.

Small fix: fix the NO_ATOMIC_OPS description string so that it doesn't use an apostrophe.

Signed-off-by: Danny Al-Gaaf danny.al-g...@bisect.de
---
 configure.ac | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/configure.ac b/configure.ac
index ffbd150..d6cff80 100644
--- a/configure.ac
+++ b/configure.ac
@@ -241,6 +241,8 @@ AS_IF([test x$with_fuse != xno],
AC_DEFINE([HAVE_LIBFUSE], [1],
  [Define if you have fuse])
HAVE_LIBFUSE=1
+  # look for fuse_getgroups and define FUSE_GETGROUPS if found
+  AC_CHECK_FUNCS([fuse_getgroups])
   ],
  [AC_MSG_FAILURE(
[no FUSE found (use --without-fuse to disable)])])])
@@ -391,7 +393,8 @@ AS_IF([test x$with_libatomic_ops != xno],
   ])])
 AS_IF([test $HAVE_ATOMIC_OPS = 1],
[],
-   AC_DEFINE([NO_ATOMIC_OPS], [1], [Defined if you don't have atomic_ops]))
+   [AC_DEFINE([NO_ATOMIC_OPS], [1], [Defined if you do not have atomic_ops])])
+
 AM_CONDITIONAL(WITH_LIBATOMIC, [test $HAVE_ATOMIC_OPS = 1])
 
 # newsyn?  requires mpi.
@@ -417,9 +420,6 @@ AS_IF([test x$with_system_leveldb = xcheck],
	[AC_CHECK_LIB([leveldb], [leveldb_open], [with_system_leveldb=yes], [], [-lsnappy -lpthread])])
 AM_CONDITIONAL(WITH_SYSTEM_LEVELDB, [ test $with_system_leveldb = yes ])
 
-# look for fuse_getgroups and define FUSE_GETGROUPS if found
-AC_CHECK_FUNCS([fuse_getgroups])
-
 # use system libs3?
 AC_ARG_WITH([system-libs3],
[AS_HELP_STRING([--with-system-libs3], [use system libs3])],
-- 
1.8.1.1



[PATCH 0/3] fix some rbd-fuse related issues

2013-01-28 Thread Danny Al-Gaaf
Here are three patches to fix some issues with the new rbd-fuse
code and an issue with the fuse handling in configure.

Danny Al-Gaaf (3):
  configure: fix check for fuse_getgroups()
  rbd-fuse: fix usage of conn->want
  rbd-fuse: fix printf format for off_t and size_t

 configure.ac|  8 
 src/rbd_fuse/rbd-fuse.c | 12 +++-
 2 files changed, 11 insertions(+), 9 deletions(-)

-- 
1.8.1.1



[PATCH 3/3] rbd-fuse: fix printf format for off_t and size_t

2013-01-28 Thread Danny Al-Gaaf
Fix the printf formats for off_t and size_t so they print the same on 32-bit
and 64-bit systems. Use the PRI* macros from inttypes.h.
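As a stand-alone illustration of the idea (not the patch itself): casting to
the maximum-width integer types and using the matching PRI* macros keeps the
format strings correct on both 32-bit and 64-bit builds.

    #include <inttypes.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Sketch: off_t and size_t have different widths on 32-bit and 64-bit
     * systems, so cast to intmax_t/uintmax_t and let the PRI* macros pick
     * the right length modifier. */
    static void report(const char *path, off_t offset, size_t size)
    {
            printf("%s: offset=%" PRIdMAX " size=%" PRIuMAX "\n",
                   path, (intmax_t)offset, (uintmax_t)size);
    }

    int main(void)
    {
            report("example", (off_t)4194304, (size_t)4096);
            return 0;
    }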

Signed-off-by: Danny Al-Gaaf danny.al-g...@bisect.de
---
 src/rbd_fuse/rbd-fuse.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/rbd_fuse/rbd-fuse.c b/src/rbd_fuse/rbd-fuse.c
index c204463..748976a 100644
--- a/src/rbd_fuse/rbd-fuse.c
+++ b/src/rbd_fuse/rbd-fuse.c
@@ -15,6 +15,7 @@
 #include <sys/types.h>
 #include <unistd.h>
 #include <getopt.h>
+#include <inttypes.h>
 
 #include "include/rbd/librbd.h"
 
@@ -321,7 +322,7 @@ static int rbdfs_write(const char *path, const char *buf, size_t size,
 
 	if (offset + size > rbdsize(fi->fh)) {
 		int r;
-		fprintf(stderr, "rbdfs_write resizing %s to 0x%lx\n",
+		fprintf(stderr, "rbdfs_write resizing %s to 0x%"PRIxMAX"\n",
 			path, offset+size);
 		r = rbd_resize(rbd->image, offset+size);
 		if (r < 0)
@@ -516,7 +517,7 @@ rbdfs_truncate(const char *path, off_t size)
 		return -ENOENT;
 
 	rbd = &opentbl[fd];
-	fprintf(stderr, "truncate %s to %ld (0x%lx)\n", path, size, size);
+	fprintf(stderr, "truncate %s to %"PRIdMAX" (0x%"PRIxMAX")\n", path, size, size);
 	r = rbd_resize(rbd->image, size);
 	if (r < 0)
 		return r;
@@ -559,7 +560,7 @@ rbdfs_setxattr(const char *path, const char *name, const char *value,
 	for (ap = attrs; ap->attrname != NULL; ap++) {
 		if (strcmp(name, ap->attrname) == 0) {
 			*ap->attrvalp = strtoull(value, NULL, 0);
-			fprintf(stderr, "rbd-fuse: %s set to 0x%lx\n",
+			fprintf(stderr, "rbd-fuse: %s set to 0x%"PRIx64"\n",
				ap->attrname, *ap->attrvalp);
 			return 0;
 		}
@@ -578,7 +579,7 @@ rbdfs_getxattr(const char *path, const char *name, char *value,
 
 	for (ap = attrs; ap->attrname != NULL; ap++) {
 		if (strcmp(name, ap->attrname) == 0) {
-			sprintf(buf, "%lu", *ap->attrvalp);
+			sprintf(buf, "%"PRIu64, *ap->attrvalp);
 			if (value != NULL && size >= strlen(buf))
 				strcpy(value, buf);
 			fprintf(stderr, "rbd-fuse: get %s\n", ap->attrname);
-- 
1.8.1.1



[PATCH 2/3] rbd-fuse: fix usage of conn->want

2013-01-28 Thread Danny Al-Gaaf
Fix usage of conn->want and FUSE_CAP_BIG_WRITES. Both need libfuse
version >= 2.8. Encapsulate the related code line in a check for
the needed FUSE_VERSION, as is already done in ceph-fuse in some cases.

Signed-off-by: Danny Al-Gaaf danny.al-g...@bisect.de
---
 src/rbd_fuse/rbd-fuse.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/rbd_fuse/rbd-fuse.c b/src/rbd_fuse/rbd-fuse.c
index b3e318f..c204463 100644
--- a/src/rbd_fuse/rbd-fuse.c
+++ b/src/rbd_fuse/rbd-fuse.c
@@ -461,8 +461,9 @@ rbdfs_init(struct fuse_conn_info *conn)
 	ret = rados_ioctx_create(cluster, pool_name, &ioctx);
 	if (ret < 0)
 		exit(91);
-
+#if FUSE_VERSION >= FUSE_MAKE_VERSION(2, 8)
 	conn->want |= FUSE_CAP_BIG_WRITES;
+#endif
gotrados = 1;
 
// init's return value shows up in fuse_context.private_data,
-- 
1.8.1.1



Re: page allocation failures on osd nodes

2013-01-28 Thread Andrey Korolyov
On Mon, Jan 28, 2013 at 5:48 PM, Sam Lang sam.l...@inktank.com wrote:
 On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov and...@xdel.ru wrote:

 Ahem. Once, on an almost empty node, the same trace was produced by a qemu
 process (which was actually pinned to a specific NUMA node), so it seems
 that this is generally some scheduler/mm bug, not directly related to the
 osd processes. In other words, the lower the percentage of memory that is
 actually resident (RSS), the higher the probability of such an allocation
 failure.

 This might be a known bug in xen for your kernel?  The xen users list
 might be able to help.
 -sam

It is vanilla 3.4; I really wonder where the paravirt bits in the trace come from.


Re: RadosGW performance and disk space usage

2013-01-28 Thread Yehuda Sadeh
On Mon, Jan 28, 2013 at 6:28 AM, Joao Eduardo Luis
joao.l...@inktank.com wrote:
 On 01/27/2013 11:10 PM, Cesar Mello wrote:

 Hi,

 Just tried rest-bench. This little tool is wonderful, thanks!

 I still have to learn lots of things. So please don't spend much time
 explaining me, but instead please give me any pointers to
 documentation or source code that can be useful. As a curiosity, I'm
 pasting the results from my laptop. I'll repeat the same tests using
 my desktop as the server.

 Notice there is an assert being triggered, so I guess I'm running a
 build with debugging code ?!. I compiled using ./configure
 --with-radosgw --with-rest-bench followed by make.


 Asserts are usually used to mark invariants in the code logic, and they are
 always built in, regardless of whether debugging is enabled or disabled.  Given
 you are hitting one, it probably means something is not quite right (it might
 be a bug, or some invariant was broken for some reason).


 common/WorkQueue.cc: In function 'virtual ThreadPool::~ThreadPool()'
 thread 7f1211401780 time 2013-01-27 20:51:01.196525
 common/WorkQueue.cc: 59: FAILED assert(_threads.empty())
   ceph version 0.56-464-gfa421cf
 (fa421cf5f52ca16fa1328dbea2f4bda85c56cd3f)
   1: (ThreadPool::~ThreadPool()+0x10c) [0x43bf9c]
   2: (RESTDispatcher::~RESTDispatcher()+0xf1) [0x42a021]
   3: (main()+0x75b) [0x42521b]
   4: (__libc_start_main()+0xed) [0x7f120f37576d]
   5: rest-bench() [0x426079]
   NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.


 Looks like http://tracker.newdream.net/issues/3896

Right, 3896. Probably some cleanup before shutdown issues.

Yehuda


Re: RadosGW performance and disk space usage

2013-01-28 Thread Yehuda Sadeh
On Sun, Jan 27, 2013 at 3:10 PM, Cesar Mello cme...@gmail.com wrote:
 Hi,

 Just tried rest-bench. This little tool is wonderful, thanks!

 I still have to learn lots of things. So please don't spend much time
 explaining me, but instead please give me any pointers to
 documentation or source code that can be useful. As a curiosity, I'm
 pasting the results from my laptop. I'll repeat the same tests using
 my desktop as the server.

 Notice there is an assert being triggered, so I guess I'm running a
 build with debugging code ?!. I compiled using ./configure
 --with-radosgw --with-rest-bench followed by make.

 Thanks a lot for the attention.

 Best regards!
 Mello

 root@l3:/etc/init.d# rest-bench --api-host=localhost --bucket=test
 --access-key=JJABVJ3AWBS1ZOCML7NS
 --secret=A+ecBz2+Sdxj4Y8Mo+u3akIewGvJPkwOhwRgPKkX --protocol=http
 --uri_style=path write
 host=localhost
  Maintaining 16 concurrent writes of 4194304 bytes for at least 60 seconds.
  Object prefix: benchmark_data_l3_4032
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  0   3 3 0 0 0 - 0
  1  1616 0 0 0 - 0
  2  1616 0 0 0 - 0
  3  1616 0 0 0 - 0
  4  1616 0 0 0 - 0
  5  1616 0 0 0 - 0
  6  1616 0 0 0 - 0
  7  1616 0 0 0 - 0
  8  1616 0 0 0 - 0
  9  1616 0 0 0 - 0
 10  1616 0 0 0 - 0
 11  1616 0 0 0 - 0
 12  1617 1  0.333265  0.33   11.2761   11.2761
 13  1618 2  0.615257 4   12.5964   11.9363
 14  1620 4   1.14262 8   13.1392   12.5365
 15  1623 7   1.8662812   14.2273   13.2594
 16  162711   2.7494416   15.0222   13.8968
 17  163216   3.7639420   16.2604   14.6301
 18  163216   3.55483 0 -   14.6301
 19  1634183.7887 46.2274   13.7695
 2013-01-27 20:49:29.703509min lat: 6.2274 max lat: 16.2604 avg lat: 13.7695
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 20  163418   3.59927 0 -   13.7695
 21  163418   3.42787 0 -   13.7695
 22  163418   3.27205 0 -   13.7695
 23  163519   3.30367 1   9.09053   13.5233
 24  163620   3.33264 4   9.0966713.302
 25  163620   3.19933 0 -13.302
 26  163620   3.07628 0 -13.302
 27  163721   3.11047   1.3   11.245913.204
 28  163721   2.99938 0 -13.204
 29  163721   2.89595 0 -13.204
 30  163721   2.79942 0 -13.204
 31  163721   2.70912 0 -13.204
 32  163923   2.87441   1.6   14.9981   13.3602
 33  1639232.7873 0 -   13.3602
 34  163923   2.70533 0 -   13.3602
 35  164024   2.74229   1.3   21.5365   13.7009
 36  164024   2.66612 0 -   13.7009
 37  164226   2.81023 4   22.6059   14.3855
 38  164226   2.73628 0 -   14.3855
 39  164529   2.97374 6   23.2615   15.3025
 2013-01-27 20:49:49.707740min lat: 6.2274 max lat: 23.4496 avg lat: 16.1307
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 40  165135   3.4992724   21.0123   16.1307
 41  165135   3.41392 0 -   16.1307
 42  165236   3.42786 2   19.024316.211
 43  165236   3.34814 0 -16.211
 44  165236   3.27204 0 -16.211
 45  165236   3.19933 0 -16.211
 46  165337   3.21672 1   11.0793   16.0723
 47  165337   

Re: [PATCH 0/3] fix some rbd-fuse related issues

2013-01-28 Thread Dan Mick
Thanks Danny, I'll look at these today. 

On Jan 28, 2013, at 7:33 AM, Danny Al-Gaaf danny.al-g...@bisect.de wrote:

 Here are three patches to fix some issues with the new rbd-fuse
 code and an issue with the fuse handling in configure.
 
 Danny Al-Gaaf (3):
  configure: fix check for fuse_getgroups()
   rbd-fuse: fix usage of conn->want
  rbd-fuse: fix printf format for off_t and size_t
 
 configure.ac|  8 
 src/rbd_fuse/rbd-fuse.c | 12 +++-
 2 files changed, 11 insertions(+), 9 deletions(-)
 
 -- 
 1.8.1.1
 


Re: can't download from radosgw

2013-01-28 Thread Yehuda Sadeh
On Mon, Jan 28, 2013 at 3:55 AM, Gandalf Corvotempesta
gandalf.corvotempe...@gmail.com wrote:
 2013/1/28 Gandalf Corvotempesta gandalf.corvotempe...@gmail.com:
 2013-01-28 12:22:27.759162 7fe8657c3700  0 NOTICE: failed to send
 response to client
 2013-01-28 12:22:27.759186 7fe8657c3700  0 ERROR: s->cio->print()
 returned err=-1
 2013-01-28 12:22:27.759206 7fe8657c3700  0 ERROR: s->cio->print()
 returned err=-1
 2013-01-28 12:22:27.759211 7fe8657c3700  0 ERROR: s->cio->print()
 returned err=-1
 2013-01-28 12:22:27.759216 7fe8657c3700  0 ERROR: s->cio->print()
 returned err=-1
 2013-01-28 12:22:27.759268 7fe8657c3700  2 req 128:0.051384:s3:GET
 /public2/shared/9780470398661.pdf:get_obj:http status=403
 2013-01-28 12:22:27.759336 7fe8657c3700  1 == req done
 req=0x3192980 http_status=403 ==


 This happens only with Google Chrome.
 Firefox, curl, wget and many others are able to download properly.

(resending to all)

It looks like the connection is closed early by the client (chrome).
Just a thought, maybe the content-type is not set correctly on the
object?

Yehuda


Re: RadosGW performance and disk space usage

2013-01-28 Thread Cesar Mello
Sure I can later when I arrive home. With the end of my vacation, I'll
be able to devote a couple of hours after my 3-year-old sleeps. :-)

I guess my laptop hard disk has horrible seek times. I'll repeat the
tests in my desktop as soon as possible.

Thanks a lot for the attention!

Best regards
Mello

On Mon, Jan 28, 2013 at 3:35 PM, Yehuda Sadeh yeh...@inktank.com wrote:
 On Sun, Jan 27, 2013 at 3:10 PM, Cesar Mello cme...@gmail.com wrote:
 Hi,

 Just tried rest-bench. This little tool is wonderful, thanks!

 I still have to learn lots of things. So please don't spend much time
 explaining me, but instead please give me any pointers to
 documentation or source code that can be useful. As a curiosity, I'm
 pasting the results from my laptop. I'll repeat the same tests using
 my desktop as the server.

 Notice there is an assert being triggered, so I guess I'm running a
 build with debugging code ?!. I compiled using ./configure
 --with-radosgw --with-rest-bench followed by make.

 Thanks a lot for the attention.

 Best regards!
 Mello

 root@l3:/etc/init.d# rest-bench --api-host=localhost --bucket=test
 --access-key=JJABVJ3AWBS1ZOCML7NS
 --secret=A+ecBz2+Sdxj4Y8Mo+u3akIewGvJPkwOhwRgPKkX --protocol=http
 --uri_style=path write
 host=localhost
  Maintaining 16 concurrent writes of 4194304 bytes for at least 60 seconds.
  Object prefix: benchmark_data_l3_4032
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  0   3 3 0 0 0 - 0
  1  1616 0 0 0 - 0
  2  1616 0 0 0 - 0
  3  1616 0 0 0 - 0
  4  1616 0 0 0 - 0
  5  1616 0 0 0 - 0
  6  1616 0 0 0 - 0
  7  1616 0 0 0 - 0
  8  1616 0 0 0 - 0
  9  1616 0 0 0 - 0
 10  1616 0 0 0 - 0
 11  1616 0 0 0 - 0
 12  1617 1  0.333265  0.33   11.2761   11.2761
 13  1618 2  0.615257 4   12.5964   11.9363
 14  1620 4   1.14262 8   13.1392   12.5365
 15  1623 7   1.8662812   14.2273   13.2594
 16  162711   2.7494416   15.0222   13.8968
 17  163216   3.7639420   16.2604   14.6301
 18  163216   3.55483 0 -   14.6301
 19  1634183.7887 46.2274   13.7695
 2013-01-27 20:49:29.703509min lat: 6.2274 max lat: 16.2604 avg lat: 13.7695
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 20  163418   3.59927 0 -   13.7695
 21  163418   3.42787 0 -   13.7695
 22  163418   3.27205 0 -   13.7695
 23  163519   3.30367 1   9.09053   13.5233
 24  163620   3.33264 4   9.0966713.302
 25  163620   3.19933 0 -13.302
 26  163620   3.07628 0 -13.302
 27  163721   3.11047   1.3   11.245913.204
 28  163721   2.99938 0 -13.204
 29  163721   2.89595 0 -13.204
 30  163721   2.79942 0 -13.204
 31  163721   2.70912 0 -13.204
 32  163923   2.87441   1.6   14.9981   13.3602
 33  1639232.7873 0 -   13.3602
 34  163923   2.70533 0 -   13.3602
 35  164024   2.74229   1.3   21.5365   13.7009
 36  164024   2.66612 0 -   13.7009
 37  164226   2.81023 4   22.6059   14.3855
 38  164226   2.73628 0 -   14.3855
 39  164529   2.97374 6   23.2615   15.3025
 2013-01-27 20:49:49.707740min lat: 6.2274 max lat: 23.4496 avg lat: 16.1307
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 40  165135   3.4992724   21.0123   16.1307
 41  165135   3.41392 0 -   16.1307
 42  165236   

Re: Geo-replication with RADOS GW

2013-01-28 Thread Gregory Farnum
On Monday, January 28, 2013 at 9:54 AM, Ben Rowland wrote:
 Hi,
  
 I'm considering using Ceph to create a cluster across several data
 centres, with the strict requirement that writes should go to both
 DCs. This seems possible by specifying rules in the CRUSH map, with
 an understood latency hit resulting from purely synchronous writes.
  
 The part I'm unsure about is how the RADOS GW fits into this picture.
 For high availability (and to improve best-case latency on reads),
 we'd want to run a gateway in each data centre. However, the first
 paragraph of the following post suggests this is not possible:
  
 http://article.gmane.org/gmane.comp.file-systems.ceph.devel/12238
  
 Is there a hard restriction on how many radosgw instances can run
 across the cluster, or is the point of the above post more about a
 performance hit?

It's talking about the performance hit. Most people can't afford data-center 
level connectivity between two different buildings. ;) If you did have a Ceph 
cluster split across two DCs (with the bandwidth to support them), this will 
work fine. There aren't any strict limits on the number of gateways you stick 
on a cluster, just the scaling costs associated with cache invalidation 
notifications.

  
 It seems to me it should be possible to run more
 than one radosgw, particularly if each instance communicates with a
 local OSD which can proxy reads/writes to the primary (which may or
 may not be DC-local).

They aren't going to do this, though — each gateway will communicate with the 
primaries directly.
-Greg



Re: [PATCH 0/2] fix some compiler warnings

2013-01-28 Thread Dan Mick
I'd just noticed the utime warning on my laptop 32-bit build and was trying to 
figure out why our 32-bit nightly didn't see it. And Greg had seen the system 
build problem where I didn't, and I was isolating differences there as well.

I purposely didn't spend time on the system() error handling because I was 
thinking of those calls as best-effort; if they fail, the map will likely fail 
anyway, but there's no harm in handling errors, particularly if it'll shut the 
compiler up :)
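
For reference, a minimal stand-alone sketch of the kind of return-value
handling that patch adds (the command here is just a placeholder, not what
rbd.cc actually runs):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    int main(void)
    {
            /* Placeholder command; rbd.cc runs its own helper commands. */
            int r = system("true");

            if (r < 0) {
                    perror("system");                 /* fork/exec failed */
                    return 1;
            }
            if (WIFEXITED(r) && WEXITSTATUS(r) != 0) {
                    fprintf(stderr, "command exited with status %d\n",
                            WEXITSTATUS(r));
                    return 1;
            }
            return 0;
    }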

On Jan 27, 2013, at 12:57 PM, Danny Al-Gaaf danny.al-g...@bisect.de wrote:

 Attached two patches to fix some compiler warnings.
 
 Danny Al-Gaaf (2):
  utime: fix narrowing conversion compiler warning in sleep()
  rbd: don't ignore return value of system()
 
 src/include/utime.h |  2 +-
 src/rbd.cc  | 36 ++--
 2 files changed, 31 insertions(+), 7 deletions(-)
 
 -- 
 1.8.1.1
 


Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-01-28 Thread Isaac Otsiabah


Gregory, I recreated the osd down problem again this morning on two nodes 
(g13ct, g14ct). First, I created a 1-node cluster on g13ct (with osd.0, 1, 2) 
and then added host g14ct (osd.3, 4, 5). osd.1 went down for about a minute and 
a half after osd.3, 4 and 5 were added. I have included the routing table of 
each node at the time osd.1 went down. The ceph.conf and ceph-osd.1.log files are 
attached. The crush map was the default. Also, it could be a timing issue, because 
it does not always fail when using the default crush map; it takes several trials 
before you see it. Thank you.


[root@g13ct ~]# netstat -r
Kernel IP routing table
Destination Gateway Genmask Flags   MSS Window  irtt Iface
default 133.164.98.250 0.0.0.0 UG    0 0  0 eth2
133.164.98.0    *   255.255.255.0   U 0 0  0 eth2
link-local  *   255.255.0.0 U 0 0  0 eth3
link-local  *   255.255.0.0 U 0 0  0 eth0
link-local  *   255.255.0.0 U 0 0  0 eth2
192.0.0.0   *   255.0.0.0   U 0 0  0 eth3
192.0.0.0   *   255.0.0.0   U 0 0  0 eth0
192.168.0.0 *   255.255.255.0   U 0 0  0 eth3
192.168.1.0 *   255.255.255.0   U 0 0  0 eth0
[root@g13ct ~]# ceph osd tree

# id    weight  type name   up/down reweight
-1  6   root default
-3  6   rack unknownrack
-2  3   host g13ct
0   1   osd.0   up  1
1   1   osd.1   down    1
2   1   osd.2   up  1
-4  3   host g14ct
3   1   osd.3   up  1
4   1   osd.4   up  1
5   1   osd.5   up  1



[root@g14ct ~]# ceph osd tree

# id    weight  type name   up/down reweight
-1  6   root default
-3  6   rack unknownrack
-2  3   host g13ct
0   1   osd.0   up  1
1   1   osd.1   down    1
2   1   osd.2   up  1
-4  3   host g14ct
3   1   osd.3   up  1
4   1   osd.4   up  1
5   1   osd.5   up  1

[root@g14ct ~]# netstat -r
Kernel IP routing table
Destination Gateway Genmask Flags   MSS Window  irtt Iface
default 133.164.98.250 0.0.0.0 UG    0 0  0 eth0
133.164.98.0    *   255.255.255.0   U 0 0  0 eth0
link-local  *   255.255.0.0 U 0 0  0 eth3
link-local  *   255.255.0.0 U 0 0  0 eth5
link-local  *   255.255.0.0 U 0 0  0 eth0
192.0.0.0   *   255.0.0.0   U 0 0  0 eth3
192.0.0.0   *   255.0.0.0   U 0 0  0 eth5
192.168.0.0 *   255.255.255.0   U 0 0  0 eth3
192.168.1.0 *   255.255.255.0   U 0 0  0 eth5
[root@g14ct ~]# ceph osd tree

# id    weight  type name   up/down reweight
-1  6   root default
-3  6   rack unknownrack
-2  3   host g13ct
0   1   osd.0   up  1
1   1   osd.1   down    1
2   1   osd.2   up  1
-4  3   host g14ct
3   1   osd.3   up  1
4   1   osd.4   up  1
5   1   osd.5   up  1





Isaac










- Original Message -
From: Isaac Otsiabah zmoo...@yahoo.com
To: Gregory Farnum g...@inktank.com
Cc: ceph-devel@vger.kernel.org ceph-devel@vger.kernel.org
Sent: Friday, January 25, 2013 9:51 AM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to 
my cluster



Gregory, the physical network layout is simple; the two networks are 
separate. The 192.168.0 and 192.168.1 networks are not subnets within a 
single network.

Isaac  




- Original Message -
From: Gregory Farnum g...@inktank.com
To: Isaac Otsiabah zmoo...@yahoo.com
Cc: ceph-devel@vger.kernel.org ceph-devel@vger.kernel.org
Sent: Thursday, January 24, 2013 1:28 PM
Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to 
my cluster

What's the physical layout of your networking? This additional log may prove 
helpful as well, but I really need a bit more context in evaluating the 
messages I see from the first one. :) 
-Greg


On Thursday, January 24, 

Re: [PATCH 07/25] mds: don't early reply rename

2013-01-28 Thread Sage Weil
On Wed, 23 Jan 2013, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com
 
 _rename_finish() does not send dentry link/unlink message to replicas.
 We should prevent dentries that are modified by the rename operation
 from getting new replicas when the rename operation is committing. So
 don't mark xlocks done and early reply for rename

Can we change this to only skip the early reply if there are replicas?  Or 
change things so we do send those messages (or something similar) early?  As 
is, this will kill workloads like rsync that rename every file.

Thanks!
s

 
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Server.cc | 8 
  1 file changed, 8 insertions(+)
 
 diff --git a/src/mds/Server.cc b/src/mds/Server.cc
 index eced76f..4492341 100644
 --- a/src/mds/Server.cc
 +++ b/src/mds/Server.cc
 @@ -796,6 +796,14 @@ void Server::early_reply(MDRequest *mdr, CInode *tracei, CDentry *tracedn)
      return;
    }
  
 +  // _rename_finish() does not send dentry link/unlink message to replicas.
 +  // so do not mark xlocks done, the xlocks prevent srcdn and destdn from
 +  // getting new replica.
 +  if (mdr->client_request->get_op() == CEPH_MDS_OP_RENAME) {
 +    dout(10) << " early_reply - rename, not allowed" << dendl;
 +    return;
 +  }
 +
    MClientRequest *req = mdr->client_request;
    entity_inst_t client_inst = req->get_source_inst();
    if (client_inst.name.is_mds())
 -- 
 1.7.11.7
 


[PATCH 0/2] rbd: manage racing opens/removes

2013-01-28 Thread Alex Elder
A recent change to rbd prevented rbd devices from being unmapped
when they were in use.  However that change did not address a
different, but related problem.  It is possible for an open (the
one that would bump the open count from 0 to 1) to begin after
a request to remove the rbd device has decided it can proceed.

To fix this, define a new "removing" flag to prevent opens from
proceeding once removal of a device has begun.  The first patch
in this series defines a new flags field, and uses it for this
as well as for the "exists" flag for snapshot mappings.

-Alex

[PATCH 1/2] rbd: define flags field, use it for exists flag
[PATCH 2/2] rbd: prevent open for image being removed
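
A user-space sketch of the scheme described above (not the kernel code in the
patches that follow): the "removing" flag and the open count share one lock,
so a racing open either bumps the count first, in which case the remove fails
with EBUSY, or sees the flag and fails with ENOENT.

    #include <errno.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long open_count;
    static bool removing;

    static int dev_open(void)
    {
            int ret = 0;

            pthread_mutex_lock(&lock);
            if (removing)
                    ret = -ENOENT;  /* removal already under way */
            else
                    open_count++;
            pthread_mutex_unlock(&lock);
            return ret;
    }

    static int dev_remove(void)
    {
            int ret = 0;

            pthread_mutex_lock(&lock);
            if (open_count)
                    ret = -EBUSY;   /* someone still has it open */
            else
                    removing = true;
            pthread_mutex_unlock(&lock);
            return ret;
    }

    int main(void)
    {
            printf("open: %d\n", dev_open());
            printf("remove: %d\n", dev_remove());  /* -EBUSY: still open */
            return 0;
    }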


[PATCH 1/2] rbd: define flags field, use it for exists flag

2013-01-28 Thread Alex Elder
Define a new rbd device flags field, manipulated using bit
operations.  Replace the use of the current exists flag with a bit
in this new flags field.  Add a little commentary about the
exists flag, which does not need to be manipulated atomically.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 177ba0c..107df40 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -262,7 +262,7 @@ struct rbd_device {
spinlock_t  lock;   /* queue lock */

struct rbd_image_header header;
-	atomic_t		exists;
+   unsigned long   flags;
struct rbd_spec *spec;

char*header_name;
@@ -291,6 +291,12 @@ struct rbd_device {
unsigned long   open_count;
 };

+/* Flag bits for rbd_dev-flags */
+
+enum rbd_dev_flags {
+   rbd_dev_flag_exists,/* mapped snapshot has not been deleted */
+};
+
 static DEFINE_MUTEX(ctl_mutex);	  /* Serialize open/close/setup/teardown */

 static LIST_HEAD(rbd_dev_list);/* devices */
@@ -790,7 +796,8 @@ static int rbd_dev_set_mapping(struct rbd_device *rbd_dev)
 			goto done;
 		rbd_dev->mapping.read_only = true;
 	}
-	atomic_set(&rbd_dev->exists, 1);
+	set_bit(rbd_dev_flag_exists, &rbd_dev->flags);
+
 done:
return ret;
 }
@@ -1886,9 +1893,14 @@ static void rbd_request_fn(struct request_queue *q)
 		rbd_assert(rbd_dev->spec->snap_id == CEPH_NOSNAP);
}

-	/* Quit early if the snapshot has disappeared */
-
-	if (!atomic_read(&rbd_dev->exists)) {
+	/*
+	 * Quit early if the mapped snapshot no longer
+	 * exists.  It's still possible the snapshot will
+	 * have disappeared by the time our request arrives
+	 * at the osd, but there's no sense in sending it if
+	 * we already know.
+	 */
+	if (!test_bit(rbd_dev_flag_exists, &rbd_dev->flags)) {
 		dout("request for non-existent snapshot");
 		rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP);
result = -ENXIO;
@@ -2578,7 +2590,7 @@ struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
return NULL;

 	spin_lock_init(&rbd_dev->lock);
-	atomic_set(&rbd_dev->exists, 0);
+	rbd_dev->flags = 0;
 	INIT_LIST_HEAD(&rbd_dev->node);
 	INIT_LIST_HEAD(&rbd_dev->snaps);
 	init_rwsem(&rbd_dev->header_rwsem);
@@ -3207,10 +3219,17 @@ static int rbd_dev_snaps_update(struct rbd_device *rbd_dev)
 		if (snap_id == CEPH_NOSNAP || (snap && snap->id > snap_id)) {
 			struct list_head *next = links->next;

-   /* Existing snapshot not in the new snap context */
-
+   /*
+* A previously-existing snapshot is not in
+* the new snap context.
+*
+* If the now missing snapshot is the one the
+* image is mapped to, clear its exists flag
+* so we can avoid sending any more requests
+* to it.
+*/
 			if (rbd_dev->spec->snap_id == snap->id)
-				atomic_set(&rbd_dev->exists, 0);
+				clear_bit(rbd_dev_flag_exists, &rbd_dev->flags);
 			rbd_remove_snap_dev(snap);
 			dout("%ssnap id %llu has been removed\n",
 				rbd_dev->spec->snap_id == snap->id ?
-- 
1.7.9.5



[PATCH 2/2] rbd: prevent open for image being removed

2013-01-28 Thread Alex Elder
An open request for a mapped rbd image can arrive while removal of
that mapping is underway.  We need to prevent such an open request
from succeeding.  (It appears that Maciej Galkiewicz ran into this
problem.)

Define and use a removing flag to indicate a mapping is getting
removed.  Set it in the remove path after verifying nothing holds
the device open.  And check it in the open path before allowing the
open to proceed.  Acquire the rbd device's lock around each of these
spots to avoid any races accessing the flags and open_count fields.

This addresses:
http://tracker.newdream.net/issues/3427

Reported-by:  Maciej Galkiewicz maciejgalkiew...@ragnarson.com
Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   42 +-
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 107df40..03b15b8 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -259,10 +259,10 @@ struct rbd_device {

charname[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */

-   spinlock_t  lock;   /* queue lock */
+   spinlock_t  lock;   /* queue, flags, open_count */

struct rbd_image_header header;
-   unsigned long   flags;
+   unsigned long   flags;  /* possibly lock protected */
struct rbd_spec *spec;

char*header_name;
@@ -288,13 +288,20 @@ struct rbd_device {

/* sysfs related */
struct device   dev;
-   unsigned long   open_count;
+   unsigned long   open_count; /* protected by lock */
 };

-/* Flag bits for rbd_dev->flags */
+/*
+ * Flag bits for rbd_dev->flags.  If atomicity is required,
+ * rbd_dev->lock is used to protect access.
+ *
+ * Currently, only the removing flag (which is coupled with the
+ * open_count field) requires atomic access.
+ */

 enum rbd_dev_flags {
rbd_dev_flag_exists,/* mapped snapshot has not been deleted */
+   rbd_dev_flag_removing,  /* this mapping is being removed */
 };

 static DEFINE_MUTEX(ctl_mutex);	  /* Serialize open/close/setup/teardown */
@@ -383,14 +390,23 @@ static int rbd_dev_v2_refresh(struct rbd_device *rbd_dev, u64 *hver);
 static int rbd_open(struct block_device *bdev, fmode_t mode)
 {
 	struct rbd_device *rbd_dev = bdev->bd_disk->private_data;
+	bool removing = false;
 
 	if ((mode & FMODE_WRITE) && rbd_dev->mapping.read_only)
 		return -EROFS;
 
+	spin_lock(&rbd_dev->lock);
+	if (test_bit(rbd_dev_flag_removing, &rbd_dev->flags))
+		removing = true;
+	else
+		rbd_dev->open_count++;
+	spin_unlock(&rbd_dev->lock);
+	if (removing)
+		return -ENOENT;
+
 	mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
 	(void) get_device(&rbd_dev->dev);
 	set_device_ro(bdev, rbd_dev->mapping.read_only);
-	rbd_dev->open_count++;
 	mutex_unlock(&ctl_mutex);

return 0;
@@ -399,10 +415,14 @@ static int rbd_open(struct block_device *bdev, fmode_t mode)
 static int rbd_release(struct gendisk *disk, fmode_t mode)
 {
 	struct rbd_device *rbd_dev = disk->private_data;
+	unsigned long open_count_before;
+
+	spin_lock(&rbd_dev->lock);
+	open_count_before = rbd_dev->open_count--;
+	spin_unlock(&rbd_dev->lock);
+	rbd_assert(open_count_before > 0);
 
 	mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
-	rbd_assert(rbd_dev->open_count > 0);
-	rbd_dev->open_count--;
 	put_device(&rbd_dev->dev);
 	mutex_unlock(&ctl_mutex);

@@ -4135,10 +4155,14 @@ static ssize_t rbd_remove(struct bus_type *bus,
goto done;
}

-	if (rbd_dev->open_count) {
+	spin_lock(&rbd_dev->lock);
+	if (rbd_dev->open_count)
 		ret = -EBUSY;
+	else
+		set_bit(rbd_dev_flag_removing, &rbd_dev->flags);
+	spin_unlock(&rbd_dev->lock);
+	if (ret < 0)
 		goto done;
-	}
 
 	while (rbd_dev->parent_spec) {
struct rbd_device *first = rbd_dev;
-- 
1.7.9.5



Re: [PATCH 0/3] fix some rbd-fuse related issues

2013-01-28 Thread Dan Mick

Actually Sage merged them into master.  Thanks again.

On 01/28/2013 09:45 AM, Dan Mick wrote:

Thanks Danny, I'll look at these today.

On Jan 28, 2013, at 7:33 AM, Danny Al-Gaaf danny.al-g...@bisect.de wrote:


Here are three patches to fix some issues with the new rbd-fuse
code and an issue with the fuse handling in configure.

Danny Al-Gaaf (3):
  configure: fix check for fuse_getgroups()
  rbd-fuse: fix usage of conn->want
  rbd-fuse: fix printf format for off_t and size_t

configure.ac|  8 
src/rbd_fuse/rbd-fuse.c | 12 +++-
2 files changed, 11 insertions(+), 9 deletions(-)

--
1.8.1.1




Re: [PATCH 0/2] fix some compiler warnings

2013-01-28 Thread Dan Mick

Sage merged these into master.  Thanks!

On 01/27/2013 12:57 PM, Danny Al-Gaaf wrote:

Attached two patches to fix some compiler warnings.

Danny Al-Gaaf (2):
   utime: fix narrowing conversion compiler warning in sleep()
   rbd: don't ignore return value of system()

  src/include/utime.h |  2 +-
  src/rbd.cc  | 36 ++--
  2 files changed, 31 insertions(+), 7 deletions(-)




Re: [PATCH 07/25] mds: don't early reply rename

2013-01-28 Thread Yan, Zheng
On 01/29/2013 05:44 AM, Sage Weil wrote:
 On Wed, 23 Jan 2013, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com

 _rename_finish() does not send dentry link/unlink message to replicas.
 We should prevent dentries that are modified by the rename operation
 from getting new replicas when the rename operation is committing. So
 don't mark xlocks done and early reply for rename
 
 Can we change this to only skip the early reply if there are replicas?  Or 
 change things so we do send those messages (or something similar) early?  As 
 is, this will kill workloads like rsync that rename every file.
 

How about not marking xlocks on the dentries as done?

Regards
Yan, Zheng


 Thanks!
 s
 

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  src/mds/Server.cc | 8 
  1 file changed, 8 insertions(+)

 diff --git a/src/mds/Server.cc b/src/mds/Server.cc
 index eced76f..4492341 100644
 --- a/src/mds/Server.cc
 +++ b/src/mds/Server.cc
 @@ -796,6 +796,14 @@ void Server::early_reply(MDRequest *mdr, CInode *tracei, CDentry *tracedn)
      return;
    }
  
 +  // _rename_finish() does not send dentry link/unlink message to replicas.
 +  // so do not mark xlocks done, the xlocks prevent srcdn and destdn from
 +  // getting new replica.
 +  if (mdr->client_request->get_op() == CEPH_MDS_OP_RENAME) {
 +    dout(10) << " early_reply - rename, not allowed" << dendl;
 +    return;
 +  }
 +
    MClientRequest *req = mdr->client_request;
    entity_inst_t client_inst = req->get_source_inst();
    if (client_inst.name.is_mds())
 -- 
 1.7.11.7



Re: [PATCH 07/25] mds: don't early reply rename

2013-01-28 Thread Yan, Zheng
On 01/29/2013 10:23 AM, Sage Weil wrote:
 On Tue, 29 Jan 2013, Yan, Zheng wrote:
 On 01/29/2013 05:44 AM, Sage Weil wrote:
 On Wed, 23 Jan 2013, Yan, Zheng wrote:
 From: Yan, Zheng zheng.z@intel.com

 _rename_finish() does not send dentry link/unlink message to replicas.
 We should prevent dentries that are modified by the rename operation
 from getting new replicas when the rename operation is committing. So
 don't mark xlocks done and early reply for rename

 Can we change this to only skip the early reply if there are replicas?  Or 
 change things so we do send those messages (or something similar) early?  As 
 is, this will kill workloads like rsync that rename every file.


 How about not marking xlocks on the dentries as done?
 
 Yeah, I like that if we do that just in the rename case.
 
 The other patches look okay to me (from a quick review).  With that change 
 I'd like to pull the whole branch in.  I assume your current wip-mds 
 branch includes the fix or squashes the problem from the previous series?
 

Just force updated my wip-mds branch. That patch is renamed to "mds: don't set
xlocks on dentries done when early reply rename".

I also updated "mds: preserve non-auth/unlinked objects until slave commit" and
"mds: fix slave rename rollback". The new patches trim non-auth subtrees more
actively.

Regards
Yan, Zheng


Fwd: Ceph Production Environment Setup and Configurations?

2013-01-28 Thread femi anjorin
Hi,

Please, with regard to my questions on the Ceph production environment,
I'd like to give you these details.

I'd like to test write, read and delete operations on a Ceph storage
cluster in a production environment.

I'd also like to check the self-healing and management functionality.

I'd like to know: in a production setup, are gateways required for any of the
three methods of accessing the Ceph cluster? Or should the setup just be that
all the servers are storage nodes with mon, mds and osd running
on each of them, while I access these storage nodes through a single
computer which one could call a client, just like in the 5-minute
setup you described?


-- Forwarded message --
Date: Tue, Jan 29, 2013 at 2:56 AM
Subject: Ceph Production Environment Setup and Configurations?
To: ceph-devel@vger.kernel.org


Please can anyone advise on how exactly a Ceph production
environment should look, and what the configuration files should
be? My hardware includes the following:



Server A, B, C configuration

CPU - Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz

RAM - 16GB

Hard drive -  500GB

SSD - 120GB



Server D,E,F,G,H,J configuration

CPU - Intel(R) Atom(TM) CPU D525   @ 1.80GHz

RAM - 4 GB

Boot drive -  320gb

SSD - 120 GB

Storage drives - 16 X 2 TB



I am thinking of these configurations, but I am not sure.

Server A - MDS and MON

Server B - MON

Server C - MON

Server D, E,F,G,H,J - OSD



Regards.


Re: [PATCH 0/2] two small patches for CEPH wireshark plugin

2013-01-28 Thread David Zafman

You could look at the wip-wireshark-zafman branch.  I rebased it and force 
pushed it.   It has changes to the wireshark.patch and a minor change I needed 
to get it to build.  I'm surprised the recent checkin didn't include the change 
to packet-ceph.c which I needed to get it to build.

David Zafman
Senior Developer
david.zaf...@inktank.com



On Jan 24, 2013, at 12:49 PM, Danny Al-Gaaf danny.al-g...@bisect.de wrote:

 Am 24.01.2013 19:31, schrieb Sage Weil:
 Hi Danny!
 [...]
 Since you brought up wireshark...
 
 We would LOVE LOVE LOVE it if this plugin could get upstream into 
 wireshark.  
 
 Yes, this would be great.
 
 IIRC, the problem (last time we checked, ages ago) was that 
 there were strict coding guidelines for that project that weren't 
 followed.  I'm not sure if that is still the case, or even if that is 
 accurate.
 
 It would be great if someone on this list who is looking for a way to 
 contribute could take the lead on trying to make this happen... :-)
 
 I'll take a look at it maybe ... if I find some free time for it.
 
 What about the patches? Can we apply them to the ceph git tree until we
 have another solution for the wireshark code?
 
 Danny


[ceph] locking fun with d_materialise_unique()

2013-01-28 Thread Al Viro
There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling
in ceph_fill_trace().  Consider the following scenario:

Process calls stat(2).  Lookup locks parent, allocates dentry and calls
->lookup().  Request is created and sent over the wire.  Then we sit and
wait for completion.  Just as the reply has arrived, process gets SIGKILL.
OK, we get to
/*
 * ensure we aren't running concurrently with
 * ceph_fill_trace or ceph_readdir_prepopulate, which
 * rely on locks (dir mutex) held by our caller.
 */
	mutex_lock(&req->r_fill_mutex);
	req->r_err = err;
	req->r_aborted = true;
	mutex_unlock(&req->r_fill_mutex);
and we got there before handle_reply() grabbed ->r_fill_mutex.  Then we return
to ceph_lookup(), drop the reference to request and bugger off.  Parent is
unlocked by caller.

In the meanwhile, there's another thread sitting in handle_reply().  It
got ->r_fill_mutex and called ceph_fill_trace().  Had that been something
like rename request, ceph_fill_trace() would've checked req->r_aborted and
that would've been it.  However, we hit this:
	} else if (req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
		   req->r_op == CEPH_MDS_OP_MKSNAP) {
		struct dentry *dn = req->r_dentry;
and proceed to
		dout(" linking snapped dir %p to dn %p\n", in, dn);
dn = splice_dentry(dn, in, NULL, true);
which does
realdn = d_materialise_unique(dn, in);
and we are in trouble - d_materialise_unique() assumes that ->i_mutex on parent
is held, which isn't guaranteed anymore.  Not that d_delete() done a couple
of lines earlier was any better...

I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
get to its splice_dentry() and d_delete() calls in similar situations -
I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
there as well.

I might very well be missing something - that code is seriously convoluted,
and race wouldn't be easy to hit, so I don't have anything resembling
a candidate reproducer ;-/  IOW, this is just from RTFS and I'd really
appreciate comments from folks familiar with ceph.

VFS side of requirements is fairly simple:
	* d_splice_alias(d, _), d_add_ci(d, _), d_add(d, _),
d_materialise_unique(d, _), d_delete(d), d_move(_, d) should be called
only with ->i_mutex held on the parent of d.
	* d_move(d, _), d_add_unique(d, _), d_instantiate_unique(d, _),
d_instantiate(d, _) should be called only with d being parentless (i.e.
d->d_parent == d, aka. IS_ROOT(d)) or with ->i_mutex held on the parent of d.
	* with the exception of "prepopulate dentry tree at ->get_sb() time"
kind of situations, d_alloc(d, _) and d_alloc_name(d, _) should be called
only with d->d_inode->i_mutex held (and it won't be too hard to get rid of
those exceptions, actually).
	* lookup_one_len(_, d, _) should only be called with ->i_mutex held
on d->d_inode
	* d_move(d1, d2) in case when d1 and d2 have different parents
should only be called with ->s_vfs_rename_mutex held on d1->d_sb (== d2->d_sb).

We are guaranteed that ->i_mutex is held on (inode of) parent of d in
->lookup(_, d, _)
->atomic_open(_, d, _, _, _, _)
->mkdir(_, d, _)
->symlink(_, d, _)
->create(_, d, _, _)
->mknod(_, d, _, _)
->link(_, _, d)
->unlink(_, d)
->rmdir(_, d)
->rename(_, d, _, _)
->rename(_, _, _, d)
Note that this is *not* guaranteed for another argument of ->link() - the
inode we are linking has ->i_mutex held, but nothing of that kind is promised
for its parent directory.
We also are guaranteed that ->i_mutex is held on the inode of opened
directory passed to ->readdir() and on victims of ->unlink(), ->rmdir() and
overwriting ->rename().

FWIW, I went through that stuff this weekend and we are fairly close to having
those requirements satisfied - I'll push a branch with the accumulated fixes
in a few and after that we should be down to very few remaining violations and
dubious places (ceph issues above being one of those).  And yes, this stuff
really need to be in Documentation/filesystems somewhere, along with the
full description of locking rules for ->d_parent and ->d_name accesses.  I'm
trying to put that together right now...


Re: [ceph] locking fun with d_materialise_unique()

2013-01-28 Thread Sage Weil
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:
   There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling
 in ceph_fill_trace().  Consider the following scenario:
 
 Process calls stat(2).  Lookup locks parent, allocates dentry and calls
 ->lookup().  Request is created and sent over the wire.  Then we sit and
 wait for completion.  Just as the reply has arrived, process gets SIGKILL.
 OK, we get to
 /*
  * ensure we aren't running concurrently with
  * ceph_fill_trace or ceph_readdir_prepopulate, which
  * rely on locks (dir mutex) held by our caller.
  */
 	mutex_lock(&req->r_fill_mutex);
 	req->r_err = err;
 	req->r_aborted = true;
 	mutex_unlock(&req->r_fill_mutex);
 and we got there before handle_reply() grabbed ->r_fill_mutex.  Then we return
 to ceph_lookup(), drop the reference to request and bugger off.  Parent is
 unlocked by caller.
 
 In the meanwhile, there's another thread sitting in handle_reply().  It
 got ->r_fill_mutex and called ceph_fill_trace().  Had that been something
 like rename request, ceph_fill_trace() would've checked req->r_aborted and
 that would've been it.  However, we hit this:
 	} else if (req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
 		   req->r_op == CEPH_MDS_OP_MKSNAP) {
 		struct dentry *dn = req->r_dentry;
 and proceed to
 		dout(" linking snapped dir %p to dn %p\n", in, dn);
 dn = splice_dentry(dn, in, NULL, true);
 which does
 realdn = d_materialise_unique(dn, in);
 and we are in trouble - d_materialise_unique() assumes that ->i_mutex on parent
 is held, which isn't guaranteed anymore.  Not that d_delete() done a couple
 of lines earlier was any better...

Yep, that is indeed a problem.  I think we just need to do the r_aborted 
and/or r_locked_dir check in the else if condition...

 I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
 get to its splice_dentry() and d_delete() calls in similar situations -
 I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
 there as well.

...and the condition guarding readdir_prepopulate().  :)

 I might very well be missing something - that code is seriously convoluted,
 and race wouldn't be easy to hit, so I don't have anything resembling
 a candidate reproducer ;-/  IOW, this is just from RTFS and I'd really
 appreciate comments from folks familiar with ceph.

I think you're reading it correctly.  The main thing to keep in mind here 
is that we *do* need to call fill_inode() for the inode metadata on these 
requests to keep the mds and client state in sync.  The dentry state is 
safe to ignore.

It would be great to have the dir i_mutex rules summarized somewhere, even 
if it is just a copy of the below.  It took a fair bit of trial and error 
to infer what was going on when writing this code.  :)

Ping me when you've pushed that branch and I'll take a look...

Thanks!
sage





 
 VFS side of requirements is fairly simple:
 	* d_splice_alias(d, _), d_add_ci(d, _), d_add(d, _),
 d_materialise_unique(d, _), d_delete(d), d_move(_, d) should be called
 only with ->i_mutex held on the parent of d.
 	* d_move(d, _), d_add_unique(d, _), d_instantiate_unique(d, _),
 d_instantiate(d, _) should be called only with d being parentless (i.e.
 d->d_parent == d, aka. IS_ROOT(d)) or with ->i_mutex held on the parent of d.
 	* with the exception of "prepopulate dentry tree at ->get_sb() time"
 kind of situations, d_alloc(d, _) and d_alloc_name(d, _) should be called
 only with d->d_inode->i_mutex held (and it won't be too hard to get rid of
 those exceptions, actually).
 	* lookup_one_len(_, d, _) should only be called with ->i_mutex held
 on d->d_inode
 	* d_move(d1, d2) in case when d1 and d2 have different parents
 should only be called with ->s_vfs_rename_mutex held on d1->d_sb (== d2->d_sb).
 
 We are guaranteed that ->i_mutex is held on (inode of) parent of d in
 ->lookup(_, d, _)
 ->atomic_open(_, d, _, _, _, _)
 ->mkdir(_, d, _)
 ->symlink(_, d, _)
 ->create(_, d, _, _)
 ->mknod(_, d, _, _)
 ->link(_, _, d)
 ->unlink(_, d)
 ->rmdir(_, d)
 ->rename(_, d, _, _)
 ->rename(_, _, _, d)
 Note that this is *not* guaranteed for another argument of ->link() - the
 inode we are linking has ->i_mutex held, but nothing of that kind is promised
 for its parent directory.
 We also are guaranteed that ->i_mutex is held on the inode of opened
 directory passed to ->readdir() and on victims of ->unlink(), ->rmdir() and
 overwriting ->rename().
 
 FWIW, I went through that stuff this weekend and we are fairly close to having
 those requirements satisfied - I'll push a branch with the accumulated fixes
 in a few and after that we should be down to very few remaining violations and
 dubious places (ceph issues above being one