RE: format 2TB rbd device is too slow

2015-08-30 Thread Ma, Jianpeng
Ilya,
 I modified the kernel code. The patch looks like this:
[root@dev linux]# git diff
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bc67a93..e4c4ea9 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -55,7 +55,7 @@
  * universally 512 bytes.  These symbols are just slightly more
  * meaningful than the bare numbers they represent.
  */
-#define SECTOR_SHIFT	9
+#define SECTOR_SHIFT	12
 #define SECTOR_SIZE	(1ULL << SECTOR_SHIFT)

 /*
@@ -3811,6 +3811,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
blk_queue_io_min(q, segment_size);
blk_queue_io_opt(q, segment_size);

+   blk_queue_physical_block_size(q, SECTOR_SIZE);
/* enable the discard support */
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
	q->limits.discard_granularity = segment_size;

I don't know exactly why this change decreases the time.

For the -b option of mkfs.xfs, I tried 65536, but it didn't change anything.
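For reference, a small illustration (my own sketch, not part of the patch above) of how userspace tools such as mkfs.xfs see the device: they query the logical and physical sector sizes through the standard Linux block-device ioctls (BLKSSZGET/BLKPBSZGET, usually via libblkid), which is why raising the physical block size reported by rbd changes how the filesystem lays out metadata and how large its zeroing writes can be. The device path below is only an example.

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <cstdio>

int main()
{
    int fd = open("/dev/rbd0", O_RDONLY);      /* example device path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int logical = 0;                           /* BLKSSZGET: logical sector size  */
    unsigned int physical = 0;                 /* BLKPBSZGET: physical block size */
    ioctl(fd, BLKSSZGET, &logical);
    ioctl(fd, BLKPBSZGET, &physical);
    printf("logical=%d physical=%u\n", logical, physical);

    close(fd);
    return 0;
}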


Thanks!

 -Original Message-
 From: Ilya Dryomov [mailto:idryo...@gmail.com]
 Sent: Friday, August 28, 2015 5:23 PM
 To: Ma, Jianpeng
 Cc: huang jun; Haomai Wang; ceph-devel
 Subject: Re: format 2TB rbd device is too slow
 
 On Fri, Aug 28, 2015 at 10:36 AM, Ma, Jianpeng jianpeng...@intel.com
 wrote:
  Hi Ilya,
 We can change sector size from 512 to 4096. This can reduce the count
 of write.
  I did a simple test: for 900G, mkfs.xfs -f For default: 1m10s Physical
  sector size = 4096:  0m10s.
 
  But if change sector size, we need rbd meta record this.
 
 What exactly do you mean by changing the sector size?  xfs sector size or
 physical block size of the rbd device?  If the latter, meaning mkfs.xfs -s size
 4096, I fail to see how it can result in the above numbers.  If the former,
 you have to patch the kernel and I fail to see how it can result in such an
 improvement too.  I probably didn't have enough coffee, please elaborate.
 
 Thanks,
 
 Ilya

RE: format 2TB rbd device is too slow

2015-08-28 Thread Ma, Jianpeng
Hi Ilya, 
   We can change the sector size from 512 to 4096. This can reduce the number of
writes.
I did a simple test: for 900G, mkfs.xfs -f
For default: 1m10s
Physical sector size = 4096:  0m10s.

But if we change the sector size, the rbd metadata needs to record it.

Thanks!
Jianpeng
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of huang jun
 Sent: Thursday, August 27, 2015 8:44 AM
 To: Ilya Dryomov
 Cc: Haomai Wang; ceph-devel
 Subject: Re: format 2TB rbd device is too slow
 
 hi,llya
 
 2015-08-26 23:56 GMT+08:00 Ilya Dryomov idryo...@gmail.com:
  On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang haomaiw...@gmail.com
 wrote:
  On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com
 wrote:
  hi,all
  we create a 2TB rbd image, after map it to local, then we format it
  to xfs with 'mkfs.xfs /dev/rbd0', it spent 318 seconds to finish,
  but  local physical disk with the same size just need 6 seconds.
 
 
  I think librbd have two PR related to this.
 
  After debugging, we found there are two steps in the rbd module during formatting:
  a) send  233093 DELETE requests to osds(number_of_requests = 2TB /
 4MB),
 this step spent almost 92 seconds.
 
  I guess this(https://github.com/ceph/ceph/pull/4221/files) may help
 
  It's submitting deletes for non-existent objects, not zeroing.  The
  only thing that will really help here is the addition of rbd object
  map support to the kernel client.  That could happen in 4.4, but 4.5
  is a safer bet.
 
 
  b) send 4238 messages like this: [set-alloc-hint object_size 4194304
  write_size 4194304,write 0~512] to osds, that spent 227 seconds.
 
  I think kernel rbd also need to use
  https://github.com/ceph/ceph/pull/4983/files
 
  set-alloc-hint may be a problem, but I think a bigger problem is the
  size of the write.  Are all those writes 512 bytes long?
 
 In another test to format 2TB rbd device, there are :
 2 messages,each write 131072 bytes
 4000 messages, each write 262144 bytes
 112 messages, each write 4096 bytes
 194 messages, each write 512 bytes
 
 the xfs info:
 meta-data=/dev/rbd/rbd/test2t    isize=256    agcount=33, agsize=16382976 blks
          =                       sectsz=512   attr=2, projid32bit=1
          =                       crc=0
 data     =                       bsize=4096   blocks=524288000, imaxpct=5
          =                       sunit=1024   swidth=1024 blks
 naming   =version 2              bsize=4096   ascii-ci=0
 log      =internal log           bsize=4096   blocks=256000, version=2
          =                       sectsz=512   sunit=8 blks, lazy-count=1
 realtime =none                   extsz=4096   blocks=0, rtextents=0
 
  Thanks,
 
  Ilya
 
 
 
 --
 thanks
 huangjun


RE: [PATCH] fs/ceph: No need get inode of parent in ceph_open.

2015-08-18 Thread Ma, Jianpeng
Thanks very much, Zheng. I'll pay more attention to the patch format.

Jianpeng


 -Original Message-
 From: Yan, Zheng [mailto:z...@redhat.com]
 Sent: Tuesday, August 18, 2015 10:36 AM
 To: Ma, Jianpeng
 Cc: Sage Weil; ceph-devel@vger.kernel.org
 Subject: Re: [PATCH] fs/ceph: No need get inode of parent in ceph_open.
 
 
  On Aug 17, 2015, at 16:43, Ma, Jianpeng jianpeng...@intel.com wrote:
 
  For ceph_open, it already get the parent inode. So no need to get again.
 
  Signed-off-by: Jianpeng Ma jianpeng...@intel.com
  ---
  fs/ceph/file.c | 3 ---
  1 file changed, 3 deletions(-)
 
  diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 8b79d87..5ecd3dc
  100644
  --- a/fs/ceph/file.c
  +++ b/fs/ceph/file.c
  @@ -210,10 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
 	ihold(inode);
 
 	req->r_num_caps = 1;
  -	if (flags & O_CREAT)
  -		parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
 	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
  -	iput(parent_inode);
 	if (!err)
 		err = ceph_init_file(inode, file, req->r_fmode);
 	ceph_mdsc_put_request(req);
  --
 
 I fixed your patches (all tabs are replaced by spaces in your patches)  and
 added them to our testing branch
 
 Thanks
 Yan, Zheng
 
  2.4.3
 



[PATCH] fs/ceph: No need get inode of parent in ceph_open.

2015-08-17 Thread Ma, Jianpeng
For ceph_open, the parent inode has already been obtained, so there is no need to get it again.

Signed-off-by: Jianpeng Ma jianpeng...@intel.com
---
 fs/ceph/file.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8b79d87..5ecd3dc 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -210,10 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
ihold(inode);

 	req->r_num_caps = 1;
-	if (flags & O_CREAT)
-		parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
 	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
-	iput(parent_inode);
 	if (!err)
 		err = ceph_init_file(inode, file, req->r_fmode);
 	ceph_mdsc_put_request(req);
--
2.4.3



[PATCH] fs/ceph: cleanup code.

2015-08-17 Thread Ma, Jianpeng

Signed-off-by: Jianpeng Ma jianpeng...@intel.com
---
 fs/ceph/mds_client.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 6aa07af..0c65d77 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2520,8 +2520,7 @@ out_err:
 		if (err) {
 			req->r_err = err;
 		} else {
-			req->r_reply = msg;
-			ceph_msg_get(msg);
+			req->r_reply = ceph_msg_get(msg);
 			req->r_got_result = true;
 		}
 	} else {
--
2.4.3



[PATCH ] fs/ceph: Remove the useless judgement.

2015-08-17 Thread Ma, Jianpeng

For ret != 0, it is already handled above, so skip this check.

Signed-off-by: Jianpeng Ma jianpeng...@intel.com
---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5ecd3dc..17cc357 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -276,7 +276,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
 	if (err)
 		goto out_req;
 
-	if (err == 0 && (flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
+	if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
 		err = ceph_handle_notrace_create(dir, dentry);
 
 	if (d_unhashed(dentry)) {
--
2.4.3



[PATCH] ceph: Use file->f_flags rather than flags to check O_CREAT.

2015-08-13 Thread Ma, Jianpeng
Because O_CREAT has been removed from flags, we should use file->f_flags instead.

Signed-off-by: Jianpeng Ma jianpeng...@intel.com
---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8b79d87..e1347cf 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -210,7 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
ihold(inode);

 	req->r_num_caps = 1;
-	if (flags & O_CREAT)
+	if (file->f_flags & O_CREAT)
 		parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
 	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
 	iput(parent_inode);
--
2.4.3


RE: [Questions]Can client know which OSDs are storing the data?

2015-01-26 Thread Ma, Jianpeng
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Dennis Chen
 Sent: Tuesday, January 27, 2015 3:07 PM
 To: ceph-devel@vger.kernel.org; Dennis Chen
 Subject: [Questions]Can client know which OSDs are storing the data?
 
 Hello Guys,
 
 My question is very rude and direct, a little bit stupid maybe ;-)
 
 Question a: a client write a file to the cluster (supposing replica = 3), so 
 the
 data will be stored in 3 OSDs within the cluster, can I get the information of
 which OSDs storing the file data in client side?
 
ceph osd map <poolname> <objectname>
displays the PG and the OSDs which store the object.

 Question b: can the object data still be replicated if I store a object from 
 client
 with RADOS API?
Yes, the RADOS API is also a client, so the object is still replicated.
 
 Thank you guys !
 --
 Den

RE: Ceph data consistency

2015-01-06 Thread Ma, Jianpeng
 -- Forwarded message --
 From: Paweł Sadowski c...@sadziu.pl
 Date: 2014-12-30 21:40 GMT+08:00
 Subject: Re: Ceph data consistency
 To: Vijayendra Shamanna vijayendra.shama...@sandisk.com,
 ceph-devel@vger.kernel.org ceph-devel@vger.kernel.org
 
 
 On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
  Hi,
 
  There is a sync thread (sync_entry in FileStore.cc) which triggers
  periodically and executes sync_filesystem() to ensure that the data is
  consistent. The journal entries are trimmed only after a successful
  sync_filesystem() call
 
 sync_filesystem() always returns zero and journal will be trimmed.
 Executing sync()/syncfs() with dirty data in disk buffers will result in data 
 loss
 (lost page write due to I/O error).
 
Hi Sage,

From the git log, I see that sync_filesystem() originally returned the result of
syncfs(). But in commit 808c644248e486f44 ("Improve use of syncfs. Test syncfs
return value and fallback to btrfs sync and then sync.") the author hoped that
when syncfs() hits an error, sync() can recover; and because sync() does not
return a result, sync_filesystem() now only returns zero.
But which error can actually be handled that way? AFAIK, none.
I suggest it directly return the result of syncfs().
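A minimal sketch of what I mean (hypothetical code, not the actual FileStore
implementation): return syncfs()'s result directly and keep the sync() fallback
only for the case where syncfs() is not available, so real I/O errors are no
longer hidden behind an unconditional return 0.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE        /* for syncfs() on glibc */
#endif
#include <unistd.h>
#include <cerrno>

static int sync_filesystem_sketch(int fd)
{
  int r = ::syncfs(fd);
  if (r == 0)
    return 0;
  if (errno == ENOSYS) {
    // syscall not available: fall back to sync(), which reports nothing,
    // so success can only be assumed
    ::sync();
    return 0;
  }
  return -errno;   // e.g. -EIO: let the caller decide how to react
}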

Jianpeng Ma
Thanks!


 I was doing some experiments simulating disk errors using Device Mapper
 error target. In this setup OSD was writing to broken disk without crashing.
 Every 5 seconds (filestore_max_sync_interval) kernel logs that some data were
 discarded due to IO error.
 
 
  Thanks
  Viju
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Pawel Sadowski
  Sent: Tuesday, December 30, 2014 1:52 PM
  To: ceph-devel@vger.kernel.org
  Subject: Ceph data consistency
 
  Hi,
 
  On our Ceph cluster from time to time we have some inconsistent PGs (after
 deep-scrub). We have some issues with disk/sata cables/lsi controller causing
 IO errors from time to time (but that's not the point in this case).
 
  When IO error occurs on OSD journal partition everything works as is should
 - OSD is crashed and that's ok - Ceph will handle that.
 
  But when IO error occurs on OSD data partition during journal flush OSD
 continue to work. After calling *writev* (in buffer::list::write_fd) OSD does
 check return code from this call but does NOT verify if write has been 
 successful
 to disk (data are still only in memory and there is no fsync). That way OSD
 thinks that data has been stored on disk but it might be discarded (during 
 sync
 dirty page will be reclaimed and you'll see lost page write due to I/O 
 error in
 dmesg).
 
  Since there is no checksumming of data I just wanted to make sure that this
 is by design. Maybe there is a way to tell OSD to call fsync after write and 
 have
 data consistent?
 
 --
 PS


RE: Ceph data consistency

2015-01-06 Thread Ma, Jianpeng
Hi Sage,
   Pull request is https://github.com/ceph/ceph/pull/3305.

Thanks!
Jianpeng Ma

 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Wednesday, January 7, 2015 10:18 AM
 To: Ma, Jianpeng
 Cc: c...@sadziu.pl; vijayendra.shama...@sandisk.com;
 ceph-devel@vger.kernel.org
 Subject: RE: Ceph data consistency
 
 On Wed, 7 Jan 2015, Ma, Jianpeng wrote:
   -- Forwarded message --
    From: Paweł Sadowski c...@sadziu.pl
   Date: 2014-12-30 21:40 GMT+08:00
   Subject: Re: Ceph data consistency
   To: Vijayendra Shamanna vijayendra.shama...@sandisk.com,
   ceph-devel@vger.kernel.org ceph-devel@vger.kernel.org
  
  
   On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
Hi,
   
There is a sync thread (sync_entry in FileStore.cc) which triggers
periodically and executes sync_filesystem() to ensure that the
data is consistent. The journal entries are trimmed only after a
successful
sync_filesystem() call
  
   sync_filesystem() always returns zero and journal will be trimmed.
   Executing sync()/syncfs() with dirty data in disk buffers will
   result in data loss (lost page write due to I/O error).
  
  Hi sage:
 
  From the git log, I see at first sync_filesystem() return the result of 
  syncfs().
  But in this commit 808c644248e486f44:
  Improve use of syncfs.
  Test syncfs return value and fallback to btrfs sync and then sync.
  The author hope if syncfs() met error and sync() can resolve. Because
  sync() don't return result So it only return zero.
  But which error can handle by this way? AFAK, no.
  I suggest it directly return result of syncfs().
 
 Yeah, that sounds right!
 
 sage
 
 
 
  Jianpeng Ma
  Thanks!
 
 
   I was doing some experiments simulating disk errors using Device
   Mapper error target. In this setup OSD was writing to broken disk
 without crashing.
   Every 5 seconds (filestore_max_sync_interval) kernel logs that some
   data were discarded due to IO error.
  
  
Thanks
Viju
-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Pawel
Sadowski
Sent: Tuesday, December 30, 2014 1:52 PM
To: ceph-devel@vger.kernel.org
Subject: Ceph data consistency
   
Hi,
   
On our Ceph cluster from time to time we have some inconsistent
PGs (after
   deep-scrub). We have some issues with disk/sata cables/lsi
   controller causing IO errors from time to time (but that's not the point 
   in
 this case).
   
When IO error occurs on OSD journal partition everything works as
is should
   - OSD is crashed and that's ok - Ceph will handle that.
   
But when IO error occurs on OSD data partition during journal
flush OSD
   continue to work. After calling *writev* (in buffer::list::write_fd)
   OSD does check return code from this call but does NOT verify if
   write has been successful to disk (data are still only in memory
   and there is no fsync). That way OSD thinks that data has been
   stored on disk but it might be discarded (during sync dirty page
   will be reclaimed and you'll see lost page write due to I/O error in 
   dmesg).
   
Since there is no checksumming of data I just wanted to make sure
that this
   is by design. Maybe there is a way to tell OSD to call fsync after
   write and have data consistent?
  
   --
   PS


RE: Latency Improvement Report for ShardedOpWQ

2014-09-28 Thread Ma, Jianpeng
Hi Somnath:
You mentioned: "There is still one global lock we have; this is to protect
pg_for_processing() and this we can't get rid of since we need to maintain op
order within a pg."

But for most object operations we only need to maintain ordering per object.
Why do we need to maintain op order within a PG?
Can you explain in detail?
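To check my understanding of the design being discussed, here is a rough sketch
(my own illustration, not the actual ShardedOpWQ code) of how per-PG ordering can
survive without one global queue lock: ops are hashed by PG id onto a fixed
shard, each shard has its own lock and FIFO, so two ops for the same PG always
land in the same queue in arrival order, while ops for different PGs usually
contend on different locks.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct Op { uint64_t pg_id; /* payload omitted */ };

class ShardedQueueSketch {
  struct Shard {
    std::mutex lock;
    std::deque<Op> ops;                 // FIFO keeps per-PG arrival order
  };
  std::vector<Shard> shards;
public:
  explicit ShardedQueueSketch(size_t n) : shards(n) {}

  void enqueue(const Op &op) {
    Shard &s = shards[op.pg_id % shards.size()];   // same PG -> same shard
    std::lock_guard<std::mutex> g(s.lock);
    s.ops.push_back(op);
  }

  bool dequeue(size_t shard_idx, Op &out) {
    Shard &s = shards[shard_idx];
    std::lock_guard<std::mutex> g(s.lock);
    if (s.ops.empty())
      return false;
    out = s.ops.front();
    s.ops.pop_front();
    return true;
  }
};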

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
 Sent: Sunday, September 28, 2014 5:02 PM
 To: Dong Yuan
 Cc: ceph-devel
 Subject: RE: Latency Improvement Report for ShardedOpWQ
 
 Dong,
 This is mostly because of lock contention may be.
 You can tweak the number of shards in case of sharded WQ to see if it is
 improving this number or not.
 There is still one global lock we have; this is to protect 
 pg_for_processing() and
 this we can't get rid of since we need to maintain op order within a pg. This
 could be increasing latency as well. I would suggest you to measure this
 number in different stages within ShardedOpWQ::_process() like after dequeue
 from pqueue and after getting the pglock and popping the ops from
 pg_for_processing().
 
 Also, keep in mind there is context switch happening and this could be
 expensive depending on the data copy etc. It's worth trying this experiment by
 pinning OSD to may be actual physical cores ?
 
 Thanks  Regards
 Somnath
 
 -Original Message-
 From: Dong Yuan [mailto:yuandong1...@gmail.com]
 Sent: Sunday, September 28, 2014 12:19 AM
 To: Somnath Roy
 Cc: ceph-devel
 Subject: Re: Latency Improvement Report for ShardedOpWQ
 
 Hi Somnath,
 
 I totally agree with you.
 
 I read the code about  sharded TP and the new OSD OpWQ. In the new
 implementation, there is not  single lock for all PGs, but each lock
 for a subset of PGs(Am I right?).   It is very useful to reduce lock
 contention and so increase parallelism. It is an awesome work!
 
 While I am working on the latency of single IO (mainly 4K random write), I
 notice the OpWQ spent about 100+us to transfer an IO from msg dispatcher to
 OpWQ worker thread, Do you have any idea to reduce the time span?
 
 Thanks for your help.
 Dong.
 
 On 28 September 2014 13:46, Somnath Roy somnath@sandisk.com
 wrote:
  Hi Dong,
  I don't think in case of single client scenario there is much benefit. 
  Single
 client has a limitation. The benefit with sharded TP is, a single OSD is 
 scaling
 much more with the increase of clients since it is increasing parallelism (by
 reducing lock contention) in the filestore level. A quick check could be like 
 this.
 
  1. Create a single node, single OSD cluster and try putting load with
 increasing number of clients like 1,3, 5, 8,10. Small workload serving from
 memory should be ideal.
  2. Compare the code with sharded TP against say firefly. You should be 
  seeing
 firefly is not scaling with increasing number of clients.
  3. try top -H on two different case and you should be seeing more threads in
 case of sharded tp were working in parallel than firefly.
 
  Also, I am sure this latency result will not hold true in high workload , 
  there
 you should be seeing more contention and as a result more latency.
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Dong Yuan
  Sent: Saturday, September 27, 2014 8:45 PM
  To: ceph-devel
  Subject: Latency Improvement Report for ShardedOpWQ
 
  = Test Purpose =
 
  Measure whether and how much Sharded OpWQ is better than Traditional
 OpWQ for random write scene.
 
  = Test Case =
 
  4K Object WriteFull for 1w times.
 
  = Test Method =
 
  Put the following static probes into codes when running tests to get the 
  time
 span between enqeueue and dequeue of OpWQ.
 
  Start: PG::enqueue_op before osd-op_wq.equeue call
  End: OSD::dequeue_op.entry
 
  = Test Result =
 
  Traditional OpWQ: 109us(AVG), 40us(MIN)
  ShardedOpWQ: 97us(AVG), 32us(MIN)
 
  = Test Conclusion =
 
  No Remarkably Improvement for Latency
 
 
  --
  Dong Yuan
  Email:yuandong1...@gmail.com
 
  
 

RE: [PATCH 2/3] ec: make use of added aligned buffers

2014-09-16 Thread Ma, Jianpeng
 -Original Message-
 From: Loic Dachary [mailto:l...@dachary.org]
 Sent: Tuesday, September 16, 2014 2:47 PM
 To: Ma, Jianpeng
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers
 
 
 
 On 16/09/2014 02:08, Ma, Jianpeng wrote:
  -Original Message-
  From: Sage Weil [mailto:sw...@redhat.com]
  Sent: Tuesday, September 16, 2014 8:02 AM
  To: Ma, Jianpeng
  Cc: Loic Dachary; Janne Grunau; ceph-devel@vger.kernel.org
  Subject: RE: [PATCH 2/3] ec: make use of added aligned buffers
 
  On Mon, 15 Sep 2014, Ma, Jianpeng wrote:
  If we modify bufferlist::c_str()  to bufferlist::c_str(bool align).
  If (align)
Posix_memalign(data, CEPH_PAGE_SIZE, len) Else
Origin code.
 
  Alignment isn't really a bool, it's an int.  c_str(int align=1) ?
 
  I mean if we need a align memory after bufferlist::c_str. We can set
  the align = true; Now bufferlist::c_str depend on the size when using 
  rebuild if
 bufferlist have more prt.
 
  BTW, we also can change the rebuild()  rebuild(ptr). Merge two func into
 one rebuild(bool align).
  By judge the parameter align to alloc align memory or not.
 
  Or am I missing something?
 
 I don't think there is a need for c_str(int align). We make every effort to 
 allocate
 buffers that are properly aligned. If c_str() does not return an aligned 
 buffer,
 the proper fix is to align the allocated buffer at the source, not to 
 allocate a
 new aligned buffer and copy the content of the non aligned buffer into it.
 

 Do you see a reason why that would be a bad way to deal with alignment ?
  We only guarantee memory alignment after c_str() and do not depend on the
 previous steps.
 Even if every ptr in encoded[i] is aligned memory, how do we guarantee aligned
 memory after c_str()?
 If the bufferlist has more than one ptr, whether the rebuilt memory is aligned
 depends on the size of the bufferlist.

Jianpeng
 
 Cheers
 
 
  Jianpeng
  sage
 
 
  I think this is simple and correctly code.
 
  Jianpeng
  Thanks!
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
  Sent: Tuesday, September 16, 2014 1:20 AM
  To: Janne Grunau; ceph-devel@vger.kernel.org
  Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers
 
  Hi Janne,
 
  See below:
 
  On 15/09/2014 17:55, Janne Grunau wrote:
  Requiring page aligned buffers and realigning the input if
  necessary creates measurable oberhead.
 ceph_erasure_code_benchmark
  is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.
 
  Also prevents a misaligned buffer when
  bufferlist::c_str(bufferlist) has to allocate a new buffer to
  provide continuous one. See bug #9408
 
  Signed-off-by: Janne Grunau j...@jannau.net
  ---
   src/erasure-code/ErasureCode.cc | 46
  +
   src/erasure-code/ErasureCode.h  |  3 ++-
   2 files changed, 30 insertions(+), 19 deletions(-)
 
  diff --git a/src/erasure-code/ErasureCode.cc
  b/src/erasure-code/ErasureCode.cc index 5953f49..078f60b 100644
  --- a/src/erasure-code/ErasureCode.cc
  +++ b/src/erasure-code/ErasureCode.cc
  @@ -54,22 +54,38 @@ int
  ErasureCode::minimum_to_decode_with_cost(const
  setint want_to_read,  }
 
   int ErasureCode::encode_prepare(const bufferlist raw,
  -bufferlist *prepared) const
  +mapint, bufferlist encoded)
  const
   {
 unsigned int k = get_data_chunk_count();
 unsigned int m = get_chunk_count() - k;
 unsigned blocksize = get_chunk_size(raw.length());
  -  unsigned padded_length = blocksize * k;
  -  *prepared = raw;
  -  if (padded_length - raw.length()  0) {
  -bufferptr pad(padded_length - raw.length());
  -pad.zero();
  -prepared-push_back(pad);
  +  unsigned pad_len = blocksize * k - raw.length();
  +
  +  bufferlist prepared = raw;
  +
  +  if (!prepared.is_aligned()) {
  +prepared.rebuild_aligned();
  +  }
  +
  +  for (unsigned int i = 0; i  k - !!pad_len; i++) {
  +int chunk_index = chunk_mapping.size()  0 ? chunk_mapping[i] : i;
  +bufferlist chunk = encoded[chunk_index];
  +chunk.substr_of(prepared, i * blocksize, blocksize);  }
 
  It is possible for more than one chunk to be padding. It's a border
  case but... for instance with alignment = 16, k=12 and in of length
  1550 you end up with two padding chunks because the blocksize is 144.
 
  +  if (pad_len  0) {
  +int chunk_index = chunk_mapping.size()  0 ? chunk_mapping[k
  + - 1] : k -
  1;
  +bufferlist chunk = encoded[chunk_index];
  +bufferptr padded(buffer::create_aligned(blocksize));
  +raw.copy((k - 1) * blocksize, blocksize - pad_len, padded.c_str());
  +padded.zero(blocksize - pad_len, pad_len);
  +chunk.push_back(padded);
 }
  -  unsigned coding_length = blocksize * m;
  -  bufferptr coding(buffer::create_page_aligned(coding_length));
  -  prepared-push_back(coding);
  -  prepared-rebuild_page_aligned

RE: [PATCH 2/3] ec: make use of added aligned buffers

2014-09-16 Thread Ma, Jianpeng
I see it. Thanks very much!

Jianpeng
 -Original Message-
 From: Loic Dachary [mailto:l...@dachary.org]
 Sent: Tuesday, September 16, 2014 3:55 PM
 To: Ma, Jianpeng
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers
 
 
 
 On 16/09/2014 08:59, Ma, Jianpeng wrote:
  -Original Message-
  From: Loic Dachary [mailto:l...@dachary.org]
  Sent: Tuesday, September 16, 2014 2:47 PM
  To: Ma, Jianpeng
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers
 
 
 
  On 16/09/2014 02:08, Ma, Jianpeng wrote:
  -Original Message-
  From: Sage Weil [mailto:sw...@redhat.com]
  Sent: Tuesday, September 16, 2014 8:02 AM
  To: Ma, Jianpeng
  Cc: Loic Dachary; Janne Grunau; ceph-devel@vger.kernel.org
  Subject: RE: [PATCH 2/3] ec: make use of added aligned buffers
 
  On Mon, 15 Sep 2014, Ma, Jianpeng wrote:
  If we modify bufferlist::c_str()  to bufferlist::c_str(bool align).
  If (align)
Posix_memalign(data, CEPH_PAGE_SIZE, len) Else
Origin code.
 
  Alignment isn't really a bool, it's an int.  c_str(int align=1) ?
 
  I mean if we need a align memory after bufferlist::c_str. We can set
  the align = true; Now bufferlist::c_str depend on the size when
  using rebuild if
  bufferlist have more prt.
 
  BTW, we also can change the rebuild()  rebuild(ptr). Merge two
  func into
  one rebuild(bool align).
  By judge the parameter align to alloc align memory or not.
 
  Or am I missing something?
 
  I don't think there is a need for c_str(int align). We make every
  effort to allocate buffers that are properly aligned. If c_str() does
  not return an aligned buffer, the proper fix is to align the
  allocated buffer at the source, not to allocate a new aligned buffer and 
  copy
 the content of the non aligned buffer into it.
 
 
  Do you see a reason why that would be a bad way to deal with alignment ?
   We only guarantee memory align after c_str and don't depend on the
 previous steps.
 
 This is a benefit indeed: less room for error in general.
 
  If encode[i] have more ptr suppose they all aligned memory.
  But how to guarantee aligned memory after c_str.
  If bufferlist have more ptr, the aligned memory depend on the size of
 bufferlist.
 
 The ErasureCode::encode_prepare prepares the allocated buffer so that
 
 *) it holds enough room to contain all chunks
 *) c_str() on a chunk boundary will return a pointer that does not require
 allocating memory
 
 The ErasureCodeJerasure::get_chunk_size function determines the size of the
 buffer allocated by encode_prepare and further guarantees that each chunk is:
 
 *) size aligned on 16 bytes
 *) memory aligned on 16 bytes so that SIMD instructions can use it without
 moving it
 
 In other words, if c_str() had to re-allocate the buffer because it is not 
 aligned,
 it would mean that the allocation of the buffer is not done as it should.
 

 Cheers
 
  Jianpeng
 
  Cheers
 
 
  Jianpeng
  sage
 
 
  I think this is simple and correctly code.
 
  Jianpeng
  Thanks!
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic
  Dachary
  Sent: Tuesday, September 16, 2014 1:20 AM
  To: Janne Grunau; ceph-devel@vger.kernel.org
  Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers
 
  Hi Janne,
 
  See below:
 
  On 15/09/2014 17:55, Janne Grunau wrote:
  Requiring page aligned buffers and realigning the input if
  necessary creates measurable oberhead.
  ceph_erasure_code_benchmark
  is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.
 
  Also prevents a misaligned buffer when
  bufferlist::c_str(bufferlist) has to allocate a new buffer to
  provide continuous one. See bug #9408
 
  Signed-off-by: Janne Grunau j...@jannau.net
  ---
   src/erasure-code/ErasureCode.cc | 46
  +
   src/erasure-code/ErasureCode.h  |  3 ++-
   2 files changed, 30 insertions(+), 19 deletions(-)
 
  diff --git a/src/erasure-code/ErasureCode.cc
  b/src/erasure-code/ErasureCode.cc index 5953f49..078f60b 100644
  --- a/src/erasure-code/ErasureCode.cc
  +++ b/src/erasure-code/ErasureCode.cc
  @@ -54,22 +54,38 @@ int
  ErasureCode::minimum_to_decode_with_cost(const
  setint want_to_read,  }
 
   int ErasureCode::encode_prepare(const bufferlist raw,
  -bufferlist *prepared) const
  +mapint, bufferlist
 encoded)
  const
   {
 unsigned int k = get_data_chunk_count();
 unsigned int m = get_chunk_count() - k;
 unsigned blocksize = get_chunk_size(raw.length());
  -  unsigned padded_length = blocksize * k;
  -  *prepared = raw;
  -  if (padded_length - raw.length()  0) {
  -bufferptr pad(padded_length - raw.length());
  -pad.zero();
  -prepared-push_back(pad);
  +  unsigned pad_len = blocksize * k - raw.length();
  +
  +  bufferlist prepared = raw

RE: [PATCH 2/3] ec: make use of added aligned buffers

2014-09-15 Thread Ma, Jianpeng
If we modify bufferlist::c_str() to bufferlist::c_str(bool align):
if (align)
  posix_memalign(data, CEPH_PAGE_SIZE, len)
else
  original code.

I think this is simple and correct code.
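To make the proposal concrete, a rough sketch of the idea (hypothetical and
heavily simplified, not the real bufferlist interface): the flatten/rebuild step
that c_str() performs when the list holds more than one ptr takes an alignment
flag, and allocates with posix_memalign() when alignment is requested.

#include <cstdlib>
#include <cstring>

static const size_t PAGE_SIZE_SKETCH = 4096;   // stand-in for CEPH_PAGE_SIZE

struct flat_buffer {                           // simplified stand-in for bufferlist
  char *data = nullptr;
  size_t len = 0;

  // align == false: original behaviour (plain malloc)
  // align == true : page-aligned allocation via posix_memalign()
  bool flatten(const char *src, size_t n, bool align) {
    char *out = nullptr;
    if (align) {
      if (posix_memalign(reinterpret_cast<void **>(&out), PAGE_SIZE_SKETCH, n) != 0)
        return false;
    } else {
      out = static_cast<char *>(malloc(n));
      if (!out)
        return false;
    }
    memcpy(out, src, n);
    free(data);
    data = out;
    len = n;
    return true;
  }

  ~flat_buffer() { free(data); }
};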

Jianpeng
Thanks!

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
 Sent: Tuesday, September 16, 2014 1:20 AM
 To: Janne Grunau; ceph-devel@vger.kernel.org
 Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers
 
 Hi Janne,
 
 See below:
 
 On 15/09/2014 17:55, Janne Grunau wrote:
  Requiring page aligned buffers and realigning the input if necessary
  creates measurable overhead. ceph_erasure_code_benchmark
  is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.
 
  Also prevents a misaligned buffer when bufferlist::c_str(bufferlist)
  has to allocate a new buffer to provide continuous one. See bug #9408
 
  Signed-off-by: Janne Grunau j...@jannau.net
  ---
   src/erasure-code/ErasureCode.cc | 46
  +
   src/erasure-code/ErasureCode.h  |  3 ++-
   2 files changed, 30 insertions(+), 19 deletions(-)
 
 diff --git a/src/erasure-code/ErasureCode.cc b/src/erasure-code/ErasureCode.cc
 index 5953f49..078f60b 100644
 --- a/src/erasure-code/ErasureCode.cc
 +++ b/src/erasure-code/ErasureCode.cc
 @@ -54,22 +54,38 @@ int ErasureCode::minimum_to_decode_with_cost(const set<int> &want_to_read,
  }
 
  int ErasureCode::encode_prepare(const bufferlist &raw,
 -                                bufferlist *prepared) const
 +                                map<int, bufferlist> &encoded) const
  {
    unsigned int k = get_data_chunk_count();
    unsigned int m = get_chunk_count() - k;
    unsigned blocksize = get_chunk_size(raw.length());
 -  unsigned padded_length = blocksize * k;
 -  *prepared = raw;
 -  if (padded_length - raw.length() > 0) {
 -    bufferptr pad(padded_length - raw.length());
 -    pad.zero();
 -    prepared->push_back(pad);
 +  unsigned pad_len = blocksize * k - raw.length();
 +
 +  bufferlist prepared = raw;
 +
 +  if (!prepared.is_aligned()) {
 +    prepared.rebuild_aligned();
 +  }
 +
 +  for (unsigned int i = 0; i < k - !!pad_len; i++) {
 +    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
 +    bufferlist &chunk = encoded[chunk_index];
 +    chunk.substr_of(prepared, i * blocksize, blocksize);
 +  }
 
 It is possible for more than one chunk to be padding. It's a border case 
 but... for
 instance with alignment = 16, k=12 and in of length 1550 you end up with two
 padding chunks because the blocksize is 144.
 
 +  if (pad_len > 0) {
 +    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[k - 1] : k - 1;
 +    bufferlist &chunk = encoded[chunk_index];
 +    bufferptr padded(buffer::create_aligned(blocksize));
 +    raw.copy((k - 1) * blocksize, blocksize - pad_len, padded.c_str());
 +    padded.zero(blocksize - pad_len, pad_len);
 +    chunk.push_back(padded);
    }
 -  unsigned coding_length = blocksize * m;
 -  bufferptr coding(buffer::create_page_aligned(coding_length));
 -  prepared->push_back(coding);
 -  prepared->rebuild_page_aligned();
 +  for (unsigned int i = k; i < k + m; i++) {
 +    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
 +    bufferlist &chunk = encoded[chunk_index];
 +    chunk.push_back(buffer::create_aligned(blocksize));
 +  }
 +
    return 0;
  }
 
 @@ -80,15 +96,9 @@ int ErasureCode::encode(const set<int> &want_to_encode,
    unsigned int k = get_data_chunk_count();
    unsigned int m = get_chunk_count() - k;
    bufferlist out;
 -  int err = encode_prepare(in, out);
 +  int err = encode_prepare(in, *encoded);
    if (err)
      return err;
 -  unsigned blocksize = get_chunk_size(in.length());
 -  for (unsigned int i = 0; i < k + m; i++) {
 -    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
 -    bufferlist &chunk = (*encoded)[chunk_index];
 -    chunk.substr_of(out, i * blocksize, blocksize);
 -  }
    encode_chunks(want_to_encode, encoded);
    for (unsigned int i = 0; i < k + m; i++) {
      if (want_to_encode.count(i) == 0)
 
 diff --git a/src/erasure-code/ErasureCode.h b/src/erasure-code/ErasureCode.h
 index 7aaea95..62aa383 100644
 --- a/src/erasure-code/ErasureCode.h
 +++ b/src/erasure-code/ErasureCode.h
 @@ -46,7 +46,8 @@ namespace ceph {
                                      const map<int, int> &available,
                                      set<int> *minimum);
 
 -    int encode_prepare(const bufferlist &raw, bufferlist *prepared) const;
 +    int encode_prepare(const bufferlist &raw,
 +                       map<int, bufferlist> &encoded) const;
 
      virtual int encode(const set<int> &want_to_encode,
                         const bufferlist &in,
 
 --
 Loïc Dachary, Artisan Logiciel Libre


RE: [PATCH 2/3] ec: make use of added aligned buffers

2014-09-15 Thread Ma, Jianpeng
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Tuesday, September 16, 2014 8:02 AM
 To: Ma, Jianpeng
 Cc: Loic Dachary; Janne Grunau; ceph-devel@vger.kernel.org
 Subject: RE: [PATCH 2/3] ec: make use of added aligned buffers
 
 On Mon, 15 Sep 2014, Ma, Jianpeng wrote:
  If we modify bufferlist::c_str()  to bufferlist::c_str(bool align).
  If (align)
Posix_memalign(data, CEPH_PAGE_SIZE, len) Else
Origin code.
 
 Alignment isn't really a bool, it's an int.  c_str(int align=1) ?
 
I mean that if we need aligned memory after bufferlist::c_str(), we can pass
align = true.
Currently, whether bufferlist::c_str() rebuilds into aligned memory depends on
the size, when the bufferlist has more than one ptr.

BTW, we could also change rebuild() & rebuild(ptr), merging the two functions
into one rebuild(bool align), and use the align parameter to decide whether to
allocate aligned memory.

Or am I missing something?

Jianpeng
 sage
 
 
  I think this is simple and correctly code.
 
  Jianpeng
  Thanks!
 
   -Original Message-
   From: ceph-devel-ow...@vger.kernel.org
   [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
   Sent: Tuesday, September 16, 2014 1:20 AM
   To: Janne Grunau; ceph-devel@vger.kernel.org
   Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers
  
   Hi Janne,
  
   See below:
  
   On 15/09/2014 17:55, Janne Grunau wrote:
Requiring page aligned buffers and realigning the input if
necessary creates measurable oberhead. ceph_erasure_code_benchmark
is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.
   
Also prevents a misaligned buffer when
bufferlist::c_str(bufferlist) has to allocate a new buffer to
provide continuous one. See bug #9408
   
Signed-off-by: Janne Grunau j...@jannau.net
---
 src/erasure-code/ErasureCode.cc | 46
+
 src/erasure-code/ErasureCode.h  |  3 ++-
 2 files changed, 30 insertions(+), 19 deletions(-)
   
diff --git a/src/erasure-code/ErasureCode.cc
b/src/erasure-code/ErasureCode.cc index 5953f49..078f60b 100644
--- a/src/erasure-code/ErasureCode.cc
+++ b/src/erasure-code/ErasureCode.cc
@@ -54,22 +54,38 @@ int
   ErasureCode::minimum_to_decode_with_cost(const
setint want_to_read,  }
   
 int ErasureCode::encode_prepare(const bufferlist raw,
-bufferlist *prepared) const
+mapint, bufferlist encoded)
   const
 {
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   unsigned blocksize = get_chunk_size(raw.length());
-  unsigned padded_length = blocksize * k;
-  *prepared = raw;
-  if (padded_length - raw.length()  0) {
-bufferptr pad(padded_length - raw.length());
-pad.zero();
-prepared-push_back(pad);
+  unsigned pad_len = blocksize * k - raw.length();
+
+  bufferlist prepared = raw;
+
+  if (!prepared.is_aligned()) {
+prepared.rebuild_aligned();
+  }
+
+  for (unsigned int i = 0; i  k - !!pad_len; i++) {
+int chunk_index = chunk_mapping.size()  0 ? chunk_mapping[i] : i;
+bufferlist chunk = encoded[chunk_index];
+chunk.substr_of(prepared, i * blocksize, blocksize);  }
  
   It is possible for more than one chunk to be padding. It's a border
   case but... for instance with alignment = 16, k=12 and in of length
   1550 you end up with two padding chunks because the blocksize is 144.
  
+  if (pad_len  0) {
+int chunk_index = chunk_mapping.size()  0 ? chunk_mapping[k
+ - 1] : k -
   1;
+bufferlist chunk = encoded[chunk_index];
+bufferptr padded(buffer::create_aligned(blocksize));
+raw.copy((k - 1) * blocksize, blocksize - pad_len, padded.c_str());
+padded.zero(blocksize - pad_len, pad_len);
+chunk.push_back(padded);
   }
-  unsigned coding_length = blocksize * m;
-  bufferptr coding(buffer::create_page_aligned(coding_length));
-  prepared-push_back(coding);
-  prepared-rebuild_page_aligned();
+  for (unsigned int i = k; i  k + m; i++) {
+int chunk_index = chunk_mapping.size()  0 ? chunk_mapping[i] : i;
+bufferlist chunk = encoded[chunk_index];
+chunk.push_back(buffer::create_aligned(blocksize));
+  }
+
   return 0;
 }
   
@@ -80,15 +96,9 @@ int ErasureCode::encode(const setint
   want_to_encode,
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   bufferlist out;
-  int err = encode_prepare(in, out);
+  int err = encode_prepare(in, *encoded);
   if (err)
 return err;
-  unsigned blocksize = get_chunk_size(in.length());
-  for (unsigned int i = 0; i  k + m; i++) {
-int chunk_index = chunk_mapping.size()  0 ? chunk_mapping[i] : i;
-bufferlist chunk = (*encoded)[chunk_index];
-chunk.substr_of(out, i

RE: Deadlock in ceph journal

2014-08-24 Thread Ma, Jianpeng
Hi all,
   Over the weekend, I read the kernel code about aio & direct I/O. For close(),
it doesn't wait for aio to complete.
But for fsync(), it will wait for all aio to complete.
   Mark used this patch (which calls fsync() in write_thread_entry) and the
result looks good.
   I want to revert the patch that avoids aio when closing the journal, and use
fsync() instead. It makes the code simpler.
   How about this?
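A tiny sketch of what I have in mind (illustrative only, not the FileJournal
code): before the journal write thread returns, fsync() the journal fd, relying
on the behaviour above that fsync() waits for in-flight aio on the file while
close() does not.

#include <unistd.h>
#include <cassert>

// Hypothetical helper, called right before write_thread_entry() exits, so
// that no aio submitted against the O_DIRECT journal fd is still pending
// when the journal is closed.
static void drain_journal_writes(int journal_fd)
{
  int r = ::fsync(journal_fd);
  assert(r == 0);   // an error here means the journal is not stable on disk
}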

Thanks!
Jianpeng

 ceph-devel@vger.kernel.org
 Subject: Re: Deadlock in ceph journal
 
 On 23/08/14 10:22, Somnath Roy wrote:
  I think it is using direct io for non-aio mode as well.
 
  Thanks  Regards
  Somnath
 
 
  One thing that does still concern me - if I understand what is happening 
  here
 correctly: we write to the journal using aio until we want to stop doing 
 writes
 (presumably pre closing it), then use normal io to write at that point.
 
  Given that we appear to be using direct io whenever we use aio, does this
 mean we end up mixing direct and buffered io to the journal [1] (or is the
 normal i.e non aio write still using direct io)?
 
 
 
 Thanks Somnath,
 
 I think you are correct (I missed the bit in FileJournal::_open that seems to
 cover this):
 
 
if (forwrite) {
  flags = O_RDWR;
  if (directio)
flags |= O_DIRECT | O_DSYNC;
 
 
 i.e the journal is opened with DIRECT, so all writes (async or not) will
 be direct.
 
 Cheers
 
 Mark


RE: Deadlock in ceph journal

2014-08-24 Thread Ma, Jianpeng
The attachment is the patch.

Jianpeng

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Monday, August 25, 2014 10:15 AM
 To: Ma, Jianpeng
 Cc: Mark Kirkwood; Somnath Roy; Samuel Just (sam.j...@inktank.com);
 ceph-devel@vger.kernel.org
 Subject: RE: Deadlock in ceph journal
 
 Sounds good. Can you send a patch?
 
 sage
 
 
 On Mon, 25 Aug 2014, Ma, Jianpeng wrote:
 
  Hi all,
 At weekend, I read the kernel code about aio  direction. For close(), it
 don't wait aio to complete.
  But for fsync(), it will wait all aio to complete.
 Mark used this patch(which using fsync() on write_thread_entry) and the
 result is looks good.
 I want to revert the patch which don't use aio when closing journal. And
 using fsync(). It make the code simple.
 How about this?
 
  Thanks!
  Jianpeng
 
   ceph-devel@vger.kernel.org
   Subject: Re: Deadlock in ceph journal
  
   On 23/08/14 10:22, Somnath Roy wrote:
I think it is using direct io for non-aio mode as well.
   
Thanks  Regards
Somnath
   
  
One thing that does still concern me - if I understand what is
happening here
   correctly: we write to the journal using aio until we want to stop
   doing writes (presumably pre closing it), then use normal io to write at 
   that
 point.
   
Given that we appear to be using direct io whenever we use aio,
does this
   mean we end up mixing direct and buffered io to the journal [1] (or
   is the normal i.e non aio write still using direct io)?
   
   
  
   Thanks Somnath,
  
   I think you are correct (I missed the bit in FileJournal::_open that
   seems to cover this):
  
  
  if (forwrite) {
flags = O_RDWR;
if (directio)
  flags |= O_DIRECT | O_DSYNC;
  
  
   i.e the journal is opened with DIRECT, so all writes (async or not)
   will be direct.
  
   Cheers
  
   Mark
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel
  in the body of a message to majord...@vger.kernel.org More majordomo
  info at  http://vger.kernel.org/majordomo-info.html
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in the 
 body
 of a message to majord...@vger.kernel.org More majordomo info at
 http://vger.kernel.org/majordomo-info.html


0001-os-FileJournal-Before-write_thread_entry-exit-using-.patch
Description: 0001-os-FileJournal-Before-write_thread_entry-exit-using-.patch


RE: Deadlock in ceph journal

2014-08-20 Thread Ma, Jianpeng
Yes. Maybe after io_submit we must use io_getevents; otherwise the result is
undefined.
If stop_write == true, we don't use aio. How about this way?

Jianpeng

 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Wednesday, August 20, 2014 11:34 PM
 To: Somnath Roy
 Cc: Samuel Just (sam.j...@inktank.com); ceph-devel@vger.kernel.org; Mark
 Kirkwood; Ma, Jianpeng
 Subject: RE: Deadlock in ceph journal
 
 I suspect what is really needed is a drain_aio() function that will wait for 
 all
 pending aio ops to complete on shutdown.  What happens to those IOs if the
 process exists while they are in flight is probably undefined; we should just
 avoid doing that.
 
 sage
 
 
 On Wed, 20 Aug 2014, Somnath Roy wrote:
 
  I will also take the patch and test it out.
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Sage Weil [mailto:sw...@redhat.com]
  Sent: Tuesday, August 19, 2014 9:51 PM
  To: Somnath Roy
  Cc: Samuel Just (sam.j...@inktank.com); ceph-devel@vger.kernel.org;
  Mark Kirkwood; jianpeng...@intel.com
  Subject: RE: Deadlock in ceph journal
 
  On Wed, 20 Aug 2014, Somnath Roy wrote:
   Thanks Sage !
   So, the latest master should have the fix, right ?
 
  The original patch that caused the regression is reverted, but we'd like to
 reapply it if we sort out the issues.  wip-filejournal has the offending 
 patch and
 your fix.. but I'm eager to hear if Jianpeng and Mark can confirm it's
 complete/correct or if there is still a problem.
 
  sage
 
  
   Regards
   Somnath
  
   -Original Message-
   From: Sage Weil [mailto:sw...@redhat.com]
   Sent: Tuesday, August 19, 2014 8:55 PM
   To: Somnath Roy
   Cc: Samuel Just (sam.j...@inktank.com); ceph-devel@vger.kernel.org;
   Mark Kirkwood; jianpeng...@intel.com
   Subject: RE: Deadlock in ceph journal
  
   [Copying ceph-devel, dropping ceph-users]
  
   Yeah, that looks like a bug.  I pushed wip-filejournal that reapplies
 Jianpeng's original patch and this one.  I'm not certain about last other
 suggested fix, though, but I'm hoping that this fix explains the strange 
 behavior
 Jianpeng and Mark have seen?
  
   sage
  
  
   On Wed, 20 Aug 2014, Somnath Roy wrote:
   
I think this is the issue..
   
   
   
http://tracker.ceph.com/issues/9073
   
   
   
Thanks  Regards
   
Somnath
   
   
   
From: Somnath Roy
Sent: Tuesday, August 19, 2014 6:25 PM
To: Sage Weil (s...@inktank.com); Samuel Just
(sam.j...@inktank.com)
Cc: ceph-us...@lists.ceph.com
Subject: Deadlock in ceph journal
   
   
   
Hi Sage/Sam,
   
During our testing we found a potential deadlock scenario in the
filestore journal code base. This is happening because of two reason.
   
   
   
1.   This is because code is not signaling aio_cond from
check_aio_completion() in case seq = 0
   
2.   Following changes in the write_thread_entry() is allowing a
very first header write with seq = 0.
   
   if (writeq.empty()  !must_write_header) {
   
   
   
   
   
Now, during ceph-deploy activate this is what happening.
   
   
   
1. The very first write of header with seq = 0 issued and it is
waiting for aio completion. So, aio_num = 1.
   
2. superblock write came in and got into while (aio_num  0) block
of
write_thread_entry() and waiting on the aio_cond
   
3. The seq = 0 aio completed but not setting completed_something =
true and as a result aio_cond is not signaled.
   
4. write_thread_entry() is getting into deadlock.
   
   
   
This is a timing problem and if header write is returned before
superblock write this will not happen and will be happening in
case of block journal device only (aio is enabled).
   
   
   
Here is the log snippet we are getting.
   
   
   
2014-08-19 12:59:10.029363 7f60fa33b700 10 journal
write_thread_entry start
   
2014-08-19 12:59:10.029395 7f60fa33b700 20 journal
prepare_multi_write queue_pos now 4096
   
2014-08-19 12:59:10.029427 7f60fa33b700 15 journal do_aio_write
writing
4096~0 + header
   
2014-08-19 12:59:10.029439 7f60fa33b700 20 journal write_aio_bl
0~4096 seq 0
   
2014-08-19 12:59:10.029442 7f60f9339700 10 journal
write_finish_thread_entry enter
   
2014-08-19 12:59:10.029466 7f60fa33b700 20 journal write_aio_bl ..
0~4096 in
1
   
2014-08-19 12:59:10.029498 7f60fa33b700 20 journal write_aio_bl
4096~0 seq 0
   
2014-08-19 12:59:10.029505 7f60fa33b700  5 journal put_throttle
finished 0 ops and 0 bytes, now 0 ops and 0 bytes
   
2014-08-19 12:59:10.029510 7f60fa33b700 20 journal
write_thread_entry going to sleep
   
2014-08-19 12:59:10.029538 7f60ff178800 10 journal journal_start
   
2014-08-19 12:59:10.029566 7f60f9339700 20 journal
write_finish_thread_entry waiting for aio(s)
   
2014-08-19 12:59:10.029726 7f60ff178800 15
filestore(/var/lib/ceph/tmp

RE: A bug in FileStore::mount and A problem about XFS

2014-08-14 Thread Ma, Jianpeng
I think you are looking at old code; I can't find FS_TYPE_OTHER.


 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of n...@cs.hku.hk
 Sent: Thursday, August 14, 2014 12:08 PM
 To: ceph-devel@vger.kernel.org
 Subject: A bug in FileStore::mount and A problem about XFS
 
 
 Dear developers,
 
 I find a bug in FileStore::mount() function. If we use raw partitial as a 
 Journal
 disk, like /dev/sda1, then it will cause a error when setting the parameters
 m_filestore_max_inline_xattr_size and m_filestore_max_inline_xattrs.
 
 First of all, when creating the FileStore, it calls the mount() and detect 
 fs_type
 as XFS, so it will check m_filestore_replica_fadvise(true default) and set it 
 to
 false. At the same time, m_fs_tpye = FS_TYPE_XFS.
 
 However, when create JournalStore, it also enters mount() function and detect
 fs_type. Unfourtuanlly, this time, m_filestore_replica_fadvise = false so 
 that it
 cannot update m_fs_type = FS_TYPE_XFS. Instread, it will sets to m_fs_tpye =
 FS_TYPE_OTHER. This bug is quite easy to repair.
 
 
 By the way, I wanna ask another question about XFS, I find that if I use ceph
 cluster heavily for about 2-3 hour. XFS always wakes up a kernel thread 
 (xfsaild)
 to deal with its inodes reading and writing.
 This will impact the performance a lot and make the cluster's throughput drop
 heavily. So how to solve this problem and why this occurs after running a 
 heavily
 read, write operation after 2-3 hour.
 
 Best Regards,
 Neal Yao


RE: ISA erasure code plugin and cache

2014-08-04 Thread Ma, Jianpeng
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
 Sent: Monday, August 4, 2014 8:37 PM
 To: Andreas Joachim Peters
 Cc: Ma, Jianpeng; Ceph Development
 Subject: Re: ISA erasure code plugin and cache
 
 
 
 On 04/08/2014 14:15, Andreas Joachim Peters wrote: Hi Loic,
 
  the background relevant to your comments have (unfortunately) never been
 answered on the mailing list.
 
  The cache is written in a way, that it is useful for a fixed (k,m) 
  combination
 and thread-safe.
 
  So, if there is one instance of the plugin per pool, it is the right
 implementation. If there is (as you write) one instance for each PG in any 
 pool,
 it is sort of 'stupid' because the same encoding table is stored for each PG
 seperatly.
 
 I would not called it stupid ;-) Just not as efficient as if the cache was by 
 all PGs.
 Without cache the decode table has to be calculated for each object in the
 placement group and there are a lot of objects. The table may be duplicated
 hundreds of time so it has an impact on the memory footprint, but it should 
 not
 have a visible impact on the decode performances. An optimisation of your
 implementation to save memory would be nice, but it is not critical.
 
AFAIK, for a recovering PG, all the objects of the PG have the same lost chunks,
so only the first object misses the cache; the later ones won't.
It only needs one entry in the cache.
Or am I missing something?
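Rough illustration of the point (my own sketch, not the ErasureCodeIsa
implementation): if the decode-table cache is keyed by the set of erased chunk
indices, then within one recovering PG only the first object misses, because
every later object is missing exactly the same shards.

#include <map>
#include <mutex>
#include <set>
#include <vector>

using erasure_signature = std::set<int>;             // which chunk indices are lost
using decode_table      = std::vector<unsigned char>;

class DecodeTableCache {
  std::mutex lock;
  std::map<erasure_signature, decode_table> tables;  // the real code bounds this (LRU)
public:
  bool get(const erasure_signature &sig, decode_table &out) {
    std::lock_guard<std::mutex> g(lock);
    auto it = tables.find(sig);
    if (it == tables.end())
      return false;                                  // first object of the PG: miss
    out = it->second;                                // later objects: copy to caller
    return true;
  }
  void put(const erasure_signature &sig, const decode_table &tbl) {
    std::lock_guard<std::mutex> g(lock);
    tables[sig] = tbl;
  }
};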

Jianpeng Ma

 How large are the decode tables ?
 
  So if the final statement is to create one plugin instance per PG, I will 
  change
 it accordingly and shared encoding  decoding tables for a fixed (k,m), if 
 not it
 can stay.
 
  Just need to know that ... this boils down to the fact, that encoding 
 decoding should not be considered 'stateless'.
 
  Cheers Andreas.
 
 
  
  From: Loic Dachary [l...@dachary.org]
  Sent: 04 August 2014 13:56
  To: Andreas Joachim Peters
  Cc: Ma, Jianpeng; Ceph Development
  Subject: ISA erasure code plugin and cache
 
  Hi,
 
  Here is how I understand the current code:
 
  When an OSD is missing, recovery is required and the primary OSD will 
  collect
 the available chunks to do so. It will then call the decode method via
 ECUtil::decode which is a small wrapper around the corresponding
 ErasureCodeInterface::decode method.
 
 https://github.com/ceph/ceph/blob/master/src/osd/ECBackend.cc#L361
 
  The ISA plugin will then use the isa_decode method to perform the work
 
 
 https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCode
 Isa.cc#L212
 
  and will be repeatedly called until all objects in the PGs that were 
  relying on
 the missing OSD are recovered. To avoid computing the decoding table for each
 object, it is stored in a LRU cache
 
 
 https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCode
 Isa.cc#L480
 
  and copied in the stack if already there:
 
 
 https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCode
 Isa.cc#L433
 
  Each PG has a separate instance of ErasureCodeIsa, obtained when it is
 created:
 
 https://github.com/ceph/ceph/blob/master/src/osd/PGBackend.cc#L292
 
  It means that data members of each ErasureCodeIsa are copied as many
 times as there are PGs. If an OSD handles participates in 200 PG that belong 
 to
 an erasure coded pool configured to use ErasureCodeIsa, the data members
 will be duplicated 200 times.
 
  It is good practice to make it so that the encode/decode methods of
 ErasureCodeIsa are thread safe. In the jerasure plugin these methods have no
 side effect on the object. In the isa plugin the LRU cache storing the decode
 tables is modified by the decode method and guarded by a mutex:
 
 get
 https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCode
 Isa.cc#L281
 put
 https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCode
 Isa.cc#L310
 
  Please correct me if I'm mistaken ;-) I've not reviewed the code yet and 
  try to
 find problems, I just wanted to make sure I get the intention before doing so.
 
  Cheers
  --
  Loïc Dachary, Artisan Logiciel Libre
 
 
 --
 Loïc Dachary, Artisan Logiciel Libre



[PATCH] ECBackend: Don't directly use get_recovery_chunk_size() in the RecoveryOp::WRITING state.

2014-07-30 Thread Ma, Jianpeng
We cannot guarantee that conf->osd_recovery_max_chunk doesn't change while
recovering an erasure-coded object.
If it changes between RecoveryOp::READING and RecoveryOp::WRITING, it can cause
this bug:

2014-07-30 10:12:09.599220 7f7ff26c0700 -1 osd/ECBackend.cc: In function
'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp,
RecoveryMessages*)' thread 7f7ff26c0700 time 2014-07-30 10:12:09.596837
osd/ECBackend.cc: 529: FAILED assert(pop.data.length() ==
sinfo.aligned_logical_offset_to_chunk_offset(
after_progress.data_recovered_to -
op.recovery_progress.data_recovered_to))

 ceph version 0.83-383-g3cfda57
(3cfda577b15039cb5c678b79bef3e561df826ed1)
 1: 
(ECBackend::continue_recovery_op(ECBackend::RecoveryOp,RecoveryMessages*)+0x1a50)
 [0x928070]
 2: (ECBackend::handle_recovery_read_complete(hobject_t const,
boost::tuples::tupleunsigned long, unsigned long, std::mappg_shard_t,
ceph::buffer::list, std::lesspg_shard_t,
std::allocatorstd::pairpg_shard_t const, ceph::buffer::list  ,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::optionalstd::mapstd::string,
ceph::buffer::list, std::lessstd::string,
std::allocatorstd::pairstd::string const, ceph::buffer::list   ,
RecoveryMessages*)+0x90c) [0x92952c]
 3: (OnRecoveryReadComplete::finish(std::pairRecoveryMessages*,
ECBackend::read_result_t)+0x121) [0x938481]
 4: (GenContextstd::pairRecoveryMessages*,
ECBackend::read_result_t::complete(std::pairRecoveryMessages*,
ECBackend::read_result_t)+0x9) [0x929d69]
 5: (ECBackend::complete_read_op(ECBackend::ReadOp,RecoveryMessages*)+0x63) 
[0x91c6e3]
 6: (ECBackend::handle_sub_read_reply(pg_shard_t, 
ECSubReadReply,RecoveryMessages*)+0x96d) [0x920b4d]
 7: (ECBackend::handle_message(std::tr1::shared_ptrOpRequest)+0x17e)[0x92884e]
 8: 
(ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,ThreadPool::TPHandle)+0x23b)
 [0x7b34db]
 9: (OSD::dequeue_op(boost::intrusive_ptrPG,std::tr1::shared_ptrOpRequest, 
ThreadPool::TPHandle)+0x428)
[0x638d58]
 10: (OSD::ShardedOpWQ::_process(unsigned int,ceph::heartbeat_handle_d*)+0x346) 
[0x6392f6]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce)[0xa5caae]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa5ed00]
 13: (()+0x8182) [0x7f800b5d3182]
 14: (clone()+0x6d) [0x7f800997430d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

So we only call get_recovery_chunk_size() in RecoveryOp::READING and
record the value in RecoveryOp::extent_requested.

Signed-off-by: Ma Jianpeng jianpeng...@intel.com
---
 src/osd/ECBackend.cc | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/osd/ECBackend.cc b/src/osd/ECBackend.cc
index 2a3913a..e9402e8 100644
--- a/src/osd/ECBackend.cc
+++ b/src/osd/ECBackend.cc
@@ -476,6 +476,7 @@ void ECBackend::continue_recovery_op(
   assert(!op.recovery_progress.data_complete);
   set<int> want(op.missing_on_shards.begin(), op.missing_on_shards.end());
   set<pg_shard_t> to_read;
+  uint64_t recovery_max_chunk = get_recovery_chunk_size();
   int r = get_min_avail_to_read_shards(
     op.hoid, want, true, &to_read);
   if (r != 0) {
@@ -492,11 +493,11 @@ void ECBackend::continue_recovery_op(
     this,
     op.hoid,
     op.recovery_progress.data_recovered_to,
-    get_recovery_chunk_size(),
+    recovery_max_chunk,
     to_read,
     op.recovery_progress.first);
   op.extent_requested = make_pair(op.recovery_progress.data_recovered_to,
-                                  get_recovery_chunk_size());
+                                  recovery_max_chunk);
   dout(10) << __func__ << ": IDLE return " << op << dendl;
   return;
 }
@@ -506,7 +507,7 @@ void ECBackend::continue_recovery_op(
   assert(op.returned_data.size());
   op.state = RecoveryOp::WRITING;
   ObjectRecoveryProgress after_progress = op.recovery_progress;
-  after_progress.data_recovered_to += get_recovery_chunk_size();
+  after_progress.data_recovered_to += op.extent_requested.second;
   after_progress.first = false;
   if (after_progress.data_recovered_to >= op.obc->obs.oi.size) {
     after_progress.data_recovered_to =
--
1.9.1
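
For reference, a minimal stand-alone illustration (hypothetical names, not Ceph
code) of the race this patch closes: osd_recovery_max_chunk can be changed at
runtime, so sampling it once per state instead of once per RecoveryOp can yield
two different values for the same op:

// Toy reduction of the bug: the configurable chunk size is read twice and an
// admin changes it in between, so READING and WRITING disagree on the length.
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<uint64_t> osd_recovery_max_chunk(8 << 20);  // changeable at runtime

uint64_t get_recovery_chunk_size() { return osd_recovery_max_chunk.load(); }

int main() {
  uint64_t extent_requested = get_recovery_chunk_size();  // READING: remember it
  osd_recovery_max_chunk = 1 << 20;                       // admin injects a new value
  uint64_t recomputed = get_recovery_chunk_size();        // what the buggy WRITING path used
  assert(extent_requested != recomputed);                 // the two no longer agree
  // The fix: advance data_recovered_to by the remembered extent_requested,
  // never by a freshly recomputed chunk size.
  return 0;
}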



[PATCH] ECBackend: Use ROUND_UP_TO to refactor get_recovery_chunk_size()

2014-07-30 Thread Ma, Jianpeng
Signed-off-by: Ma Jianpeng jianpeng...@intel.com
---
 src/osd/ECBackend.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/osd/ECBackend.h b/src/osd/ECBackend.h
index 28bcf8a..0e96407 100644
--- a/src/osd/ECBackend.h
+++ b/src/osd/ECBackend.h
@@ -149,10 +149,8 @@ public:
 private:
   friend struct ECRecoveryHandle;
   uint64_t get_recovery_chunk_size() const {
-    uint64_t max = cct->_conf->osd_recovery_max_chunk;
-    max -= max % sinfo.get_stripe_width();
-    max += sinfo.get_stripe_width();
-    return max;
+    return ROUND_UP_TO(cct->_conf->osd_recovery_max_chunk,
+                       sinfo.get_stripe_width());
   }

   /**
--
1.9.1
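
For reference, a tiny stand-alone sketch of the rounding this helper performs;
round_up_to below is my own stand-in and only an assumption about what Ceph's
ROUND_UP_TO macro does for the common, unaligned case:

#include <cassert>
#include <cstdint>

// Assumed semantics: round n up to the next multiple of d (n > 0, d > 0).
static uint64_t round_up_to(uint64_t n, uint64_t d) {
  return ((n + d - 1) / d) * d;
}

int main() {
  // e.g. osd_recovery_max_chunk = 8 MiB, stripe width = 24 KiB (k=6, 4 KiB chunks)
  uint64_t max_chunk = 8 * 1024 * 1024;
  uint64_t stripe_width = 24 * 1024;
  uint64_t sz = round_up_to(max_chunk, stripe_width);
  assert(sz % stripe_width == 0 && sz >= max_chunk);  // recovery reads whole stripes
  return 0;
}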


RE: Pull Request for ISA EC plug-in

2014-07-30 Thread Ma, Jianpeng
Hi,
   On my machine I also hit this bug, but with the following change it works:
diff --git a/src/erasure-code/isa/Makefile.am b/src/erasure-code/isa/Makefile.am
index eebffea..dee593b 100644
--- a/src/erasure-code/isa/Makefile.am
+++ b/src/erasure-code/isa/Makefile.am
@@ -34,7 +34,7 @@ libec_isa_la_SOURCES = ${isa_sources}

 libec_isa_la_CFLAGS = ${AM_CFLAGS} -I $(srcdir)/erasure-code/isa/isa-l/include/
 libec_isa_la_CXXFLAGS = ${AM_CXXFLAGS} -I $(srcdir)/erasure-code/isa/isa-l/include/
-libec_isa_la_CCASFLAGS = ${AM_CCASFLAGS} -I $(srcdir)/erasure-code/isa/isa-l/include/
+libec_isa_la_CCASFLAGS = ${AM_CCASFLAGS} -i $(srcdir)/erasure-code/isa/isa-l/include

 libec_isa_la_LIBADD = $(LIBCRUSH) $(PTHREAD_LIBS) $(EXTRALIBS)
 libec_isa_la_LDFLAGS = ${AM_LDFLAGS} -version-info 2:10:0


As Andreas said, the difference is -I vs. -i.
But how do we choose? My yasm version is 1.2.0.

Thanks!



 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Thursday, July 31, 2014 7:02 AM
 To: Andreas Joachim Peters
 Cc: ceph-devel@vger.kernel.org
 Subject: RE: Pull Request for ISA EC plug-in
 
 On Wed, 30 Jul 2014, Andreas Joachim Peters wrote:
  Hi Sage,
 
  my guess is that it comes from the yasm include option, which may be
  non-standard. Mine documents '-i' and '-I', while I found on the web
  that probably only '-I' is the portable one.
 
  I have changed that, still I can only guess, since I cannot reproduce.
 
 Hmm, how are you building it?  Running make on my work box (precise) also
 fails.  I modified yasm-wrapper slightly[1] and it gets further, but still
 no dice...
 
 Thanks!
 sage
 
 
 [1] https://github.com/ceph/ceph/commit/wip-ec-isa
 
 
 
  Please rerun.
 
  Thanks, Andreas.
 
  
  From: Sage Weil [sw...@redhat.com]
  Sent: 30 July 2014 19:46
  To: Andreas Joachim Peters
  Cc: ceph-devel@vger.kernel.org
  Subject: RE: Pull Request for ISA EC plug-in
 
  On Mon, 28 Jul 2014, Andreas Joachim Peters wrote:
   Hi Sage,
  
   I fixed that. I missed '$(srcdir)' in the assembler and C/C++
   include statements (I always compiled in the src dir ..)
  
   You can retry.
 
  I think there are still a few issues:
 
  make[4]: ***
  [erasure-code/isa/isa-l/erasure_code/libec_isa_la-ec_base.lo] Error 1
  make[4]: *** Waiting for unfinished jobs
  ./yasm-wrapper: yasm -f elf64 -i ./erasure-code/isa/isa-l/include/
  erasure-code/isa/isa-l/erasure_code/ec_multibinary.asm.s -o
  erasure-code/isa/isa-l/erasure_code/.libs/libec_isa_la-ec_multibinary.
  asm.o
  FATAL: yasm: unable to open include file `reg_sizes.asm'
  ./yasm-wrapper: yasm -f elf64 -i ./erasure-code/isa/isa-l/include/
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s -o
  erasure-code/isa/isa-l/erasure_code/.libs/libec_isa_la-gf_2vect_dot_pr
  od_avx2.asm.o
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:157:
  undefined symbol `gf_2vect_dot_prod_avx2.return_fail' (first use)
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:157:
  (Each undefined symbol is reported only once.)
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:161:
  instruction expected after label
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:178:
  instruction expected after label
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:179:
  instruction expected after label
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:183:
  instruction expected after label
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:184:
  instruction expected after label
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:206:
  undefined symbol `vperm2i128.next_vect' (first use)
  error:
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:213:
  undefined symbol `vperm2i128.loop32' (first use)
  make[4]: ***
  [erasure-code/isa/isa-l/erasure_code/libec_isa_la-ec_multibinary.asm.l
  o] Error 1
  make[4]: ***
  [erasure-code/isa/isa-l/erasure_code/libec_isa_la-gf_2vect_dot_prod_av
  x2.asm.lo] Error 1
  ./yasm-wrapper: yasm -f elf64 -i ./erasure-code/isa/isa-l/include/
  erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx.asm.s -o
  erasure-code/isa/isa-l/erasure_code/.libs/libec_isa_la-gf_2vect_dot_pr
  od_avx.asm.o In file included from
  erasure-code/isa/isa-l/erasure_code/ec_highlevel_func.c:30:0:
  ./erasure-code/isa/isa-l/include/erasure_code.h:52:25: fatal error:
  gf_vect_mul.h: No such file or directory compilation terminated.
 
  This is from
 
 http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-precise-amd64-basic/log.cgi?log=c7e765e22427958c0d03fbfec42a29aa34895ef3.
 
  Thanks!
  sage
 
 
 
 
  
   Thanks Andreas.
  
   
   From: ceph-devel-ow...@vger.kernel.org
   [ceph-devel-ow...@vger.kernel.org] on 

RE: [RFC][PATCH] osd: Add local_connection to fast_dispatch in func _send_boot.

2014-07-21 Thread Ma, Jianpeng
 Hrm, I'd really like to see the startup sequence. I see the crash occurring, 
 but I
 don't understand how it's happening — we test this pretty extensively so there
 must be something about your testing configuration that is different than 
 ours.
 Can you provide that part of the log, and maybe a little more description of
 what you think the problem is?

If ceph.conf contains a cluster addr, the bug always occurs.
When there is no cluster addr in ceph.conf, the local connection is added to
fast dispatch in _send_boot() via cluster_messenger->set_addr_unknowns().

 
 In particular, we *always* call init_local_connection when the messenger
 starts, so every messenger who is allowed to receive EC messages should have
 the local connection set up before they get one.
Yes, you call init_local_connection. But the local_connection can only be added
to fast dispatch after the OSD has been added to the messenger as a dispatcher.
In OSD::init:
cluster_messenger->add_dispatcher_head(this);
Only after this can the local_connection be added to fast dispatch, because once
local_connection has the correct type it can be added regardless of the cluster
addr.
When a Messenger is allocated it only sets the type; only after
add_dispatcher_head/tail can the local connection be added to fast dispatch.
Maybe adding ms_deliver_handle_fast_connect(local_connection.get()) in
SimpleMessenger::ready() would be better.
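
A rough sketch of that suggestion (untested; it assumes the same local_connection
member and ms_deliver_handle_fast_connect() call used by the patch above, and the
existing body of ready() is elided):

void SimpleMessenger::ready()
{
  // ... existing ready() body ...

  // Proposed: by the time ready() runs, add_dispatcher_head/tail has registered
  // the dispatchers, so the loopback connection can be handed to fast dispatch
  // here, independent of whether ceph.conf sets a cluster addr, instead of
  // doing it in OSD::_send_boot().
  if (local_connection->get_priv() == NULL)
    ms_deliver_handle_fast_connect(local_connection.get());
}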


Jianpeng Ma

 I don't really see how supplying the local connection as a new one in
 _send_boot *should* be fixing that, and it's not the place to do so (although 
 I
 guess it's doing *something*, I just can't figure out what).


 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 On Wed, Jul 16, 2014 at 5:17 PM, Ma, Jianpeng jianpeng...@intel.com
 wrote:
  Hi Greg,
 The attachment is the log.
 
  Thanks!
 
  -Original Message-
  From: Gregory Farnum [mailto:g...@inktank.com]
  Sent: Thursday, July 17, 2014 3:41 AM
  To: Ma, Jianpeng
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: [RFC][PATCH] osd: Add local_connection to fast_dispatch in func
 _send_boot.
 
  I'm looking at this and getting a little confused. Can you provide a
  log of the crash occurring? (preferably with debug_ms=20,
  debug_osd=20)
  -Greg
  Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
  On Sun, Jul 13, 2014 at 8:17 PM, Ma, Jianpeng jianpeng...@intel.com
 wrote:
  When do ec-read, i met a bug which was occured 100%. The messages are:
  2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
  'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700
  time
  2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)
 
   ceph version 0.82-585-g79f3f67
  (79f3f6749122ce2944baa70541949d7ca75525e6)
   1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
   2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
   3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
   4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
   5: (()+0x8182) [0x7f7665670182]
   6: (clone()+0x6d) [0x7f7663a1130d]
   NOTE: a copy of the executable, or `objdump -rdS executable` is needed
 to interpret this.
 
  In commit 69fc6b2b66, it enable fast_dispatch on local connections
  and it will add local_connection to fast_dispatch in func
 init_local_connection.
  But if there is no fast-dispatch, the local connection can't add.
 
  If there is no clutser addr in ceph.conf, it will add
  local_connection to fast dispatch in func _send_boot because the
 cluster_addr is empty.
  But if there is cluster addr, local_connection can't add to fast dispatch.
 
  For ECSubRead, it send to itself by func send_message_osd_cluster so
  it will cause this bug.
 
  I don't know about hb_back/front_server_messenger. But they are in
  _send_boot like cluster_messenger, so i also modified those.
 
  Signed-off-by: Ma Jianpeng jianpeng...@intel.com
  ---
   src/osd/OSD.cc | 14 +++---
   1 file changed, 11 insertions(+), 3 deletions(-)
 
  diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc index 52a3839..75b294b
  100644
  --- a/src/osd/OSD.cc
  +++ b/src/osd/OSD.cc
  @@ -3852,29 +3852,37 @@ void OSD::_send_boot()  {
 dout(10)  _send_boot  dendl;
 entity_addr_t cluster_addr = cluster_messenger-get_myaddr();
  +  Connection *local_connection =
  + cluster_messenger-get_loopback_connection().get();
 if (cluster_addr.is_blank_ip()) {
   int port = cluster_addr.get_port();
   cluster_addr = client_messenger-get_myaddr();
   cluster_addr.set_port(port);
   cluster_messenger-set_addr_unknowns(cluster_addr);
   dout(10)   assuming cluster_addr ip matches client_addr 
  dendl;
  -  }
  +  } else if (local_connection-get_priv() == NULL)
  +
  + cluster_messenger-ms_deliver_handle_fast_connect(local_connection)
  + ;
  +
 entity_addr_t hb_back_addr =
  hb_back_server_messenger-get_myaddr();
  +  local_connection =
  + hb_back_server_messenger-get_loopback_connection().get();
 if (hb_back_addr.is_blank_ip()) {
   int port

[PATCH] os/FileJournal: When dumping the journal, use the correct last seq to avoid misjudging the journal as corrupt.

2014-07-21 Thread Ma, Jianpeng
In FileJournal::dump, seq=0 is always used as the last seq, which misjudges the
journal as corrupt.

Signed-off-by: Ma Jianpeng jianpeng...@intel.com
---
 src/os/FileJournal.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc
index 7eb7927..3307655 100644
--- a/src/os/FileJournal.cc
+++ b/src/os/FileJournal.cc
@@ -564,9 +564,9 @@ int FileJournal::dump(ostream& out)
   JSONFormatter f(true);
 
   f.open_array_section("journal");
+  uint64_t seq = 0;
   while (1) {
     bufferlist bl;
-    uint64_t seq = 0;
     uint64_t pos = read_pos;
     if (!read_entry(bl, seq)) {
       dout(3) << "journal_replay: end of journal, done." << dendl;
--
1.9.1



RE: [RFC][PATCH] osd: Add local_connection to fast_dispatch in func _send_boot.

2014-07-16 Thread Ma, Jianpeng
Ping...

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Monday, July 14, 2014 11:17 AM
To: g...@inktank.com
Cc: ceph-devel@vger.kernel.org
Subject: [RFC][PATCH] osd: Add local_connection to fast_dispatch in func 
_send_boot.

When doing an EC read, I hit a bug that occurred 100% of the time. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function 'virtual 
void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)

 ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
 1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
 2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
 3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
 4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
 5: (()+0x8182) [0x7f7665670182]
 6: (clone()+0x6d) [0x7f7663a1130d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.

Commit 69fc6b2b66 enabled fast_dispatch on local connections and adds the
local_connection to fast_dispatch in init_local_connection.
But if no fast dispatcher is registered yet, the local connection can't be added.

If there is no cluster addr in ceph.conf, the local_connection is added to fast
dispatch in _send_boot because cluster_addr is empty.
But if there is a cluster addr, the local_connection is never added to fast dispatch.

ECSubRead is sent to itself via send_message_osd_cluster, so it triggers this bug.

I don't know about hb_back/front_server_messenger, but they are handled in
_send_boot like cluster_messenger, so I modified those as well.

Signed-off-by: Ma Jianpeng jianpeng...@intel.com
---
 src/osd/OSD.cc | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 52a3839..75b294b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -3852,29 +3852,37 @@ void OSD::_send_boot()
 {
   dout(10) << "_send_boot" << dendl;
   entity_addr_t cluster_addr = cluster_messenger->get_myaddr();
+  Connection *local_connection = cluster_messenger->get_loopback_connection().get();
   if (cluster_addr.is_blank_ip()) {
     int port = cluster_addr.get_port();
     cluster_addr = client_messenger->get_myaddr();
     cluster_addr.set_port(port);
     cluster_messenger->set_addr_unknowns(cluster_addr);
     dout(10) << " assuming cluster_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    cluster_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_back_addr = hb_back_server_messenger->get_myaddr();
+  local_connection = hb_back_server_messenger->get_loopback_connection().get();
   if (hb_back_addr.is_blank_ip()) {
     int port = hb_back_addr.get_port();
     hb_back_addr = cluster_addr;
     hb_back_addr.set_port(port);
     hb_back_server_messenger->set_addr_unknowns(hb_back_addr);
     dout(10) << " assuming hb_back_addr ip matches cluster_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    hb_back_server_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_front_addr = hb_front_server_messenger->get_myaddr();
+  local_connection = hb_front_server_messenger->get_loopback_connection().get();
   if (hb_front_addr.is_blank_ip()) {
     int port = hb_front_addr.get_port();
     hb_front_addr = client_messenger->get_myaddr();
     hb_front_addr.set_port(port);
     hb_front_server_messenger->set_addr_unknowns(hb_front_addr);
     dout(10) << " assuming hb_front_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    hb_front_server_messenger->ms_deliver_handle_fast_connect(local_connection);
 
   MOSDBoot *mboot = new MOSDBoot(superblock, service.get_boot_epoch(),
                                  hb_back_addr, hb_front_addr, cluster_addr);
--
1.9.1



[PATCH] qa/workunits/erasure-code/bench.sh: Use the correct parameter names when calling ceph_erasure_code_benchmark.

2014-07-15 Thread Ma, Jianpeng
In commit bcc74879a, Loic modified the parameter names but forgot to update
them in qa/workunits/erasure-code/bench.sh. Update the names so bench.sh
works again.

Signed-off-by: Ma Jianpeng jianpeng...@intel.com
---
 qa/workunits/erasure-code/bench.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/qa/workunits/erasure-code/bench.sh 
b/qa/workunits/erasure-code/bench.sh
index db7c490..483450b 100755
--- a/qa/workunits/erasure-code/bench.sh
+++ b/qa/workunits/erasure-code/bench.sh
@@ -50,9 +50,9 @@ function bench() {
 --iterations $iterations \
 --size $size \
 --erasures $erasures \
---parameter erasure-code-k=$k \
---parameter erasure-code-m=$m \
---parameter erasure-code-directory=$PLUGIN_DIRECTORY)
+--parameter k=$k \
+--parameter m=$m \
+--parameter directory=$PLUGIN_DIRECTORY)
 result=$($command $@)
 echo -e 
$result\t$plugin\t$k\t$m\t$workload\t$iterations\t$size\t$erasures\t$command 
$@
 }
--
1.9.1



[RFC][PATCH] osd: Add local_connection to fast_dispatch in func _send_boot.

2014-07-13 Thread Ma, Jianpeng
When doing an EC read, I hit a bug that occurred 100% of the time. The messages are:
2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function
'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time
2014-07-14 10:03:07.316782 osd/OSD.cc: 5019: FAILED assert(session)

 ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
 1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
 2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
 3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
 4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
 5: (()+0x8182) [0x7f7665670182]
 6: (clone()+0x6d) [0x7f7663a1130d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.

Commit 69fc6b2b66 enabled fast_dispatch on local connections and adds the
local_connection to fast_dispatch in init_local_connection.
But if no fast dispatcher is registered yet, the local connection can't be added.

If there is no cluster addr in ceph.conf, the local_connection is added to fast
dispatch in _send_boot because cluster_addr is empty.
But if there is a cluster addr, the local_connection is never added to fast dispatch.

ECSubRead is sent to itself via send_message_osd_cluster, so it triggers this bug.

I don't know about hb_back/front_server_messenger, but they are handled in
_send_boot like cluster_messenger, so I modified those as well.

Signed-off-by: Ma Jianpeng jianpeng...@intel.com
---
 src/osd/OSD.cc | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 52a3839..75b294b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -3852,29 +3852,37 @@ void OSD::_send_boot()
 {
   dout(10) << "_send_boot" << dendl;
   entity_addr_t cluster_addr = cluster_messenger->get_myaddr();
+  Connection *local_connection = cluster_messenger->get_loopback_connection().get();
   if (cluster_addr.is_blank_ip()) {
     int port = cluster_addr.get_port();
     cluster_addr = client_messenger->get_myaddr();
     cluster_addr.set_port(port);
     cluster_messenger->set_addr_unknowns(cluster_addr);
     dout(10) << " assuming cluster_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    cluster_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_back_addr = hb_back_server_messenger->get_myaddr();
+  local_connection = hb_back_server_messenger->get_loopback_connection().get();
   if (hb_back_addr.is_blank_ip()) {
     int port = hb_back_addr.get_port();
     hb_back_addr = cluster_addr;
     hb_back_addr.set_port(port);
     hb_back_server_messenger->set_addr_unknowns(hb_back_addr);
     dout(10) << " assuming hb_back_addr ip matches cluster_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    hb_back_server_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_front_addr = hb_front_server_messenger->get_myaddr();
+  local_connection = hb_front_server_messenger->get_loopback_connection().get();
   if (hb_front_addr.is_blank_ip()) {
     int port = hb_front_addr.get_port();
     hb_front_addr = client_messenger->get_myaddr();
     hb_front_addr.set_port(port);
     hb_front_server_messenger->set_addr_unknowns(hb_front_addr);
     dout(10) << " assuming hb_front_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    hb_front_server_messenger->ms_deliver_handle_fast_connect(local_connection);
 
   MOSDBoot *mboot = new MOSDBoot(superblock, service.get_boot_epoch(),
                                  hb_back_addr, hb_front_addr, cluster_addr);
--
1.9.1



RE: [PATCH] ReplicatedPG: For async-read, set the real result after completing read.

2014-07-13 Thread Ma, Jianpeng
Hi Sam,
   I'm a newbie and I don't have my own Ceph repo on GitHub, so I can't send a
pull request.

Jianpeng Ma
Thanks!

-Original Message-
From: Samuel Just [mailto:sam.j...@inktank.com] 
Sent: Saturday, July 12, 2014 1:21 AM
To: Ma, Jianpeng
Cc: ceph-devel@vger.kernel.org
Subject: Re: [PATCH] ReplicatedPG: For async-read, set the real result after 
completing read.

Looks about right, file a pull request?
-Sam

On Thu, Jul 10, 2014 at 2:23 AM, Ma, Jianpeng jianpeng...@intel.com wrote:
 When reading an object from replicated pool, ceph uses sync mode, so it can 
 set the results in execute_ctx correctly.
 However, For the async-read in EC Pool, current code didn't set the real 
 results after read in complete_read_ctx.

 Signed-off-by: Ma Jianpeng jianpeng...@intel.com
 ---
  src/osd/ReplicatedPG.cc | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index 
 cab3fde..ac0e8ad 100644
 --- a/src/osd/ReplicatedPG.cc
 +++ b/src/osd/ReplicatedPG.cc
 @@ -5318,7 +5318,7 @@ void ReplicatedPG::complete_read_ctx(int result, 
 OpContext *ctx)
  // on ENOENT, set a floor for what the next user version will be.
  reply-set_enoent_reply_versions(info.last_update, 
 info.last_user_version);
}
 -
 +  reply-set_result(result);
reply-add_flags(CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK);
osd-send_message_osd_client(reply, m-get_connection());
close_op_ctx(ctx, 0);
 --
 1.9.1


[PATCH] ReplicatedPG: For async-read, set the real result after completing read.

2014-07-10 Thread Ma, Jianpeng
When reading an object from a replicated pool, Ceph uses sync mode, so it can set
the result in execute_ctx correctly.
However, for the async read in an EC pool, the current code doesn't set the real
result after the read completes in complete_read_ctx.

Signed-off-by: Ma Jianpeng jianpeng...@intel.com
---
 src/osd/ReplicatedPG.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index cab3fde..ac0e8ad 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -5318,7 +5318,7 @@ void ReplicatedPG::complete_read_ctx(int result, OpContext *ctx)
     // on ENOENT, set a floor for what the next user version will be.
     reply->set_enoent_reply_versions(info.last_update, info.last_user_version);
   }
-
+  reply->set_result(result);
   reply->add_flags(CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK);
   osd->send_message_osd_client(reply, m->get_connection());
   close_op_ctx(ctx, 0);
--
1.9.1
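
As a side note, a toy stand-alone illustration (not Ceph code) of why the result
has to be set in the completion path for an asynchronous read:

// With an async read the reply is assembled in the completion callback, so the
// callback must copy the actual read result into it -- otherwise the client
// only ever sees the default the reply was constructed with, even on failure.
#include <functional>
#include <iostream>

struct Reply {
  int result;
  size_t bytes;
  Reply() : result(0), bytes(0) {}
};

void complete_read(int result, size_t bytes,
                   const std::function<void(const Reply &)> &send) {
  Reply reply;
  reply.bytes = bytes;
  reply.result = result;  // conceptually the one-line fix: propagate the result
  send(reply);
}

int main() {
  complete_read(-5 /* pretend the read failed, e.g. EIO */, 0,
                [](const Reply &r) { std::cout << "client sees result " << r.result << "\n"; });
  return 0;
}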


[PATCH] rbd.cc: Check io-size to avoid a floating point exception.

2014-06-12 Thread Ma, Jianpeng
In do_bench_write, if io_size is zero it causes a floating point exception.

Signed-off-by: Jianpeng Ma jianpeng...@intel.com
---
 src/rbd.cc | 8 
 1 file changed, 8 insertions(+)

diff --git a/src/rbd.cc b/src/rbd.cc
index d6658e3..c068ed6 100644
--- a/src/rbd.cc
+++ b/src/rbd.cc
@@ -2038,6 +2038,14 @@ int main(int argc, const char **argv)
         return EXIT_FAILURE;
       }
     } else if (ceph_argparse_withlonglong(args, i, &bench_io_size, &err, "--io-size", (char*)NULL)) {
+      if (!err.str().empty()) {
+        cerr << "rbd: " << err.str() << std::endl;
+        return EXIT_FAILURE;
+      }
+      if (bench_io_size == 0) {
+        cerr << "rbd: io-size must be > 0" << std::endl;
+        return EXIT_FAILURE;
+      }
     } else if (ceph_argparse_withlonglong(args, i, &bench_io_threads, &err, "--io-threads", (char*)NULL)) {
     } else if (ceph_argparse_withlonglong(args, i, &bench_bytes, &err, "--io-total", (char*)NULL)) {
     } else if (ceph_argparse_witharg(args, i, &bench_pattern, &err, "--io-pattern", (char*)NULL)) {
-- 
1.9.1
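
For context, a minimal stand-alone reduction (hypothetical do_bench_write, not
the rbd tool itself) of why an unchecked io-size of 0 crashes: integer division
by zero raises SIGFPE, reported as a floating point exception, which the added
argument check prevents:

#include <cstdint>
#include <cstdio>

int do_bench_write(uint64_t io_bytes, uint64_t io_size) {
  if (io_size == 0) {                  // the guard the patch adds at argument parsing
    fprintf(stderr, "rbd: io-size must be > 0\n");
    return 1;
  }
  uint64_t ios = io_bytes / io_size;   // would raise SIGFPE if io_size were 0
  printf("issuing %llu ios of %llu bytes\n",
         (unsigned long long)ios, (unsigned long long)io_size);
  return 0;
}

int main() {
  do_bench_write(1 << 20, 4096);       // fine: 256 ios
  return do_bench_write(1 << 20, 0);   // rejected up front instead of crashing
}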