RE: format 2TB rbd device is too slow
Ilya, I modified the kernel code. The patch looks like this:

[root@dev linux]# git diff
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bc67a93..e4c4ea9 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -55,7 +55,7 @@
  * universally 512 bytes.  These symbols are just slightly more
  * meaningful than the bare numbers they represent.
  */
-#define SECTOR_SHIFT    9
+#define SECTOR_SHIFT    12
 #define SECTOR_SIZE     (1ULL << SECTOR_SHIFT)

 /*
@@ -3811,6 +3811,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)

         blk_queue_io_min(q, segment_size);
         blk_queue_io_opt(q, segment_size);
+        blk_queue_physical_block_size(q, SECTOR_SIZE);

         /* enable the discard support */
         queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
         q->limits.discard_granularity = segment_size;

I don't know exactly why this decreases the time. For the -b option of mkfs.xfs, I tried 65536, but it didn't change anything.

Thanks!

-----Original Message-----
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: Friday, August 28, 2015 5:23 PM
To: Ma, Jianpeng
Cc: huang jun; Haomai Wang; ceph-devel
Subject: Re: format 2TB rbd device is too slow

On Fri, Aug 28, 2015 at 10:36 AM, Ma, Jianpeng <jianpeng...@intel.com> wrote:

Hi Ilya,
We can change the sector size from 512 to 4096. This can reduce the number of writes. I did a simple test: for 900G, mkfs.xfs -f
Default: 1m10s
Physical sector size = 4096: 0m10s
But if we change the sector size, the rbd metadata needs to record it.

What exactly do you mean by changing the sector size? xfs sector size or physical block size of the rbd device? If the latter, meaning mkfs.xfs -s size=4096, I fail to see how it can result in the above numbers. If the former, you have to patch the kernel and I fail to see how it can result in such an improvement too. I probably didn't have enough coffee, please elaborate.

Thanks,

                Ilya
RE: format 2TB rbd device is too slow
Hi Ilya,
We can change the sector size from 512 to 4096. This can reduce the number of writes. I did a simple test: for 900G, mkfs.xfs -f
Default: 1m10s
Physical sector size = 4096: 0m10s
But if we change the sector size, the rbd metadata needs to record it.

Thanks!
Jianpeng

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of huang jun
Sent: Thursday, August 27, 2015 8:44 AM
To: Ilya Dryomov
Cc: Haomai Wang; ceph-devel
Subject: Re: format 2TB rbd device is too slow

hi, Ilya

2015-08-26 23:56 GMT+08:00 Ilya Dryomov <idryo...@gmail.com>:

On Wed, Aug 26, 2015 at 6:22 PM, Haomai Wang <haomaiw...@gmail.com> wrote:

On Wed, Aug 26, 2015 at 11:16 PM, huang jun <hjwsm1...@gmail.com> wrote:

hi, all
We created a 2TB rbd image. After mapping it locally and formatting it to xfs with 'mkfs.xfs /dev/rbd0', it took 318 seconds to finish, but a local physical disk of the same size needs just 6 seconds.

I think librbd has two PRs related to this.

After debugging, we found there are two steps in the rbd module during formatting:
a) send 233093 DELETE requests to osds (number_of_requests = 2TB / 4MB); this step took almost 92 seconds.

I guess this (https://github.com/ceph/ceph/pull/4221/files) may help

It's submitting deletes for non-existent objects, not zeroing. The only thing that will really help here is the addition of rbd object map support to the kernel client. That could happen in 4.4, but 4.5 is a safer bet.

b) send 4238 messages like this: [set-alloc-hint object_size 4194304 write_size 4194304, write 0~512] to osds; that took 227 seconds.

I think kernel rbd also needs to use https://github.com/ceph/ceph/pull/4983/files

set-alloc-hint may be a problem, but I think a bigger problem is the size of the writes. Are all those writes 512 bytes long?

In another test formatting a 2TB rbd device, there are:
2 messages, each writing 131072 bytes
4000 messages, each writing 262144 bytes
112 messages, each writing 4096 bytes
194 messages, each writing 512 bytes

the xfs info:
meta-data=/dev/rbd/rbd/test2t    isize=256    agcount=33, agsize=16382976 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0
data     =                       bsize=4096   blocks=524288000, imaxpct=5
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=256000, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Thanks,

                Ilya

--
thanks
huangjun
RE: [PATCH] fs/ceph: No need get inode of parent in ceph_open.
Thanks very much, Zheng. I'll pay more attention to the format of my patches.

Jianpeng

-----Original Message-----
From: Yan, Zheng [mailto:z...@redhat.com]
Sent: Tuesday, August 18, 2015 10:36 AM
To: Ma, Jianpeng
Cc: Sage Weil; ceph-devel@vger.kernel.org
Subject: Re: [PATCH] fs/ceph: No need get inode of parent in ceph_open.

On Aug 17, 2015, at 16:43, Ma, Jianpeng <jianpeng...@intel.com> wrote:

In ceph_open, we already have the parent inode, so there is no need to get it again.

Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
---
 fs/ceph/file.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8b79d87..5ecd3dc 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -210,10 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
         ihold(inode);

         req->r_num_caps = 1;
-        if (flags & O_CREAT)
-                parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
         err = ceph_mdsc_do_request(mdsc, parent_inode, req);
-        iput(parent_inode);
         if (!err)
                 err = ceph_init_file(inode, file, req->r_fmode);
         ceph_mdsc_put_request(req);
--
2.4.3

I fixed your patches (all tabs were replaced by spaces in your patches) and added them to our testing branch.

Thanks
Yan, Zheng
[PATCH] fs/ceph: No need get inode of parent in ceph_open.
In ceph_open, we already have the parent inode, so there is no need to get it again.

Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
---
 fs/ceph/file.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8b79d87..5ecd3dc 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -210,10 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
         ihold(inode);

         req->r_num_caps = 1;
-        if (flags & O_CREAT)
-                parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
         err = ceph_mdsc_do_request(mdsc, parent_inode, req);
-        iput(parent_inode);
         if (!err)
                 err = ceph_init_file(inode, file, req->r_fmode);
         ceph_mdsc_put_request(req);
--
2.4.3
[PATCH] fs/ceph: cleanup code.
Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
---
 fs/ceph/mds_client.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 6aa07af..0c65d77 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2520,8 +2520,7 @@ out_err:
                 if (err) {
                         req->r_err = err;
                 } else {
-                        req->r_reply = msg;
-                        ceph_msg_get(msg);
+                        req->r_reply = ceph_msg_get(msg);
                         req->r_got_result = true;
                 }
         } else {
--
2.4.3
[PATCH ] fs/ceph: Remove the useless judgement.
If err != 0, it has already been handled above, so this check is redundant.

Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5ecd3dc..17cc357 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -276,7 +276,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
         if (err)
                 goto out_req;

-        if (err == 0 && (flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
+        if ((flags & O_CREAT) && !req->r_reply_info.head->is_dentry)
                 err = ceph_handle_notrace_create(dir, dentry);

         if (d_unhashed(dentry)) {
--
2.4.3
[PATCH] ceph: Use file->f_flags rather than flags to check O_CREAT.
Because O_CREAT has been removed from flags, we should use file->f_flags instead.

Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
---
 fs/ceph/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8b79d87..e1347cf 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -210,7 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
         ihold(inode);

         req->r_num_caps = 1;
-        if (flags & O_CREAT)
+        if (file->f_flags & O_CREAT)
                 parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
         err = ceph_mdsc_do_request(mdsc, parent_inode, req);
         iput(parent_inode);
--
2.4.3
RE: [Questions]Can client know which OSDs are storing the data?
-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Dennis Chen
Sent: Tuesday, January 27, 2015 3:07 PM
To: ceph-devel@vger.kernel.org; Dennis Chen
Subject: [Questions]Can client know which OSDs are storing the data?

Hello Guys,

My question is very rude and direct, a little bit stupid maybe ;-)

Question a: a client writes a file to the cluster (supposing replica = 3), so the data will be stored in 3 OSDs within the cluster. Can I get the information about which OSDs are storing the file data on the client side?

  ceph osd map <poolname> <objectname>

displays the OSDs and the PG which store the object.

Question b: can the object data still be replicated if I store an object from a client with the RADOS API?

Yes, the rados API is also a client.

Thank you guys !

--
Den
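For question b in code form, here is a minimal librados (C++) sketch of storing an object; the pool name "rbd", the object name, and the error handling are illustrative, not taken from the thread. Replication then happens on the OSD side according to the pool's size setting, exactly as for any other client:

  #include <rados/librados.hpp>
  #include <iostream>

  int main() {
    librados::Rados cluster;
    cluster.init(NULL);                       // connect as client.admin
    cluster.conf_read_file(NULL);             // default ceph.conf locations
    if (cluster.connect() < 0) return 1;
    librados::IoCtx io;
    if (cluster.ioctx_create("rbd", io) < 0)  // pool name assumed
      return 1;
    librados::bufferlist bl;
    bl.append("hello");
    int r = io.write_full("myobject", bl);    // replicated per the pool's rules
    std::cout << "write_full returned " << r << std::endl;
    cluster.shutdown();
    return 0;
  }

Afterwards, ceph osd map rbd myobject shows the PG and the acting set of OSDs holding the replicas.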
RE: Ceph data consistency
-- Forwarded message --
From: Paweł Sadowski <c...@sadziu.pl>
Date: 2014-12-30 21:40 GMT+08:00
Subject: Re: Ceph data consistency
To: Vijayendra Shamanna <vijayendra.shama...@sandisk.com>, ceph-devel@vger.kernel.org

On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:

Hi,

There is a sync thread (sync_entry in FileStore.cc) which triggers periodically and executes sync_filesystem() to ensure that the data is consistent. The journal entries are trimmed only after a successful sync_filesystem() call.

sync_filesystem() always returns zero, so the journal will be trimmed. Executing sync()/syncfs() with dirty data in the disk buffers will result in data loss (lost page write due to I/O error).

Hi Sage:
From the git log, I see that at first sync_filesystem() returned the result of syncfs(). But in commit 808c644248e486f44 ("Improve use of syncfs. Test syncfs return value and fall back to btrfs sync and then sync."), the author hoped that if syncfs() hit an error, sync() could resolve it. Because sync() doesn't return a result, sync_filesystem() only returns zero. But which error can be handled that way? AFAIK, none. I suggest it directly return the result of syncfs().

Jianpeng Ma
Thanks!

I was doing some experiments simulating disk errors using the Device Mapper error target. In this setup the OSD was writing to a broken disk without crashing. Every 5 seconds (filestore_max_sync_interval) the kernel logs that some data was discarded due to an IO error.

Thanks
Viju

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Pawel Sadowski
Sent: Tuesday, December 30, 2014 1:52 PM
To: ceph-devel@vger.kernel.org
Subject: Ceph data consistency

Hi,

On our Ceph cluster from time to time we have some inconsistent PGs (after deep-scrub). We have some issues with disks/sata cables/the lsi controller causing IO errors from time to time (but that's not the point in this case). When an IO error occurs on the OSD journal partition everything works as it should - the OSD is crashed and that's ok - Ceph will handle that. But when an IO error occurs on the OSD data partition during journal flush, the OSD continues to work.

After calling *writev* (in buffer::list::write_fd) the OSD does check the return code from this call but does NOT verify whether the write reached the disk (the data is still only in memory and there is no fsync). That way the OSD thinks the data has been stored on disk but it might be discarded (during sync the dirty page will be reclaimed and you'll see "lost page write due to I/O error" in dmesg). Since there is no checksumming of data I just wanted to make sure that this is by design. Maybe there is a way to tell the OSD to call fsync after write and have data consistent?

--
PS
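For concreteness, the suggested change amounts to something like this sketch (the function name and shape here are illustrative; the real code lives in FileStore.cc):

  #include <unistd.h>
  #include <cerrno>

  // Propagate syncfs(2)'s result instead of discarding it, so a failed
  // flush (e.g. lost page writes) is not silently treated as success.
  int sync_filesystem_checked(int fd)
  {
    if (::syncfs(fd) < 0)
      return -errno;   // e.g. -EIO
    return 0;
  }

The caller would then only trim the journal when this returns 0.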
RE: Ceph data consistency
Hi Sage,
The pull request is https://github.com/ceph/ceph/pull/3305.

Thanks!
Jianpeng Ma

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, January 7, 2015 10:18 AM
To: Ma, Jianpeng
Cc: c...@sadziu.pl; vijayendra.shama...@sandisk.com; ceph-devel@vger.kernel.org
Subject: RE: Ceph data consistency

On Wed, 7 Jan 2015, Ma, Jianpeng wrote:

-- Forwarded message --
From: Paweł Sadowski <c...@sadziu.pl>
Date: 2014-12-30 21:40 GMT+08:00
Subject: Re: Ceph data consistency
To: Vijayendra Shamanna <vijayendra.shama...@sandisk.com>, ceph-devel@vger.kernel.org

On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:

Hi,

There is a sync thread (sync_entry in FileStore.cc) which triggers periodically and executes sync_filesystem() to ensure that the data is consistent. The journal entries are trimmed only after a successful sync_filesystem() call.

sync_filesystem() always returns zero, so the journal will be trimmed. Executing sync()/syncfs() with dirty data in the disk buffers will result in data loss (lost page write due to I/O error).

Hi Sage:
From the git log, I see that at first sync_filesystem() returned the result of syncfs(). But in commit 808c644248e486f44 ("Improve use of syncfs. Test syncfs return value and fall back to btrfs sync and then sync."), the author hoped that if syncfs() hit an error, sync() could resolve it. Because sync() doesn't return a result, sync_filesystem() only returns zero. But which error can be handled that way? AFAIK, none. I suggest it directly return the result of syncfs().

Yeah, that sounds right!

sage

Jianpeng Ma
Thanks!

I was doing some experiments simulating disk errors using the Device Mapper error target. In this setup the OSD was writing to a broken disk without crashing. Every 5 seconds (filestore_max_sync_interval) the kernel logs that some data was discarded due to an IO error.

Thanks
Viju

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Pawel Sadowski
Sent: Tuesday, December 30, 2014 1:52 PM
To: ceph-devel@vger.kernel.org
Subject: Ceph data consistency

Hi,

On our Ceph cluster from time to time we have some inconsistent PGs (after deep-scrub). We have some issues with disks/sata cables/the lsi controller causing IO errors from time to time (but that's not the point in this case). When an IO error occurs on the OSD journal partition everything works as it should - the OSD is crashed and that's ok - Ceph will handle that. But when an IO error occurs on the OSD data partition during journal flush, the OSD continues to work.

After calling *writev* (in buffer::list::write_fd) the OSD does check the return code from this call but does NOT verify whether the write reached the disk (the data is still only in memory and there is no fsync). That way the OSD thinks the data has been stored on disk but it might be discarded (during sync the dirty page will be reclaimed and you'll see "lost page write due to I/O error" in dmesg). Since there is no checksumming of data I just wanted to make sure that this is by design. Maybe there is a way to tell the OSD to call fsync after write and have data consistent?

--
PS
RE: Latency Improvement Report for ShardedOpWQ
Hi Somnath:
You mentioned: "There is still one global lock we have; this is to protect pg_for_processing() and this we can't get rid of since we need to maintain op order within a pg."
But for most object operations we only maintain per-object order. Why do we need to maintain op order within a PG? Can you explain in detail?

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Sunday, September 28, 2014 5:02 PM
To: Dong Yuan
Cc: ceph-devel
Subject: RE: Latency Improvement Report for ShardedOpWQ

Dong,
This is mostly because of lock contention, maybe. You can tweak the number of shards in the case of the sharded WQ to see whether it improves this number or not. There is still one global lock we have; this is to protect pg_for_processing() and this we can't get rid of since we need to maintain op order within a pg. This could be increasing latency as well. I would suggest you measure this number at different stages within ShardedOpWQ::_process(), like after dequeue from pqueue and after getting the pg lock and popping the ops from pg_for_processing(). Also, keep in mind there is context switching happening and this could be expensive depending on the data copy etc. It's worth trying this experiment by pinning the OSD to actual physical cores, maybe?

Thanks & Regards
Somnath

-----Original Message-----
From: Dong Yuan [mailto:yuandong1...@gmail.com]
Sent: Sunday, September 28, 2014 12:19 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Latency Improvement Report for ShardedOpWQ

Hi Somnath,

I totally agree with you. I read the code about the sharded TP and the new OSD OpWQ. In the new implementation, there is not a single lock for all PGs, but one lock for each subset of PGs (am I right?). It is very useful for reducing lock contention and so increasing parallelism. It is awesome work!

While I am working on the latency of a single IO (mainly 4K random write), I noticed the OpWQ spends about 100+us to transfer an IO from the msg dispatcher to the OpWQ worker thread. Do you have any idea how to reduce that time span?

Thanks for your help.
Dong.

On 28 September 2014 13:46, Somnath Roy <somnath@sandisk.com> wrote:

Hi Dong,
I don't think in the single-client scenario there is much benefit. A single client has a limitation. The benefit with the sharded TP is that a single OSD scales much more with an increasing number of clients since it increases parallelism (by reducing lock contention) at the filestore level.

A quick check could be like this.

1. Create a single node, single OSD cluster and try putting load on it with an increasing number of clients like 1, 3, 5, 8, 10. A small workload served from memory should be ideal.

2. Compare the code with the sharded TP against, say, firefly. You should see that firefly does not scale with an increasing number of clients.

3. Try top -H in the two different cases and you should see more threads working in parallel with the sharded tp than with firefly.

Also, I am sure this latency result will not hold true under high workload; there you should see more contention and as a result more latency.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Dong Yuan
Sent: Saturday, September 27, 2014 8:45 PM
To: ceph-devel
Subject: Latency Improvement Report for ShardedOpWQ

= Test Purpose =

Measure whether and how much the Sharded OpWQ is better than the traditional OpWQ for the random write scene.

= Test Case =

4K Object WriteFull for 10000 times.
= Test Method =

Put the following static probes into the code when running tests to get the time span between enqueue and dequeue of the OpWQ.

Start: PG::enqueue_op, before the osd->op_wq.enqueue call
End: OSD::dequeue_op entry

= Test Result =

Traditional OpWQ: 109us (AVG), 40us (MIN)
ShardedOpWQ: 97us (AVG), 32us (MIN)

= Test Conclusion =

No remarkable improvement in latency.

--
Dong Yuan
Email: yuandong1...@gmail.com
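To make the locking discussion above concrete, here is a minimal sketch of a sharded op queue (illustrative names only, not the actual OSD code): ops for the same PG always hash to the same shard and stay FIFO inside it, which preserves per-PG ordering while letting different shards proceed under independent locks.

  #include <condition_variable>
  #include <deque>
  #include <mutex>
  #include <vector>

  struct Op { unsigned pg; /* payload */ };   // stand-in for OpRequestRef

  class ShardedQueue {
    struct Shard {
      std::mutex lock;
      std::condition_variable cond;
      std::deque<Op> ops;              // FIFO keeps per-PG order in a shard
    };
    std::vector<Shard> shards;
  public:
    explicit ShardedQueue(size_t n) : shards(n) {}
    void enqueue(const Op &op) {
      Shard &s = shards[op.pg % shards.size()];  // same PG -> same shard
      std::lock_guard<std::mutex> g(s.lock);
      s.ops.push_back(op);
      s.cond.notify_one();
    }
    Op dequeue(size_t shard_id) {               // one worker per shard
      Shard &s = shards[shard_id];
      std::unique_lock<std::mutex> g(s.lock);
      s.cond.wait(g, [&s]{ return !s.ops.empty(); });
      Op op = s.ops.front();
      s.ops.pop_front();
      return op;
    }
  };

The global lock Somnath mentions sits on top of this in the real implementation, protecting pg_for_processing() across shards.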
RE: [PATCH 2/3] ec: make use of added aligned buffers
-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Tuesday, September 16, 2014 2:47 PM
To: Ma, Jianpeng
Cc: ceph-devel@vger.kernel.org
Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers

On 16/09/2014 02:08, Ma, Jianpeng wrote:

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, September 16, 2014 8:02 AM
To: Ma, Jianpeng
Cc: Loic Dachary; Janne Grunau; ceph-devel@vger.kernel.org
Subject: RE: [PATCH 2/3] ec: make use of added aligned buffers

On Mon, 15 Sep 2014, Ma, Jianpeng wrote:

If we modify bufferlist::c_str() to bufferlist::c_str(bool align):

  if (align)
    posix_memalign(&data, CEPH_PAGE_SIZE, len);
  else
    /* original code */

Alignment isn't really a bool, it's an int. c_str(int align=1) ?

I mean, if we need aligned memory after bufferlist::c_str(), we can set align = true. Currently, whether bufferlist::c_str() rebuilds depends on the size when the bufferlist has more than one ptr. BTW, we could also change rebuild()/rebuild(ptr): merge the two functions into one rebuild(bool align) and decide from the align parameter whether to allocate aligned memory. Or am I missing something?

I don't think there is a need for c_str(int align). We make every effort to allocate buffers that are properly aligned. If c_str() does not return an aligned buffer, the proper fix is to align the allocated buffer at the source, not to allocate a new aligned buffer and copy the content of the non-aligned buffer into it. Do you see a reason why that would be a bad way to deal with alignment?

We only guarantee memory alignment after c_str() and don't depend on the previous steps. Suppose encoded[i] has multiple ptrs and they are all aligned - how do we then guarantee aligned memory after c_str()? If a bufferlist has more than one ptr, the alignment after c_str() depends on the size of the bufferlist.

Jianpeng

Cheers

Jianpeng

sage

I think this is simple and correct code.

Jianpeng
Thanks!

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
Sent: Tuesday, September 16, 2014 1:20 AM
To: Janne Grunau; ceph-devel@vger.kernel.org
Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers

Hi Janne,

See below:

On 15/09/2014 17:55, Janne Grunau wrote:

Requiring page aligned buffers and realigning the input if necessary creates measurable overhead. ceph_erasure_code_benchmark is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.

Also prevents a misaligned buffer when bufferlist::c_str(bufferlist) has to allocate a new buffer to provide a contiguous one.
See bug #9408

Signed-off-by: Janne Grunau <j...@jannau.net>
---
 src/erasure-code/ErasureCode.cc | 46 +++++++++++++++++++++++-----------------
 src/erasure-code/ErasureCode.h  |  3 ++-
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/src/erasure-code/ErasureCode.cc b/src/erasure-code/ErasureCode.cc
index 5953f49..078f60b 100644
--- a/src/erasure-code/ErasureCode.cc
+++ b/src/erasure-code/ErasureCode.cc
@@ -54,22 +54,38 @@ int ErasureCode::minimum_to_decode_with_cost(const set<int> &want_to_read,
 }

 int ErasureCode::encode_prepare(const bufferlist &raw,
-                                bufferlist *prepared) const
+                                map<int, bufferlist> &encoded) const
 {
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   unsigned blocksize = get_chunk_size(raw.length());
-  unsigned padded_length = blocksize * k;
-  *prepared = raw;
-  if (padded_length - raw.length() > 0) {
-    bufferptr pad(padded_length - raw.length());
-    pad.zero();
-    prepared->push_back(pad);
+  unsigned pad_len = blocksize * k - raw.length();
+
+  bufferlist prepared = raw;
+
+  if (!prepared.is_aligned()) {
+    prepared.rebuild_aligned();
+  }
+
+  for (unsigned int i = 0; i < k - !!pad_len; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.substr_of(prepared, i * blocksize, blocksize);
   }

It is possible for more than one chunk to be padding. It's a border case but... for instance with alignment = 16, k=12 and in of length 1550 you end up with two padding chunks because the blocksize is 144.

+  if (pad_len > 0) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[k - 1] : k - 1;
+    bufferlist &chunk = encoded[chunk_index];
+    bufferptr padded(buffer::create_aligned(blocksize));
+    raw.copy((k - 1) * blocksize, blocksize - pad_len, padded.c_str());
+    padded.zero(blocksize - pad_len, pad_len);
+    chunk.push_back(padded);
   }
-  unsigned coding_length = blocksize * m;
-  bufferptr coding(buffer::create_page_aligned(coding_length));
-  prepared->push_back(coding);
-  prepared->rebuild_page_aligned();
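Loic's border case can be checked numerically with a small self-contained sketch; the rounding here mirrors what a 16-byte-aligned get_chunk_size() would do, which is an assumption about the plugin, not code from the patch:

  #include <cstdio>

  int main() {
    const unsigned k = 12, align = 16, in_len = 1550;
    unsigned per_chunk = (in_len + k - 1) / k;                      // 130
    unsigned blocksize = ((per_chunk + align - 1) / align) * align; // 144
    unsigned pad_len   = blocksize * k - in_len;                    // 178
    // 178 > 144: the padding spills into a second chunk, which the
    // single "if (pad_len > 0)" branch in the patch would mishandle.
    printf("blocksize=%u pad_len=%u chunks_with_padding=%u\n",
           blocksize, pad_len, (pad_len + blocksize - 1) / blocksize);
    return 0;
  }

With k=12 and 1550 input bytes this prints blocksize=144, pad_len=178 and two padded chunks, matching the review comment.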
RE: [PATCH 2/3] ec: make use of added aligned buffers
I see it. Thanks very much!

Jianpeng

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Tuesday, September 16, 2014 3:55 PM
To: Ma, Jianpeng
Cc: ceph-devel@vger.kernel.org
Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers

On 16/09/2014 08:59, Ma, Jianpeng wrote:

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Tuesday, September 16, 2014 2:47 PM
To: Ma, Jianpeng
Cc: ceph-devel@vger.kernel.org
Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers

On 16/09/2014 02:08, Ma, Jianpeng wrote:

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, September 16, 2014 8:02 AM
To: Ma, Jianpeng
Cc: Loic Dachary; Janne Grunau; ceph-devel@vger.kernel.org
Subject: RE: [PATCH 2/3] ec: make use of added aligned buffers

On Mon, 15 Sep 2014, Ma, Jianpeng wrote:

If we modify bufferlist::c_str() to bufferlist::c_str(bool align):

  if (align)
    posix_memalign(&data, CEPH_PAGE_SIZE, len);
  else
    /* original code */

Alignment isn't really a bool, it's an int. c_str(int align=1) ?

I mean, if we need aligned memory after bufferlist::c_str(), we can set align = true. Currently, whether bufferlist::c_str() rebuilds depends on the size when the bufferlist has more than one ptr. BTW, we could also change rebuild()/rebuild(ptr): merge the two functions into one rebuild(bool align) and decide from the align parameter whether to allocate aligned memory. Or am I missing something?

I don't think there is a need for c_str(int align). We make every effort to allocate buffers that are properly aligned. If c_str() does not return an aligned buffer, the proper fix is to align the allocated buffer at the source, not to allocate a new aligned buffer and copy the content of the non-aligned buffer into it. Do you see a reason why that would be a bad way to deal with alignment?

We only guarantee memory alignment after c_str() and don't depend on the previous steps.

This is a benefit indeed: less room for error in general.

Suppose encoded[i] has multiple ptrs and they are all aligned - how do we then guarantee aligned memory after c_str()? If a bufferlist has more than one ptr, the alignment after c_str() depends on the size of the bufferlist.

The ErasureCode::encode_prepare prepares the allocated buffer so that:
*) it holds enough room to contain all chunks
*) c_str() on a chunk boundary will return a pointer that does not require allocating memory

The ErasureCodeJerasure::get_chunk_size function determines the size of the buffer allocated by encode_prepare and further guarantees that each chunk is:
*) size aligned on 16 bytes
*) memory aligned on 16 bytes so that SIMD instructions can use it without moving it

In other words, if c_str() had to re-allocate the buffer because it is not aligned, it would mean that the allocation of the buffer was not done as it should be.

Cheers

Jianpeng

Cheers

Jianpeng

sage

I think this is simple and correct code.

Jianpeng
Thanks!

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
Sent: Tuesday, September 16, 2014 1:20 AM
To: Janne Grunau; ceph-devel@vger.kernel.org
Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers

Hi Janne,

See below:

On 15/09/2014 17:55, Janne Grunau wrote:

Requiring page aligned buffers and realigning the input if necessary creates measurable overhead. ceph_erasure_code_benchmark is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.

Also prevents a misaligned buffer when bufferlist::c_str(bufferlist) has to allocate a new buffer to provide a contiguous one.
See bug #9408

Signed-off-by: Janne Grunau <j...@jannau.net>
---
 src/erasure-code/ErasureCode.cc | 46 +++++++++++++++++++++++-----------------
 src/erasure-code/ErasureCode.h  |  3 ++-
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/src/erasure-code/ErasureCode.cc b/src/erasure-code/ErasureCode.cc
index 5953f49..078f60b 100644
--- a/src/erasure-code/ErasureCode.cc
+++ b/src/erasure-code/ErasureCode.cc
@@ -54,22 +54,38 @@ int ErasureCode::minimum_to_decode_with_cost(const set<int> &want_to_read,
 }

 int ErasureCode::encode_prepare(const bufferlist &raw,
-                                bufferlist *prepared) const
+                                map<int, bufferlist> &encoded) const
 {
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   unsigned blocksize = get_chunk_size(raw.length());
-  unsigned padded_length = blocksize * k;
-  *prepared = raw;
-  if (padded_length - raw.length() > 0) {
-    bufferptr pad(padded_length - raw.length());
-    pad.zero();
-    prepared->push_back(pad);
+  unsigned pad_len = blocksize * k - raw.length();
+
+  bufferlist prepared = raw;
RE: [PATCH 2/3] ec: make use of added aligned buffers
If we modify bufferlist::c_str() to bufferlist::c_str(bool align):

  if (align)
    posix_memalign(&data, CEPH_PAGE_SIZE, len);
  else
    /* original code */

I think this is simple and correct code.

Jianpeng
Thanks!

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
Sent: Tuesday, September 16, 2014 1:20 AM
To: Janne Grunau; ceph-devel@vger.kernel.org
Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers

Hi Janne,

See below:

On 15/09/2014 17:55, Janne Grunau wrote:

Requiring page aligned buffers and realigning the input if necessary creates measurable overhead. ceph_erasure_code_benchmark is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.

Also prevents a misaligned buffer when bufferlist::c_str(bufferlist) has to allocate a new buffer to provide a contiguous one.

See bug #9408

Signed-off-by: Janne Grunau <j...@jannau.net>
---
 src/erasure-code/ErasureCode.cc | 46 +++++++++++++++++++++++-----------------
 src/erasure-code/ErasureCode.h  |  3 ++-
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/src/erasure-code/ErasureCode.cc b/src/erasure-code/ErasureCode.cc
index 5953f49..078f60b 100644
--- a/src/erasure-code/ErasureCode.cc
+++ b/src/erasure-code/ErasureCode.cc
@@ -54,22 +54,38 @@ int ErasureCode::minimum_to_decode_with_cost(const set<int> &want_to_read,
 }

 int ErasureCode::encode_prepare(const bufferlist &raw,
-                                bufferlist *prepared) const
+                                map<int, bufferlist> &encoded) const
 {
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   unsigned blocksize = get_chunk_size(raw.length());
-  unsigned padded_length = blocksize * k;
-  *prepared = raw;
-  if (padded_length - raw.length() > 0) {
-    bufferptr pad(padded_length - raw.length());
-    pad.zero();
-    prepared->push_back(pad);
+  unsigned pad_len = blocksize * k - raw.length();
+
+  bufferlist prepared = raw;
+
+  if (!prepared.is_aligned()) {
+    prepared.rebuild_aligned();
+  }
+
+  for (unsigned int i = 0; i < k - !!pad_len; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.substr_of(prepared, i * blocksize, blocksize);
   }

It is possible for more than one chunk to be padding. It's a border case but... for instance with alignment = 16, k=12 and in of length 1550 you end up with two padding chunks because the blocksize is 144.

+  if (pad_len > 0) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[k - 1] : k - 1;
+    bufferlist &chunk = encoded[chunk_index];
+    bufferptr padded(buffer::create_aligned(blocksize));
+    raw.copy((k - 1) * blocksize, blocksize - pad_len, padded.c_str());
+    padded.zero(blocksize - pad_len, pad_len);
+    chunk.push_back(padded);
   }
-  unsigned coding_length = blocksize * m;
-  bufferptr coding(buffer::create_page_aligned(coding_length));
-  prepared->push_back(coding);
-  prepared->rebuild_page_aligned();
+  for (unsigned int i = k; i < k + m; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.push_back(buffer::create_aligned(blocksize));
+  }
+  return 0;
 }

@@ -80,15 +96,9 @@ int ErasureCode::encode(const set<int> &want_to_encode,
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   bufferlist out;
-  int err = encode_prepare(in, out);
+  int err = encode_prepare(in, *encoded);
   if (err)
     return err;
-  unsigned blocksize = get_chunk_size(in.length());
-  for (unsigned int i = 0; i < k + m; i++) {
-    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
-    bufferlist &chunk = (*encoded)[chunk_index];
-    chunk.substr_of(out, i * blocksize, blocksize);
-  }
   encode_chunks(want_to_encode, encoded);
   for (unsigned int i = 0; i < k + m; i++) {
     if (want_to_encode.count(i) == 0)
diff --git a/src/erasure-code/ErasureCode.h b/src/erasure-code/ErasureCode.h
index 7aaea95..62aa383 100644
--- a/src/erasure-code/ErasureCode.h
+++ b/src/erasure-code/ErasureCode.h
@@ -46,7 +46,8 @@ namespace ceph {
                const map<int, int> &available,
                set<int> *minimum);

-    int encode_prepare(const bufferlist &raw, bufferlist *prepared) const;
+    int encode_prepare(const bufferlist &raw,
+                       map<int, bufferlist> &encoded) const;

     virtual int encode(const set<int> &want_to_encode,
                        const bufferlist &in,

--
Loïc Dachary, Artisan Logiciel Libre
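The c_str(bool align) idea boils down to the following sketch (entirely illustrative; the real bufferlist API and CEPH_PAGE_SIZE live in src/common/buffer.*):

  #include <cstdlib>
  #include <cstring>

  static const size_t PAGE = 4096;   // stand-in for CEPH_PAGE_SIZE

  // Allocate page-aligned memory and copy the data into it; the caller
  // frees the result with free(). This is the copy Loic argues should be
  // avoided by allocating aligned buffers at the source instead.
  char *aligned_copy(const char *data, size_t len)
  {
    void *p = NULL;
    if (posix_memalign(&p, PAGE, len) != 0)
      return NULL;
    memcpy(p, data, len);
    return static_cast<char *>(p);
  }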
RE: [PATCH 2/3] ec: make use of added aligned buffers
-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, September 16, 2014 8:02 AM
To: Ma, Jianpeng
Cc: Loic Dachary; Janne Grunau; ceph-devel@vger.kernel.org
Subject: RE: [PATCH 2/3] ec: make use of added aligned buffers

On Mon, 15 Sep 2014, Ma, Jianpeng wrote:

If we modify bufferlist::c_str() to bufferlist::c_str(bool align):

  if (align)
    posix_memalign(&data, CEPH_PAGE_SIZE, len);
  else
    /* original code */

Alignment isn't really a bool, it's an int. c_str(int align=1) ?

I mean, if we need aligned memory after bufferlist::c_str(), we can set align = true. Currently, whether bufferlist::c_str() rebuilds depends on the size when the bufferlist has more than one ptr. BTW, we could also change rebuild()/rebuild(ptr): merge the two functions into one rebuild(bool align) and decide from the align parameter whether to allocate aligned memory. Or am I missing something?

Jianpeng

sage

I think this is simple and correct code.

Jianpeng
Thanks!

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
Sent: Tuesday, September 16, 2014 1:20 AM
To: Janne Grunau; ceph-devel@vger.kernel.org
Subject: Re: [PATCH 2/3] ec: make use of added aligned buffers

Hi Janne,

See below:

On 15/09/2014 17:55, Janne Grunau wrote:

Requiring page aligned buffers and realigning the input if necessary creates measurable overhead. ceph_erasure_code_benchmark is ~30% faster with this change for technique=reed_sol_van,k=2,m=1.

Also prevents a misaligned buffer when bufferlist::c_str(bufferlist) has to allocate a new buffer to provide a contiguous one.

See bug #9408

Signed-off-by: Janne Grunau <j...@jannau.net>
---
 src/erasure-code/ErasureCode.cc | 46 +++++++++++++++++++++++-----------------
 src/erasure-code/ErasureCode.h  |  3 ++-
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/src/erasure-code/ErasureCode.cc b/src/erasure-code/ErasureCode.cc
index 5953f49..078f60b 100644
--- a/src/erasure-code/ErasureCode.cc
+++ b/src/erasure-code/ErasureCode.cc
@@ -54,22 +54,38 @@ int ErasureCode::minimum_to_decode_with_cost(const set<int> &want_to_read,
 }

 int ErasureCode::encode_prepare(const bufferlist &raw,
-                                bufferlist *prepared) const
+                                map<int, bufferlist> &encoded) const
 {
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   unsigned blocksize = get_chunk_size(raw.length());
-  unsigned padded_length = blocksize * k;
-  *prepared = raw;
-  if (padded_length - raw.length() > 0) {
-    bufferptr pad(padded_length - raw.length());
-    pad.zero();
-    prepared->push_back(pad);
+  unsigned pad_len = blocksize * k - raw.length();
+
+  bufferlist prepared = raw;
+
+  if (!prepared.is_aligned()) {
+    prepared.rebuild_aligned();
+  }
+
+  for (unsigned int i = 0; i < k - !!pad_len; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.substr_of(prepared, i * blocksize, blocksize);
   }

It is possible for more than one chunk to be padding. It's a border case but... for instance with alignment = 16, k=12 and in of length 1550 you end up with two padding chunks because the blocksize is 144.

+  if (pad_len > 0) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[k - 1] : k - 1;
+    bufferlist &chunk = encoded[chunk_index];
+    bufferptr padded(buffer::create_aligned(blocksize));
+    raw.copy((k - 1) * blocksize, blocksize - pad_len, padded.c_str());
+    padded.zero(blocksize - pad_len, pad_len);
+    chunk.push_back(padded);
   }
-  unsigned coding_length = blocksize * m;
-  bufferptr coding(buffer::create_page_aligned(coding_length));
-  prepared->push_back(coding);
-  prepared->rebuild_page_aligned();
+  for (unsigned int i = k; i < k + m; i++) {
+    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
+    bufferlist &chunk = encoded[chunk_index];
+    chunk.push_back(buffer::create_aligned(blocksize));
+  }
+  return 0;
 }

@@ -80,15 +96,9 @@ int ErasureCode::encode(const set<int> &want_to_encode,
   unsigned int k = get_data_chunk_count();
   unsigned int m = get_chunk_count() - k;
   bufferlist out;
-  int err = encode_prepare(in, out);
+  int err = encode_prepare(in, *encoded);
   if (err)
     return err;
-  unsigned blocksize = get_chunk_size(in.length());
-  for (unsigned int i = 0; i < k + m; i++) {
-    int chunk_index = chunk_mapping.size() > 0 ? chunk_mapping[i] : i;
-    bufferlist &chunk = (*encoded)[chunk_index];
-    chunk.substr_of(out, i * blocksize, blocksize);
-  }
   encode_chunks(want_to_encode, encoded);
   for (unsigned int i = 0; i < k + m; i++) {
     if (want_to_encode.count(i) == 0
RE: Deadlock in ceph journal
Hi all,
Over the weekend I read the kernel code about aio and direct I/O. close() doesn't wait for aio to complete, but fsync() waits for all aio to complete. Mark tested this patch (which uses fsync() in write_thread_entry) and the result looks good. I want to revert the patch that avoids aio when closing the journal, and use fsync() instead. It makes the code simpler. How about this?

Thanks!
Jianpeng

ceph-devel@vger.kernel.org
Subject: Re: Deadlock in ceph journal

On 23/08/14 10:22, Somnath Roy wrote:

I think it is using direct io for non-aio mode as well.

Thanks & Regards
Somnath

One thing that does still concern me - if I understand what is happening here correctly: we write to the journal using aio until we want to stop doing writes (presumably pre closing it), then use normal io to write at that point. Given that we appear to be using direct io whenever we use aio, does this mean we end up mixing direct and buffered io to the journal [1] (or is the normal, i.e. non-aio, write still using direct io)?

Thanks Somnath, I think you are correct (I missed the bit in FileJournal::_open that seems to cover this):

  if (forwrite) {
    flags = O_RDWR;
    if (directio)
      flags |= O_DIRECT | O_DSYNC;

i.e. the journal is opened with DIRECT, so all writes (async or not) will be direct.

Cheers
Mark
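In code, the proposal is essentially this sketch (illustrative; the real change would be against FileJournal's write_thread_entry):

  #include <unistd.h>

  // Per the kernel behaviour described above: fsync() blocks until
  // outstanding aio on the inode completes, while close() does not,
  // so drain before closing the journal fd.
  int drain_and_close(int fd)
  {
    int r = ::fsync(fd);
    ::close(fd);
    return r;
  }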
RE: Deadlock in ceph journal
The attachment is the patch.

Jianpeng

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, August 25, 2014 10:15 AM
To: Ma, Jianpeng
Cc: Mark Kirkwood; Somnath Roy; Samuel Just (sam.j...@inktank.com); ceph-devel@vger.kernel.org
Subject: RE: Deadlock in ceph journal

Sounds good. Can you send a patch?

sage

On Mon, 25 Aug 2014, Ma, Jianpeng wrote:

Hi all,
Over the weekend I read the kernel code about aio and direct I/O. close() doesn't wait for aio to complete, but fsync() waits for all aio to complete. Mark tested this patch (which uses fsync() in write_thread_entry) and the result looks good. I want to revert the patch that avoids aio when closing the journal, and use fsync() instead. It makes the code simpler. How about this?

Thanks!
Jianpeng

ceph-devel@vger.kernel.org
Subject: Re: Deadlock in ceph journal

On 23/08/14 10:22, Somnath Roy wrote:

I think it is using direct io for non-aio mode as well.

Thanks & Regards
Somnath

One thing that does still concern me - if I understand what is happening here correctly: we write to the journal using aio until we want to stop doing writes (presumably pre closing it), then use normal io to write at that point. Given that we appear to be using direct io whenever we use aio, does this mean we end up mixing direct and buffered io to the journal [1] (or is the normal, i.e. non-aio, write still using direct io)?

Thanks Somnath, I think you are correct (I missed the bit in FileJournal::_open that seems to cover this):

  if (forwrite) {
    flags = O_RDWR;
    if (directio)
      flags |= O_DIRECT | O_DSYNC;

i.e. the journal is opened with DIRECT, so all writes (async or not) will be direct.

Cheers
Mark

[Attachment: 0001-os-FileJournal-Before-write_thread_entry-exit-using-.patch]
RE: Deadlock in ceph journal
Yes. Maybe for io_submit it must be paired with io_getevents; otherwise the result is undefined. If stop_write == true, we don't use aio. How about this way?

Jianpeng

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, August 20, 2014 11:34 PM
To: Somnath Roy
Cc: Samuel Just (sam.j...@inktank.com); ceph-devel@vger.kernel.org; Mark Kirkwood; Ma, Jianpeng
Subject: RE: Deadlock in ceph journal

I suspect what is really needed is a drain_aio() function that will wait for all pending aio ops to complete on shutdown. What happens to those IOs if the process exits while they are in flight is probably undefined; we should just avoid doing that.

sage

On Wed, 20 Aug 2014, Somnath Roy wrote:

I will also take the patch and test it out.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, August 19, 2014 9:51 PM
To: Somnath Roy
Cc: Samuel Just (sam.j...@inktank.com); ceph-devel@vger.kernel.org; Mark Kirkwood; jianpeng...@intel.com
Subject: RE: Deadlock in ceph journal

On Wed, 20 Aug 2014, Somnath Roy wrote:

Thanks Sage! So, the latest master should have the fix, right?

The original patch that caused the regression is reverted, but we'd like to reapply it if we sort out the issues. wip-filejournal has the offending patch and your fix.. but I'm eager to hear whether Jianpeng and Mark can confirm it's complete/correct or whether there is still a problem.

sage

Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, August 19, 2014 8:55 PM
To: Somnath Roy
Cc: Samuel Just (sam.j...@inktank.com); ceph-devel@vger.kernel.org; Mark Kirkwood; jianpeng...@intel.com
Subject: RE: Deadlock in ceph journal

[Copying ceph-devel, dropping ceph-users]

Yeah, that looks like a bug. I pushed wip-filejournal that reapplies Jianpeng's original patch and this one. I'm not certain about the last other suggested fix, though, but I'm hoping that this fix explains the strange behavior Jianpeng and Mark have seen?

sage

On Wed, 20 Aug 2014, Somnath Roy wrote:

I think this is the issue.. http://tracker.ceph.com/issues/9073

Thanks & Regards
Somnath

From: Somnath Roy
Sent: Tuesday, August 19, 2014 6:25 PM
To: Sage Weil (s...@inktank.com); Samuel Just (sam.j...@inktank.com)
Cc: ceph-us...@lists.ceph.com
Subject: Deadlock in ceph journal

Hi Sage/Sam,

During our testing we found a potential deadlock scenario in the filestore journal code base. This is happening for two reasons.

1. The code is not signaling aio_cond from check_aio_completion() in the case seq == 0.

2. The following change in write_thread_entry() allows a very first header write with seq == 0:

  if (writeq.empty() && !must_write_header) {

Now, during ceph-deploy activate, this is what happens.

1. The very first write of the header with seq = 0 is issued and it is waiting for aio completion. So, aio_num = 1.

2. The superblock write came in, got into the while (aio_num > 0) block of write_thread_entry(), and is waiting on aio_cond.

3. The seq = 0 aio completed but did not set completed_something = true, and as a result aio_cond is not signaled.

4. write_thread_entry() gets into a deadlock.

This is a timing problem: if the header write returns before the superblock write this will not happen, and it happens only in the case of a block journal device (aio is enabled).

Here is the log snippet we are getting.
2014-08-19 12:59:10.029363 7f60fa33b700 10 journal write_thread_entry start 2014-08-19 12:59:10.029395 7f60fa33b700 20 journal prepare_multi_write queue_pos now 4096 2014-08-19 12:59:10.029427 7f60fa33b700 15 journal do_aio_write writing 4096~0 + header 2014-08-19 12:59:10.029439 7f60fa33b700 20 journal write_aio_bl 0~4096 seq 0 2014-08-19 12:59:10.029442 7f60f9339700 10 journal write_finish_thread_entry enter 2014-08-19 12:59:10.029466 7f60fa33b700 20 journal write_aio_bl .. 0~4096 in 1 2014-08-19 12:59:10.029498 7f60fa33b700 20 journal write_aio_bl 4096~0 seq 0 2014-08-19 12:59:10.029505 7f60fa33b700 5 journal put_throttle finished 0 ops and 0 bytes, now 0 ops and 0 bytes 2014-08-19 12:59:10.029510 7f60fa33b700 20 journal write_thread_entry going to sleep 2014-08-19 12:59:10.029538 7f60ff178800 10 journal journal_start 2014-08-19 12:59:10.029566 7f60f9339700 20 journal write_finish_thread_entry waiting for aio(s) 2014-08-19 12:59:10.029726 7f60ff178800 15 filestore(/var/lib/ceph/tmp
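The fix implied by the analysis is to make the completion path wake the writer even for the seq == 0 header write. A sketch, with std:: primitives standing in for ceph's own Mutex/Cond and all names illustrative:

  #include <condition_variable>
  #include <mutex>

  void aio_completed(std::mutex &aio_lock,
                     std::condition_variable &aio_cond,
                     bool &completed_something,
                     unsigned long seq)
  {
    (void)seq;  // the buggy code only signaled when seq != 0
    std::lock_guard<std::mutex> l(aio_lock);
    completed_something = true;   // set even for the seq == 0 header write
    aio_cond.notify_all();        // wakes write_thread_entry()'s aio_num wait
  }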
RE: A bug in FileStore::mount and A problem about XFS
I think you have old code; I can't find FS_TYPE_OTHER.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of n...@cs.hku.hk
Sent: Thursday, August 14, 2014 12:08 PM
To: ceph-devel@vger.kernel.org
Subject: A bug in FileStore::mount and A problem about XFS

Dear developers,

I found a bug in the FileStore::mount() function. If we use a raw partition, like /dev/sda1, as a journal disk, it causes an error when setting the parameters m_filestore_max_inline_xattr_size and m_filestore_max_inline_xattrs.

First, when creating the FileStore, it calls mount() and detects fs_type as XFS, so it checks m_filestore_replica_fadvise (true by default) and sets it to false. At the same time, m_fs_type = FS_TYPE_XFS.

However, when creating the journal store, it also enters mount() and detects fs_type. Unfortunately, this time m_filestore_replica_fadvise = false, so it cannot update m_fs_type = FS_TYPE_XFS. Instead, it sets m_fs_type = FS_TYPE_OTHER. This bug is quite easy to repair.

By the way, I want to ask another question about XFS. I find that if I use the ceph cluster heavily for about 2-3 hours, XFS always wakes up a kernel thread (xfsaild) to deal with its inode reads and writes. This impacts performance a lot and makes the cluster's throughput drop heavily. So how can this problem be solved, and why does it occur after 2-3 hours of heavy read and write operations?

Best Regards,
Neal Yao
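If the constants in the report exist in the tree in question, the repair would be to record the detected type independently of the fadvise toggle. A simplified sketch, with hypothetical names following the report (the reply above disputes whether this still applies to current code):

  #include <sys/vfs.h>
  #include <linux/magic.h>   // XFS_SUPER_MAGIC

  enum fs_type_t { FS_TYPE_OTHER, FS_TYPE_XFS };

  void detect_fs(const struct statfs &st,
                 fs_type_t &m_fs_type,
                 bool &m_filestore_replica_fadvise)
  {
    if (st.f_type == XFS_SUPER_MAGIC) {
      m_fs_type = FS_TYPE_XFS;                 // record unconditionally
      if (m_filestore_replica_fadvise)
        m_filestore_replica_fadvise = false;   // XFS tweak, applied separately
    } else {
      m_fs_type = FS_TYPE_OTHER;
    }
  }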
RE: ISA erasure code plugin and cache
-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Loic Dachary
Sent: Monday, August 4, 2014 8:37 PM
To: Andreas Joachim Peters
Cc: Ma, Jianpeng; Ceph Development
Subject: Re: ISA erasure code plugin and cache

On 04/08/2014 14:15, Andreas Joachim Peters wrote:

Hi Loic,

the background relevant to your comments has (unfortunately) never been answered on the mailing list. The cache is written in a way that it is useful for a fixed (k,m) combination and thread-safe. So, if there is one instance of the plugin per pool, it is the right implementation. If there is (as you write) one instance for each PG in any pool, it is sort of 'stupid' because the same encoding table is stored for each PG separately.

I would not call it stupid ;-) Just not as efficient as if the cache was shared by all PGs. Without a cache the decode table has to be calculated for each object in the placement group, and there are a lot of objects. The table may be duplicated hundreds of times, so it has an impact on the memory footprint, but it should not have a visible impact on decode performance. An optimisation of your implementation to save memory would be nice, but it is not critical.

AFAIK, for a recovering PG, all the objects of the PG have the same lost chunks. So only the first object misses the cache; the later ones won't. It only needs one cache entry. Or am I missing something?

Jianpeng Ma

How large are the decode tables ?

So if the final statement is to create one plugin instance per PG, I will change it accordingly and share encoding/decoding tables for a fixed (k,m); if not, it can stay. I just need to know that... this boils down to the fact that encoding/decoding should not be considered 'stateless'.

Cheers Andreas.

From: Loic Dachary [l...@dachary.org]
Sent: 04 August 2014 13:56
To: Andreas Joachim Peters
Cc: Ma, Jianpeng; Ceph Development
Subject: ISA erasure code plugin and cache

Hi,

Here is how I understand the current code:

When an OSD is missing, recovery is required and the primary OSD will collect the available chunks to do so. It will then call the decode method via ECUtil::decode, which is a small wrapper around the corresponding ErasureCodeInterface::decode method.

https://github.com/ceph/ceph/blob/master/src/osd/ECBackend.cc#L361

The ISA plugin will then use the isa_decode method to perform the work

https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCodeIsa.cc#L212

and will be repeatedly called until all objects in the PGs that were relying on the missing OSD are recovered. To avoid computing the decoding table for each object, it is stored in an LRU cache

https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCodeIsa.cc#L480

and copied to the stack if already there:

https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCodeIsa.cc#L433

Each PG has a separate instance of ErasureCodeIsa, obtained when it is created:

https://github.com/ceph/ceph/blob/master/src/osd/PGBackend.cc#L292

It means that the data members of each ErasureCodeIsa are duplicated as many times as there are PGs. If an OSD participates in 200 PGs that belong to an erasure coded pool configured to use ErasureCodeIsa, the data members will be duplicated 200 times.

It is good practice to make the encode/decode methods of ErasureCodeIsa thread safe. In the jerasure plugin these methods have no side effect on the object.
In the isa plugin the LRU cache storing the decode tables is modified by the decode method and guarded by a mutex:

get: https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCodeIsa.cc#L281
put: https://github.com/ceph/ceph/blob/master/src/erasure-code/isa/ErasureCodeIsa.cc#L310

Please correct me if I'm mistaken ;-) I've not reviewed the code yet to try to find problems, I just wanted to make sure I get the intention before doing so.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
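The sharing Andreas proposes could look roughly like this sketch (hypothetical names; the real cache is a per-instance LRU inside ErasureCodeIsa): one process-wide table store keyed by the erasure pattern, so hundreds of PG-level plugin instances with the same (k,m) reuse one entry.

  #include <map>
  #include <mutex>
  #include <string>
  #include <vector>

  class SharedTableCache {
    std::mutex lock;
    std::map<std::string, std::vector<unsigned char> > tables; // pattern -> table
  public:
    bool get(const std::string &pattern, std::vector<unsigned char> &out) {
      std::lock_guard<std::mutex> g(lock);
      std::map<std::string, std::vector<unsigned char> >::iterator it =
        tables.find(pattern);
      if (it == tables.end())
        return false;
      out = it->second;    // copy to the caller's stack, as the plugin does
      return true;
    }
    void put(const std::string &pattern,
             const std::vector<unsigned char> &t) {
      std::lock_guard<std::mutex> g(lock);
      tables[pattern] = t; // a real implementation would bound this (LRU)
    }
  };

This also matches Jianpeng's observation: during recovery of one PG all objects miss the same chunks, so a single entry per pattern is typically hit after the first object.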
[PATCH] ECBackend: Don't directly use get_recovery_chunk_size() in RecoveryOp::WRITING state.
We cannot guarantee that conf->osd_recovery_max_chunk doesn't change while recovering an erasure-coded object. If it changes between RecoveryOp::READING and RecoveryOp::WRITING, it can cause this bug:

2014-07-30 10:12:09.599220 7f7ff26c0700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7f7ff26c0700 time 2014-07-30 10:12:09.596837
osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

ceph version 0.83-383-g3cfda57 (3cfda577b15039cb5c678b79bef3e561df826ed1)
1: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1a50) [0x928070]
2: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x90c) [0x92952c]
3: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x121) [0x938481]
4: (GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x9) [0x929d69]
5: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x63) [0x91c6e3]
6: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x96d) [0x920b4d]
7: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x17e) [0x92884e]
8: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x23b) [0x7b34db]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x428) [0x638d58]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x6392f6]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa5caae]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa5ed00]
13: (()+0x8182) [0x7f800b5d3182]
14: (clone()+0x6d) [0x7f800997430d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

So we only call get_recovery_chunk_size() once, in RecoveryOp::READING, and record it in RecoveryOp::extent_requested.
Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 src/osd/ECBackend.cc | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/osd/ECBackend.cc b/src/osd/ECBackend.cc
index 2a3913a..e9402e8 100644
--- a/src/osd/ECBackend.cc
+++ b/src/osd/ECBackend.cc
@@ -476,6 +476,7 @@ void ECBackend::continue_recovery_op(
       assert(!op.recovery_progress.data_complete);
       set<int> want(op.missing_on_shards.begin(), op.missing_on_shards.end());
       set<pg_shard_t> to_read;
+      uint64_t recovery_max_chunk = get_recovery_chunk_size();
       int r = get_min_avail_to_read_shards(
         op.hoid, want, true, &to_read);
       if (r != 0) {
@@ -492,11 +493,11 @@ void ECBackend::continue_recovery_op(
         this,
         op.hoid,
         op.recovery_progress.data_recovered_to,
-        get_recovery_chunk_size(),
+        recovery_max_chunk,
         to_read,
         op.recovery_progress.first);
       op.extent_requested = make_pair(op.recovery_progress.data_recovered_to,
-                                      get_recovery_chunk_size());
+                                      recovery_max_chunk);
       dout(10) << __func__ << ": IDLE return op" << dendl;
       return;
     }
@@ -506,7 +507,7 @@ void ECBackend::continue_recovery_op(
       assert(op.returned_data.size());
       op.state = RecoveryOp::WRITING;
       ObjectRecoveryProgress after_progress = op.recovery_progress;
-      after_progress.data_recovered_to += get_recovery_chunk_size();
+      after_progress.data_recovered_to += op.extent_requested.second;
       after_progress.first = false;
       if (after_progress.data_recovered_to >= op.obc->obs.oi.size) {
         after_progress.data_recovered_to =
--
1.9.1
[PATCH] ECBackend: Use ROUND_UP_TO to refactor get_recovery_chunk_size()
Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 src/osd/ECBackend.h | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/osd/ECBackend.h b/src/osd/ECBackend.h
index 28bcf8a..0e96407 100644
--- a/src/osd/ECBackend.h
+++ b/src/osd/ECBackend.h
@@ -149,10 +149,8 @@ public:
 private:
   friend struct ECRecoveryHandle;
   uint64_t get_recovery_chunk_size() const {
-    uint64_t max = cct->_conf->osd_recovery_max_chunk;
-    max -= max % sinfo.get_stripe_width();
-    max += sinfo.get_stripe_width();
-    return max;
+    return ROUND_UP_TO(cct->_conf->osd_recovery_max_chunk,
+		       sinfo.get_stripe_width());
   }

   /**
--
1.9.1
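For reference, a minimal sketch comparing the two computations, assuming the ROUND_UP_TO definition from src/include/intarith.h of that era. Note the two are not quite identical: the old arithmetic always adds one full stripe width, so for an osd_recovery_max_chunk that is already stripe-aligned the refactor returns the value unchanged instead of one stripe larger.

#include <cstdint>
#include <cstdio>

// Assumed definition (copied from src/include/intarith.h at the time).
#define ROUND_UP_TO(n, d) ((n)%(d) ? ((n)+(d)-(n)%(d)) : (n))

// The arithmetic being replaced: round down, then add one stripe width.
static uint64_t old_chunk_size(uint64_t max, uint64_t width) {
  max -= max % width;
  max += width;
  return max;
}

int main() {
  const uint64_t width = 4096;  // hypothetical stripe width
  // Unaligned input: both round 10000 up to 12288.
  printf("%llu %llu\n",
         (unsigned long long)old_chunk_size(10000, width),
         (unsigned long long)ROUND_UP_TO(10000ULL, width));
  // Aligned input: the old code yields 12288, ROUND_UP_TO keeps 8192.
  printf("%llu %llu\n",
         (unsigned long long)old_chunk_size(8192, width),
         (unsigned long long)ROUND_UP_TO(8192ULL, width));
  return 0;
}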
RE: Pull Request for ISA EC plug-in
Hi,
I hit this bug on my machine as well, but with the following change it works:

diff --git a/src/erasure-code/isa/Makefile.am b/src/erasure-code/isa/Makefile.am
index eebffea..dee593b 100644
--- a/src/erasure-code/isa/Makefile.am
+++ b/src/erasure-code/isa/Makefile.am
@@ -34,7 +34,7 @@
 libec_isa_la_SOURCES = ${isa_sources}
 libec_isa_la_CFLAGS = ${AM_CFLAGS} -I $(srcdir)/erasure-code/isa/isa-l/include/
 libec_isa_la_CXXFLAGS = ${AM_CXXFLAGS} -I $(srcdir)/erasure-code/isa/isa-l/include/
-libec_isa_la_CCASFLAGS = ${AM_CCASFLAGS} -I $(srcdir)/erasure-code/isa/isa-l/include/
+libec_isa_la_CCASFLAGS = ${AM_CCASFLAGS} -i $(srcdir)/erasure-code/isa/isa-l/include
 libec_isa_la_LIBADD = $(LIBCRUSH) $(PTHREAD_LIBS) $(EXTRALIBS)
 libec_isa_la_LDFLAGS = ${AM_LDFLAGS} -version-info 2:10:0

As Andreas said, the difference is between -I and -i. But how do we choose? My yasm version is 1.2.0.

Thanks!

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, July 31, 2014 7:02 AM
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: RE: Pull Request for ISA EC plug-in

On Wed, 30 Jul 2014, Andreas Joachim Peters wrote:
Hi Sage,
my guess is it comes from the yasm include option, which evt. is non-standard. Mine documents '-i' and '-I', while I found on the web that probably only '-I' is the portable one. I have changed that; still I can only guess, since I cannot reproduce.

Hmm, how are you building it? Running make on my work box (precise) also fails. I modified yasm-wrapper slightly [1] and it gets further, but still no dice...

Thanks!
sage

[1] https://github.com/ceph/ceph/commit/wip-ec-isa

Please rerun. Thanks, Andreas.

From: Sage Weil [sw...@redhat.com]
Sent: 30 July 2014 19:46
To: Andreas Joachim Peters
Cc: ceph-devel@vger.kernel.org
Subject: RE: Pull Request for ISA EC plug-in

On Mon, 28 Jul 2014, Andreas Joachim Peters wrote:
Hi Sage,
I fixed that. I missed '$(srcdir)' in the assembler and C/C++ include statements (I always compiled in the src dir). You can retry.

I think there are still a few issues:

make[4]: *** [erasure-code/isa/isa-l/erasure_code/libec_isa_la-ec_base.lo] Error 1
make[4]: *** Waiting for unfinished jobs
./yasm-wrapper: yasm -f elf64 -i ./erasure-code/isa/isa-l/include/ erasure-code/isa/isa-l/erasure_code/ec_multibinary.asm.s -o erasure-code/isa/isa-l/erasure_code/.libs/libec_isa_la-ec_multibinary.asm.o
FATAL: yasm: unable to open include file `reg_sizes.asm'
./yasm-wrapper: yasm -f elf64 -i ./erasure-code/isa/isa-l/include/ erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s -o erasure-code/isa/isa-l/erasure_code/.libs/libec_isa_la-gf_2vect_dot_prod_avx2.asm.o
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:157: undefined symbol `gf_2vect_dot_prod_avx2.return_fail' (first use)
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:157: (Each undefined symbol is reported only once.)
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:161: instruction expected after label
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:178: instruction expected after label
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:179: instruction expected after label
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:183: instruction expected after label
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:184: instruction expected after label
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:206: undefined symbol `vperm2i128.next_vect' (first use)
error: erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx2.asm.s:213: undefined symbol `vperm2i128.loop32' (first use)
make[4]: *** [erasure-code/isa/isa-l/erasure_code/libec_isa_la-ec_multibinary.asm.lo] Error 1
make[4]: *** [erasure-code/isa/isa-l/erasure_code/libec_isa_la-gf_2vect_dot_prod_avx2.asm.lo] Error 1
./yasm-wrapper: yasm -f elf64 -i ./erasure-code/isa/isa-l/include/ erasure-code/isa/isa-l/erasure_code/gf_2vect_dot_prod_avx.asm.s -o erasure-code/isa/isa-l/erasure_code/.libs/libec_isa_la-gf_2vect_dot_prod_avx.asm.o
In file included from erasure-code/isa/isa-l/erasure_code/ec_highlevel_func.c:30:0:
./erasure-code/isa/isa-l/include/erasure_code.h:52:25: fatal error: gf_vect_mul.h: No such file or directory
compilation terminated.

This is from http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-precise-amd64-basic/log.cgi?log=c7e765e22427958c0d03fbfec42a29aa34895ef3.

Thanks!
sage

Thanks Andreas.

From: ceph-devel-ow...@vger.kernel.org [ceph-devel-ow...@vger.kernel.org] on
RE: [RFC][PATCH] osd: Add local_connection to fast_dispatch in func _send_boot.
Hrm, I'd really like to see the startup sequence. I see the crash occurring, but I don't understand how it's happening; we test this pretty extensively, so there must be something about your testing configuration that is different from ours. Can you provide that part of the log, and maybe a little more description of what you think the problem is?

If ceph.conf contains a cluster addr, the bug always occurs. With no cluster addr in ceph.conf, the local connection is added to fast dispatch in _send_boot, via cluster_messenger->set_addr_unknowns().

In particular, we *always* call init_local_connection when the messenger starts, so every messenger who is allowed to receive EC messages should have the local connection set up before they get one.

Yes, you call init_local_connection, but the local connection can only be added to dispatch after the OSD is added to the messenger. In OSD::init:
  cluster_messenger->add_dispatcher_head(this);
Only after this can the local connection be added to dispatch, because once local_connection has the correct type it can be added to dispatch regardless of the cluster addr. When a Messenger is allocated it sets the type, and only after add_dispatcher_head/tail can the local connection be added to dispatch. Maybe adding ms_deliver_handle_fast_connect(local_connection.get()) in SimpleMessenger::ready would be better.

Jianpeng Ma

I don't really see how supplying the local connection as a new one in _send_boot *should* be fixing that, and it's not the place to do so (although I guess it's doing *something*, I just can't figure out what).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Wed, Jul 16, 2014 at 5:17 PM, Ma, Jianpeng <jianpeng...@intel.com> wrote:
Hi Greg,
The attachment is the log.
Thanks!

-----Original Message-----
From: Gregory Farnum [mailto:g...@inktank.com]
Sent: Thursday, July 17, 2014 3:41 AM
To: Ma, Jianpeng
Cc: ceph-devel@vger.kernel.org
Subject: Re: [RFC][PATCH] osd: Add local_connection to fast_dispatch in func _send_boot.

I'm looking at this and getting a little confused. Can you provide a log of the crash occurring? (preferably with debug_ms=20, debug_osd=20)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sun, Jul 13, 2014 at 8:17 PM, Ma, Jianpeng <jianpeng...@intel.com> wrote:
When doing an EC read, I hit a bug that occurred 100% of the time. The messages are:

2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function 'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time 2014-07-14 10:03:07.316782
osd/OSD.cc: 5019: FAILED assert(session)

ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
5: (()+0x8182) [0x7f7665670182]
6: (clone()+0x6d) [0x7f7663a1130d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

In commit 69fc6b2b66, fast_dispatch was enabled on local connections: the local_connection is added to fast_dispatch in init_local_connection. But if no fast dispatcher is registered yet, the local connection can't be added. If there is no cluster addr in ceph.conf, the local_connection is added to fast dispatch in _send_boot because the cluster_addr is empty; but if a cluster addr is configured, the local_connection is never added to fast dispatch. ECSubRead is sent to the OSD itself via send_message_osd_cluster, so it triggers this bug.
I don't know about hb_back/front_server_messenger, but they are handled in _send_boot like cluster_messenger, so I also modified those.

Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 src/osd/OSD.cc | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 52a3839..75b294b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -3852,29 +3852,37 @@ void OSD::_send_boot()
 {
   dout(10) << "_send_boot" << dendl;
   entity_addr_t cluster_addr = cluster_messenger->get_myaddr();
+  Connection *local_connection = cluster_messenger->get_loopback_connection().get();
   if (cluster_addr.is_blank_ip()) {
     int port = cluster_addr.get_port();
     cluster_addr = client_messenger->get_myaddr();
     cluster_addr.set_port(port);
     cluster_messenger->set_addr_unknowns(cluster_addr);
     dout(10) << " assuming cluster_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    cluster_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_back_addr = hb_back_server_messenger->get_myaddr();
+  local_connection = hb_back_server_messenger->get_loopback_connection().get();
   if (hb_back_addr.is_blank_ip()) {
     int port
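To illustrate the ordering argument above, here is a deliberately oversimplified, self-contained model. Everything in it is hypothetical: it is not the real SimpleMessenger API, only a sketch of when the loopback connection can and cannot be wired up for fast dispatch under the sequence described in this thread.

#include <cstdio>

// Hypothetical, heavily simplified messenger model (illustrative only).
struct Messenger {
  bool have_dispatcher = false;        // set by add_dispatcher_head()
  bool local_fast_dispatch_ok = false; // loopback conn wired for fast dispatch

  void init_local_connection() {
    // With no dispatcher registered there is nobody to deliver the
    // fast-connect event to, so this is a no-op at construction time.
    if (have_dispatcher)
      local_fast_dispatch_ok = true;
  }
  void add_dispatcher_head() { have_dispatcher = true; }   // OSD::init()
  void set_addr_unknowns()   { init_local_connection(); }  // blank-ip path
};

int main(int argc, char**) {
  bool cluster_addr_in_conf = argc > 1;  // pretend: cluster addr configured?

  Messenger m;
  m.init_local_connection();  // messenger construction: too early, no-op
  m.add_dispatcher_head();    // OSD::init() registers the OSD as dispatcher

  // _send_boot(): only the blank-ip branch re-runs init_local_connection().
  if (!cluster_addr_in_conf)
    m.set_addr_unknowns();

  // With a cluster addr configured, a loopback fast message (e.g. an
  // ECSubRead sent to ourselves) would arrive on a connection that never
  // saw a fast-connect event, matching the FAILED assert(session) crash.
  printf("fast dispatch ready: %d\n", m.local_fast_dispatch_ok);
  return 0;
}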
[PATCH] os/FileJournal: When dumping the journal, use the correct last seq to avoid misjudging the journal as corrupt.
In FileJournal::dump, seq is reset on every iteration, so the dump always uses seq=0 as the last seq and misjudges the journal as corrupt.

Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 src/os/FileJournal.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc
index 7eb7927..3307655 100644
--- a/src/os/FileJournal.cc
+++ b/src/os/FileJournal.cc
@@ -564,9 +564,9 @@ int FileJournal::dump(ostream& out)
   JSONFormatter f(true);

   f.open_array_section("journal");
+  uint64_t seq = 0;
   while (1) {
     bufferlist bl;
-    uint64_t seq = 0;
     uint64_t pos = read_pos;
     if (!read_entry(bl, seq)) {
       dout(3) << "journal_replay: end of journal, done." << dendl;
--
1.9.1
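To see why the declaration's scope matters, here is a minimal standalone sketch with a hypothetical stand-in for read_entry() (the real function's validation differs in detail): the seq passed in carries state from the previous entry, so re-declaring it inside the loop throws that state away on every iteration.

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for read_entry(): accept an entry only if its
// seq continues from the seq the caller passed in.
static bool read_entry(const std::vector<uint64_t>& entries, size_t& pos,
                       uint64_t& seq) {
  if (pos >= entries.size() || entries[pos] != seq + 1)
    return false;              // looks like end-of-journal / corruption
  seq = entries[pos++];
  return true;
}

int main() {
  std::vector<uint64_t> entries = {1, 2, 3};
  size_t pos = 0;

  uint64_t seq = 0;            // fixed shape: seq persists across iterations
  while (read_entry(entries, pos, seq))
    printf("dumped entry %llu\n", (unsigned long long)seq);
  // With the declaration inside the loop, seq would be reset to 0 each
  // time, entry 2 would fail the seq+1 check, and the dump would stop
  // after the first entry as if the journal were corrupt.
  return 0;
}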
RE: [RFC][PATCH] osd: Add local_connection to fast_dispatch in func _send_boot.
Ping...

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Monday, July 14, 2014 11:17 AM
To: g...@inktank.com
Cc: ceph-devel@vger.kernel.org
Subject: [RFC][PATCH] osd: Add local_connection to fast_dispatch in func _send_boot.

When doing an EC read, I hit a bug that occurred 100% of the time. The messages are:

2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function 'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time 2014-07-14 10:03:07.316782
osd/OSD.cc: 5019: FAILED assert(session)

ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
5: (()+0x8182) [0x7f7665670182]
6: (clone()+0x6d) [0x7f7663a1130d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

In commit 69fc6b2b66, fast_dispatch was enabled on local connections: the local_connection is added to fast_dispatch in init_local_connection. But if no fast dispatcher is registered yet, the local connection can't be added. If there is no cluster addr in ceph.conf, the local_connection is added to fast dispatch in _send_boot because the cluster_addr is empty; but if a cluster addr is configured, the local_connection is never added to fast dispatch. ECSubRead is sent to the OSD itself via send_message_osd_cluster, so it triggers this bug.

I don't know about hb_back/front_server_messenger, but they are handled in _send_boot like cluster_messenger, so I also modified those.

Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 src/osd/OSD.cc | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 52a3839..75b294b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -3852,29 +3852,37 @@ void OSD::_send_boot()
 {
   dout(10) << "_send_boot" << dendl;
   entity_addr_t cluster_addr = cluster_messenger->get_myaddr();
+  Connection *local_connection = cluster_messenger->get_loopback_connection().get();
   if (cluster_addr.is_blank_ip()) {
     int port = cluster_addr.get_port();
     cluster_addr = client_messenger->get_myaddr();
     cluster_addr.set_port(port);
     cluster_messenger->set_addr_unknowns(cluster_addr);
     dout(10) << " assuming cluster_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    cluster_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_back_addr = hb_back_server_messenger->get_myaddr();
+  local_connection = hb_back_server_messenger->get_loopback_connection().get();
   if (hb_back_addr.is_blank_ip()) {
     int port = hb_back_addr.get_port();
     hb_back_addr = cluster_addr;
     hb_back_addr.set_port(port);
     hb_back_server_messenger->set_addr_unknowns(hb_back_addr);
     dout(10) << " assuming hb_back_addr ip matches cluster_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    hb_back_server_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_front_addr = hb_front_server_messenger->get_myaddr();
+  local_connection = hb_front_server_messenger->get_loopback_connection().get();
   if (hb_front_addr.is_blank_ip()) {
     int port = hb_front_addr.get_port();
     hb_front_addr = client_messenger->get_myaddr();
     hb_front_addr.set_port(port);
     hb_front_server_messenger->set_addr_unknowns(hb_front_addr);
     dout(10) << " assuming hb_front_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    hb_front_server_messenger->ms_deliver_handle_fast_connect(local_connection);
   MOSDBoot *mboot = new MOSDBoot(superblock,
                                  service.get_boot_epoch(),
                                  hb_back_addr, hb_front_addr, cluster_addr);
--
1.9.1
[PATCH] qa/workunits/erasure-code/bench.sh: Use the correct parameter names when calling ceph_erasure_code_benchmark.
In commit bcc74879a, Loic modified the parameter names but forgot to update them in qa/workunits/erasure-code/bench.sh. Update the names so bench.sh works again.

Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 qa/workunits/erasure-code/bench.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/qa/workunits/erasure-code/bench.sh b/qa/workunits/erasure-code/bench.sh
index db7c490..483450b 100755
--- a/qa/workunits/erasure-code/bench.sh
+++ b/qa/workunits/erasure-code/bench.sh
@@ -50,9 +50,9 @@ function bench() {
         --iterations $iterations \
         --size $size \
         --erasures $erasures \
-        --parameter erasure-code-k=$k \
-        --parameter erasure-code-m=$m \
-        --parameter erasure-code-directory=$PLUGIN_DIRECTORY)
+        --parameter k=$k \
+        --parameter m=$m \
+        --parameter directory=$PLUGIN_DIRECTORY)
     result=$($command $@)
     echo -e "$result\t$plugin\t$k\t$m\t$workload\t$iterations\t$size\t$erasures\t$command $@"
 }
--
1.9.1
[RFC][PATCH] osd: Add local_connection to fast_dispatch in func _send_boot.
When doing an EC read, I hit a bug that occurred 100% of the time. The messages are:

2014-07-14 10:03:07.318681 7f7654f6e700 -1 osd/OSD.cc: In function 'virtual void OSD::ms_fast_dispatch(Message*)' thread 7f7654f6e700 time 2014-07-14 10:03:07.316782
osd/OSD.cc: 5019: FAILED assert(session)

ceph version 0.82-585-g79f3f67 (79f3f6749122ce2944baa70541949d7ca75525e6)
1: (OSD::ms_fast_dispatch(Message*)+0x286) [0x6544b6]
2: (DispatchQueue::fast_dispatch(Message*)+0x56) [0xb059d6]
3: (DispatchQueue::run_local_delivery()+0x6b) [0xb08e0b]
4: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0xa4a5fd]
5: (()+0x8182) [0x7f7665670182]
6: (clone()+0x6d) [0x7f7663a1130d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

In commit 69fc6b2b66, fast_dispatch was enabled on local connections: the local_connection is added to fast_dispatch in init_local_connection. But if no fast dispatcher is registered yet, the local connection can't be added. If there is no cluster addr in ceph.conf, the local_connection is added to fast dispatch in _send_boot because the cluster_addr is empty; but if a cluster addr is configured, the local_connection is never added to fast dispatch. ECSubRead is sent to the OSD itself via send_message_osd_cluster, so it triggers this bug.

I don't know about hb_back/front_server_messenger, but they are handled in _send_boot like cluster_messenger, so I also modified those.

Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 src/osd/OSD.cc | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 52a3839..75b294b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -3852,29 +3852,37 @@ void OSD::_send_boot()
 {
   dout(10) << "_send_boot" << dendl;
   entity_addr_t cluster_addr = cluster_messenger->get_myaddr();
+  Connection *local_connection = cluster_messenger->get_loopback_connection().get();
   if (cluster_addr.is_blank_ip()) {
     int port = cluster_addr.get_port();
     cluster_addr = client_messenger->get_myaddr();
     cluster_addr.set_port(port);
     cluster_messenger->set_addr_unknowns(cluster_addr);
     dout(10) << " assuming cluster_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    cluster_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_back_addr = hb_back_server_messenger->get_myaddr();
+  local_connection = hb_back_server_messenger->get_loopback_connection().get();
   if (hb_back_addr.is_blank_ip()) {
     int port = hb_back_addr.get_port();
     hb_back_addr = cluster_addr;
     hb_back_addr.set_port(port);
     hb_back_server_messenger->set_addr_unknowns(hb_back_addr);
     dout(10) << " assuming hb_back_addr ip matches cluster_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    hb_back_server_messenger->ms_deliver_handle_fast_connect(local_connection);
+
   entity_addr_t hb_front_addr = hb_front_server_messenger->get_myaddr();
+  local_connection = hb_front_server_messenger->get_loopback_connection().get();
   if (hb_front_addr.is_blank_ip()) {
     int port = hb_front_addr.get_port();
     hb_front_addr = client_messenger->get_myaddr();
     hb_front_addr.set_port(port);
     hb_front_server_messenger->set_addr_unknowns(hb_front_addr);
     dout(10) << " assuming hb_front_addr ip matches client_addr" << dendl;
-  }
+  } else if (local_connection->get_priv() == NULL)
+    hb_front_server_messenger->ms_deliver_handle_fast_connect(local_connection);
   MOSDBoot *mboot = new MOSDBoot(superblock,
                                  service.get_boot_epoch(),
                                  hb_back_addr, hb_front_addr, cluster_addr);
--
1.9.1
RE: [PATCH] ReplicatedPG: For async-read, set the real result after completing the read.
Hi Sam,
I'm a newbie and I don't have my own ceph repo on GitHub, so I can't send a pull request.

Jianpeng Ma

Thanks!

-----Original Message-----
From: Samuel Just [mailto:sam.j...@inktank.com]
Sent: Saturday, July 12, 2014 1:21 AM
To: Ma, Jianpeng
Cc: ceph-devel@vger.kernel.org
Subject: Re: [PATCH] ReplicatedPG: For async-read, set the real result after completing the read.

Looks about right, file a pull request?
-Sam

On Thu, Jul 10, 2014 at 2:23 AM, Ma, Jianpeng <jianpeng...@intel.com> wrote:
When reading an object from a replicated pool, Ceph reads synchronously, so it can set the result in execute_ctx correctly. However, for the async reads used by EC pools, the current code doesn't set the real result after the read completes in complete_read_ctx.

Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 src/osd/ReplicatedPG.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index cab3fde..ac0e8ad 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -5318,7 +5318,7 @@ void ReplicatedPG::complete_read_ctx(int result, OpContext *ctx)
     // on ENOENT, set a floor for what the next user version will be.
     reply->set_enoent_reply_versions(info.last_update, info.last_user_version);
   }
-
+  reply->set_result(result);
   reply->add_flags(CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK);
   osd->send_message_osd_client(reply, m->get_connection());
   close_op_ctx(ctx, 0);
--
1.9.1
[PATCH] ReplicatedPG: For async-read, set the real result after completing the read.
When reading an object from a replicated pool, Ceph reads synchronously, so it can set the result in execute_ctx correctly. However, for the async reads used by EC pools, the current code doesn't set the real result after the read completes in complete_read_ctx.

Signed-off-by: Ma Jianpeng <jianpeng...@intel.com>
---
 src/osd/ReplicatedPG.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index cab3fde..ac0e8ad 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -5318,7 +5318,7 @@ void ReplicatedPG::complete_read_ctx(int result, OpContext *ctx)
     // on ENOENT, set a floor for what the next user version will be.
     reply->set_enoent_reply_versions(info.last_update, info.last_user_version);
   }
-
+  reply->set_result(result);
   reply->add_flags(CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK);
   osd->send_message_osd_client(reply, m->get_connection());
   close_op_ctx(ctx, 0);
--
1.9.1
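A minimal sketch of the asymmetry being fixed (hypothetical types, not the real OSD classes): the async completion path built the reply without ever storing the read's return code, so the reply carried whatever default result it was constructed with rather than the read's actual outcome.

#include <cstdio>

struct Reply {
  int result = 0;  // default-constructed result, as a stand-in
  void set_result(int r) { result = r; }
};

// Hypothetical async completion mirroring the shape of complete_read_ctx():
// before the fix the computed result was passed in but never copied into
// the reply.
static void complete_read(Reply& reply, int result, bool fixed) {
  if (fixed)
    reply.set_result(result);  // the one-line fix
  // ... add flags, send reply to the client, close the op context ...
}

int main() {
  Reply before, after;
  complete_read(before, -2 /* e.g. an error from the async read */, false);
  complete_read(after, -2, true);
  printf("before fix: %d, after fix: %d\n", before.result, after.result);
  return 0;
}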
[PATCH] rbd.cc: Check io-size to avoid a floating point exception.
In func do_bench_write, if io_size is zero it causes a floating point exception.

Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
---
 src/rbd.cc | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/rbd.cc b/src/rbd.cc
index d6658e3..c068ed6 100644
--- a/src/rbd.cc
+++ b/src/rbd.cc
@@ -2038,6 +2038,14 @@ int main(int argc, const char **argv)
 	return EXIT_FAILURE;
       }
     } else if (ceph_argparse_withlonglong(args, i, &bench_io_size, &err, "--io-size", (char*)NULL)) {
+      if (!err.str().empty()) {
+	cerr << "rbd: " << err.str() << std::endl;
+	return EXIT_FAILURE;
+      }
+      if (bench_io_size == 0) {
+	cerr << "rbd: io-size must be > 0" << std::endl;
+	return EXIT_FAILURE;
+      }
     } else if (ceph_argparse_withlonglong(args, i, &bench_io_threads, &err, "--io-threads", (char*)NULL)) {
     } else if (ceph_argparse_withlonglong(args, i, &bench_bytes, &err, "--io-total", (char*)NULL)) {
     } else if (ceph_argparse_witharg(args, i, &bench_pattern, &err, "--io-pattern", (char*)NULL)) {
--
1.9.1
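For background on the error being guarded against: on Linux, integer division or modulo by zero delivers SIGFPE, which is reported as a "floating point exception" even though no floating point is involved. A minimal sketch of the shape of the problem and the guard (names are illustrative, not the real rbd bench code):

#include <cstdint>
#include <cstdio>

int main() {
  uint64_t io_total = 1 << 20;
  uint64_t io_size = 0;  // what an unvalidated --io-size 0 would produce

  // The guard the patch adds at argument-parsing time: reject zero before
  // any arithmetic like io_total / io_size can run and raise SIGFPE.
  if (io_size == 0) {
    fprintf(stderr, "rbd: io-size must be > 0\n");
    return 1;
  }

  // Without the guard, this integer division would crash the process
  // with "Floating point exception" when io_size is zero.
  printf("%llu ops\n", (unsigned long long)(io_total / io_size));
  return 0;
}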