Re: [Gluster-devel] Re: Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev
LiPing10168633/user/zte_ltd wrote on 2016/01/28 21:40:30:

> From: 李平10168633/user/zte_ltd
> To: Pranith Kumar Karampuri <pkara...@redhat.com>, gluster-devel@gluster.org
> Cc: li.y...@zte.com.cn, liu.jianj...@zte.com.cn, yang.bi...@zte.com.cn, zhou.shigan...@zte.com.cn
> Date: 2016/01/28 21:40
> Subject: Re: Re: [Gluster-devel] Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev
>
> Sorry for the late reply.
>
> Pranith Kumar Karampuri <pkara...@redhat.com> wrote on 2016/01/25 17:48:06:
>
> > On 01/25/2016 03:09 PM, li.ping...@zte.com.cn wrote:
> > > Hi Pranith,
> > >
> > > I'd be willing to take the chance to contribute to open source. This is
> > > my first time submitting a patch for GlusterFS, so I'm not familiar
> > > with the code review and submission procedures. I'll try to make it
> > > ASAP. By the way, are there any guidelines for this work?
> >
> > http://www.gluster.org/community/documentation/index.php/Simplified_dev_workflow
> > may be helpful. Feel free to ask about any doubt you may have.
> >
> > How do you guys use glusterfs?
> >
> > Pranith
>
> Thanks for your warm tips. We currently use glusterfs to build shared
> storage for distributed cluster nodes.
>
> Here are the solutions I have pondered over these days:
>
> 1. Revert the AFR GLUSTERFS_WRITE_IS_APPEND modifications, because this
> optimization only helps appending write fops, and most writes are not of
> that kind. Hence I think it is not worth optimizing for the
> low-probability case at the cost of a performance drop for the vast
> majority of AFR writes.
>
> 2. Turn the fixed GLUSTERFS_WRITE_IS_APPEND dictionary option in
> afr_writev into a dynamic one, i.e. add a new configurable option
> "write_is_append", just like the existing "ensure-durability" option for
> AFR. It could be switched on when AFR write performance is not the main
> concern, and off when performance is demanded.
>
> I have been trying to find a way in posix_writev to predict an appending
> write in advance, and then lock or not lock accordingly, but I have not
> found one.

3. Another compromise that crossed my mind today is to let WRITE_IS_APPEND
not take effect for O_DIRECT writes. It is already ineffective for SYNC
writes, and page-cache write performance is not so bad (though not as good
as no locking, of course).

I would prefer the 2nd and 3rd ways. Are there any other opinions?

> Anybody's other good ideas are appreciated.
>
> Ping.Li
>
> Pranith Kumar Karampuri <pkara...@redhat.com> wrote on 2016/01/23 14:01:36:
>
> > On 01/22/2016 07:14 AM, li.ping...@zte.com.cn wrote:
> > > Hi Pranith, your reply is appreciated.
> > >
> > > Pranith Kumar Karampuri <pkara...@redhat.com> wrote on 2016/01/20 18:51:19:
> > >
> > > > Sorry for the delay in response.
> > > >
> > > > On 01/15/2016 02:34 PM, li.ping...@zte.com.cn wrote:
> > > > > The GLUSTERFS_WRITE_IS_APPEND setting in the afr_writev function
> > > > > at the glusterfs client end makes posix_writev at the server end
> > > > > handle IO write fops serially instead of in parallel.
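The gating idea behind option 2 can be sketched as follows. This is a minimal standalone model, not the real GlusterFS code: the struct and option names are hypothetical stand-ins for AFR's private option state. The point is that afr_writev would place the append hint in the write's xdata dictionary only when the proposed knob is on, and only when ensure-durability is also on, since the hint can only ever save the post-op fsync in that case.

```c
/* Sketch of option 2: gate the GLUSTERFS_WRITE_IS_APPEND hint behind a
 * configurable knob, the way ensure-durability already gates the post-op
 * fsync. afr_opts_t and the field names are hypothetical illustrations,
 * not the real GlusterFS afr_private_t. */
#include <stdbool.h>

typedef struct {
    bool ensure_durability;  /* existing AFR option */
    bool write_is_append;    /* proposed new option */
} afr_opts_t;

/* afr_writev would add the append hint to xdata only when this returns true. */
bool should_set_write_is_append(const afr_opts_t *opts)
{
    /* The hint only saves an fsync() when ensure-durability is on, so
     * setting it otherwise costs write serialization for no benefit. */
    return opts->ensure_durability && opts->write_is_append;
}
```

With such a knob, a deployment that cares about throughput would leave write_is_append off and keep fully parallel writes, while a deployment that wants the fsync saving would opt in.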
Re: [Gluster-devel] Re: Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev
Sorry for the late reply.

Pranith Kumar Karampuri <pkara...@redhat.com> wrote on 2016/01/25 17:48:06:

> From: Pranith Kumar Karampuri <pkara...@redhat.com>
> To: li.ping...@zte.com.cn
> Cc: li.y...@zte.com.cn, zhou.shigan...@zte.com.cn, liu.jianj...@zte.com.cn, yang.bi...@zte.com.cn
> Date: 2016/01/25 17:48
> Subject: Re: Re: [Gluster-devel] Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev
>
> On 01/25/2016 03:09 PM, li.ping...@zte.com.cn wrote:
> > Hi Pranith,
> >
> > I'd be willing to take the chance to contribute to open source. This is
> > my first time submitting a patch for GlusterFS, so I'm not familiar with
> > the code review and submission procedures. I'll try to make it ASAP. By
> > the way, are there any guidelines for this work?
>
> http://www.gluster.org/community/documentation/index.php/Simplified_dev_workflow
> may be helpful. Feel free to ask about any doubt you may have.
>
> How do you guys use glusterfs?
>
> Pranith

Thanks for your warm tips. We currently use glusterfs to build shared
storage for distributed cluster nodes.

Here are the solutions I have pondered over these days:

1. Revert the AFR GLUSTERFS_WRITE_IS_APPEND modifications, because this
optimization only helps appending write fops, and most writes are not of
that kind. Hence I think it is not worth optimizing for the low-probability
case at the cost of a performance drop for the vast majority of AFR writes.

2. Turn the fixed GLUSTERFS_WRITE_IS_APPEND dictionary option in afr_writev
into a dynamic one, i.e. add a new configurable option "write_is_append",
just like the existing "ensure-durability" option for AFR. It could be
switched on when AFR write performance is not the main concern, and off
when performance is demanded.

I have been trying to find a way in posix_writev to predict an appending
write in advance, and then lock or not lock accordingly, but I have not
found one.

Anybody's other good ideas are appreciated.

Ping.Li

> Thanks & Best Regards.
>
> Pranith Kumar Karampuri <pkara...@redhat.com> wrote on 2016/01/23 14:01:36:
>
> > On 01/22/2016 07:14 AM, li.ping...@zte.com.cn wrote:
> > > Hi Pranith, your reply is appreciated.
> > >
> > > Pranith Kumar Karampuri <pkara...@redhat.com> wrote on 2016/01/20 18:51:19:
> > >
> > > > Sorry for the delay in response.
> > > >
> > > > On 01/15/2016 02:34 PM, li.ping...@zte.com.cn wrote:
> > > > > The GLUSTERFS_WRITE_IS_APPEND setting in the afr_writev function
> > > > > at the glusterfs client end makes posix_writev at the server end
> > > > > handle IO write fops serially instead of in parallel.
> > > > >
> > > > > i.e. multiple io-worker threads carrying out IO write fops are
> > > > > blocked in posix_writev and execute the final write fop
> > > > > pwrite/pwritev in the __posix_writev function ONE AFTER ANOTHER.
> > > > >
> > > > > For example:
> > > > >
> > > > > thread1: iot_worker -> ... -> posix_writev()                     |
> > > > > thread2: iot_worker -> ... -> posix_writev()                     |
> > > > > thread3: iot_worker -> ... -> posix_writev() -> __posix_writev()
> > > > > thread4: iot_worker -> ... -> posix_writev()                     |
> > > > >
> > > > > There are 4 iot_worker threads doing 128KB IO write fops as above,
> > > > > but only one can execute the __posix_writev function; the others
> > > > > have to wait.
> > > > >
> > > > > However, if the afr volume is configured with storage.linux-aio,
> > > > > which is off by default, the iot_worker will use posix_aio_writev
> > > > > instead of posix_writev to write data. posix_aio_writev is not
> > > > > affected by GLUSTERFS_WRITE_IS_APPEND, and the AFR volume write
> > > > > performance goes up.
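The serialization the thread diagram above describes can be reproduced with a toy model: several "io-worker" threads all funnel through one lock before doing their write, so at most one is ever writing no matter how many workers exist. This is a standalone illustration of the bottleneck, not GlusterFS code.

```c
/* Toy model: WORKERS threads contend for one lock, as io-worker threads do
 * at __posix_writev() when GLUSTERFS_WRITE_IS_APPEND forces serialization.
 * We record the peak number of threads inside the critical section. */
#include <pthread.h>

#define WORKERS 4
#define WRITES_PER_WORKER 1000

static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
static int writers_inside = 0;   /* threads currently "in __posix_writev" */
static int max_writers_seen = 0; /* peak concurrency observed */

static void *iot_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < WRITES_PER_WORKER; i++) {
        pthread_mutex_lock(&write_lock);  /* the serialization point */
        if (++writers_inside > max_writers_seen)
            max_writers_seen = writers_inside;
        /* a real worker would pwritev() here */
        --writers_inside;
        pthread_mutex_unlock(&write_lock);
    }
    return NULL;
}

/* Runs the workers and returns the peak number of concurrent "writers". */
int run_serialized_workers(void)
{
    pthread_t t[WORKERS];
    max_writers_seen = 0;
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&t[i], NULL, iot_worker, NULL);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(t[i], NULL);
    return max_writers_seen; /* the mutex admits one writer at a time */
}
```

Because the mutex admits exactly one thread at a time, the peak is always 1: four workers deliver the throughput of one, which matches the single-thread-level numbers reported later in the thread.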
Re: [Gluster-devel] Re: Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev
On 01/22/2016 07:14 AM, li.ping...@zte.com.cn wrote:
> Hi Pranith, your reply is appreciated.
>
> Pranith Kumar Karampuri wrote on 2016/01/20 18:51:19:
>
> > Sorry for the delay in response.
> >
> > On 01/15/2016 02:34 PM, li.ping...@zte.com.cn wrote:
> > > The GLUSTERFS_WRITE_IS_APPEND setting in the afr_writev function at
> > > the glusterfs client end makes posix_writev at the server end handle
> > > IO write fops serially instead of in parallel.
> > >
> > > i.e. multiple io-worker threads carrying out IO write fops are blocked
> > > in posix_writev and execute the final write fop pwrite/pwritev in the
> > > __posix_writev function ONE AFTER ANOTHER.
> > >
> > > For example:
> > >
> > > thread1: iot_worker -> ... -> posix_writev()                     |
> > > thread2: iot_worker -> ... -> posix_writev()                     |
> > > thread3: iot_worker -> ... -> posix_writev() -> __posix_writev()
> > > thread4: iot_worker -> ... -> posix_writev()                     |
> > >
> > > There are 4 iot_worker threads doing 128KB IO write fops as above, but
> > > only one can execute the __posix_writev function; the others have to
> > > wait.
> > >
> > > However, if the afr volume is configured with storage.linux-aio, which
> > > is off by default, the iot_worker will use posix_aio_writev instead of
> > > posix_writev to write data. posix_aio_writev is not affected by
> > > GLUSTERFS_WRITE_IS_APPEND, and the AFR volume write performance goes up.
> >
> > I think this is a bug :-(.
>
> Yeah, I agree with you. I suppose GLUSTERFS_WRITE_IS_APPEND is misused in
> afr_writev. I checked the original intent of the GLUSTERFS_WRITE_IS_APPEND
> change on the review website: http://review.gluster.org/#/c/5501/
>
> The initial purpose seems to be avoiding an unnecessary fsync() in the
> afr_changelog_post_op_safe function when the write position is currently
> at the end of the file, detected by (preop.ia_size == offset ||
> (fd->flags & O_APPEND)) in posix_writev. Compared with the AFR write
> performance loss, I think it costs too much. I suggest making the
> GLUSTERFS_WRITE_IS_APPEND setting configurable, just like
> ensure-durability in afr.

You are right, it doesn't make sense to put this option in the dictionary
if ensure-durability is off. http://review.gluster.org/13285 addresses
this. Do you want to try this out? Thanks for doing most of the work :-).
Do let me know if you want to raise a bug for this, or I can take that up
if you don't have time.

Pranith

> > > So, my question is whether an AFR volume can work fine with the
> > > storage.linux-aio configuration, which bypasses the
> > > GLUSTERFS_WRITE_IS_APPEND setting in afr_writev, and why glusterfs
> > > keeps posix_aio_writev different from posix_writev.
> > >
> > > Any replies to clear up my confusion would be appreciated; thanks in
> > > advance.
> >
> > What is the workload you have? Multiple writers on the same file?
>
> I test the afr gluster volume with fio like this:
>
> fio --filename=/mnt/afr/20G.dat --direct=1 --rw=write --bs=128k --size=20G --numjobs=8 --runtime=60 --group_reporting --name=afr_test --iodepth=1 --ioengine=libaio
>
> The Glusterfs bricks are two IBM X3550 M3 servers. Local disk direct-write
> performance with a 128KB IO request block size is about 18MB/s with a
> single thread and 80MB/s with 8 threads. If GLUSTERFS_WRITE_IS_APPEND is
> configured, the afr gluster volume write performance is 18MB/s, the same
> as a single thread; if not, it is nearly 75MB/s (network bandwidth is
> sufficient).
>
> > Pranith

ZTE Information Security Notice: The information contained in this mail
(and any attachment transmitted herewith) is privileged and confidential
and is intended for the exclusive use of the addressee(s). If you are not
an intended recipient, any disclosure, reproduction, distribution or other
dissemination or use of the information contained is strictly prohibited.
If you have received this mail in error, please delete it and notify us
immediately.

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
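The append check quoted above, from the change at http://review.gluster.org/#/c/5501/, can be sketched as a standalone predicate. Plain parameters stand in for posix_writev's iatt, offset, and fd state; this is a model of the condition, not the real brick code.

```c
/* A write is treated as an append when the pre-op file size equals the
 * write offset (the write starts exactly at EOF), or the fd was opened
 * with O_APPEND. Mirrors the quoted condition
 *   (preop.ia_size == offset || (fd->flags & O_APPEND)). */
#include <fcntl.h>     /* O_APPEND, O_RDWR */
#include <stdbool.h>
#include <sys/types.h> /* off_t */

bool write_is_append(off_t preop_ia_size, off_t offset, int fd_flags)
{
    return preop_ia_size == offset || (fd_flags & O_APPEND) != 0;
}
```

When the check holds on every brick, afr_changelog_post_op_safe can skip the extra fsync(); the cost, as this thread shows, is that posix_writev must serialize writes so the pre-op size it compares against stays accurate.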
[Gluster-devel] Re: Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev
Hi Pranith, your reply is appreciated.

Pranith Kumar Karampuri wrote on 2016/01/20 18:51:19:

> From: Pranith Kumar Karampuri
> To: li.ping...@zte.com.cn, gluster-devel@gluster.org
> Date: 2016/01/20 18:51
> Subject: Re: [Gluster-devel] Gluster AFR volume write performance has been seriously affected by GLUSTERFS_WRITE_IS_APPEND in afr_writev
>
> Sorry for the delay in response.
>
> On 01/15/2016 02:34 PM, li.ping...@zte.com.cn wrote:
> > The GLUSTERFS_WRITE_IS_APPEND setting in the afr_writev function at the
> > glusterfs client end makes posix_writev at the server end handle IO
> > write fops serially instead of in parallel.
> >
> > i.e. multiple io-worker threads carrying out IO write fops are blocked
> > in posix_writev and execute the final write fop pwrite/pwritev in the
> > __posix_writev function ONE AFTER ANOTHER.
> >
> > For example:
> >
> > thread1: iot_worker -> ... -> posix_writev()                     |
> > thread2: iot_worker -> ... -> posix_writev()                     |
> > thread3: iot_worker -> ... -> posix_writev() -> __posix_writev()
> > thread4: iot_worker -> ... -> posix_writev()                     |
> >
> > There are 4 iot_worker threads doing 128KB IO write fops as above, but
> > only one can execute the __posix_writev function; the others have to
> > wait.
> >
> > However, if the afr volume is configured with storage.linux-aio, which
> > is off by default, the iot_worker will use posix_aio_writev instead of
> > posix_writev to write data. posix_aio_writev is not affected by
> > GLUSTERFS_WRITE_IS_APPEND, and the AFR volume write performance goes up.
>
> I think this is a bug :-(.

Yeah, I agree with you. I suppose GLUSTERFS_WRITE_IS_APPEND is misused in
afr_writev. I checked the original intent of the GLUSTERFS_WRITE_IS_APPEND
change on the review website: http://review.gluster.org/#/c/5501/

The initial purpose seems to be avoiding an unnecessary fsync() in the
afr_changelog_post_op_safe function when the write position is currently at
the end of the file, detected by (preop.ia_size == offset || (fd->flags &
O_APPEND)) in posix_writev. Compared with the AFR write performance loss, I
think it costs too much. I suggest making the GLUSTERFS_WRITE_IS_APPEND
setting configurable, just like ensure-durability in afr.

> > So, my question is whether an AFR volume can work fine with the
> > storage.linux-aio configuration, which bypasses the
> > GLUSTERFS_WRITE_IS_APPEND setting in afr_writev, and why glusterfs keeps
> > posix_aio_writev different from posix_writev.
> >
> > Any replies to clear up my confusion would be appreciated; thanks in
> > advance.
>
> What is the workload you have? Multiple writers on the same file?

I test the afr gluster volume with fio like this:

fio --filename=/mnt/afr/20G.dat --direct=1 --rw=write --bs=128k --size=20G --numjobs=8 --runtime=60 --group_reporting --name=afr_test --iodepth=1 --ioengine=libaio

The Glusterfs bricks are two IBM X3550 M3 servers. Local disk direct-write
performance with a 128KB IO request block size is about 18MB/s with a single
thread and 80MB/s with 8 threads. If GLUSTERFS_WRITE_IS_APPEND is
configured, the afr gluster volume write performance is 18MB/s, the same as
a single thread; if not, it is nearly 75MB/s (network bandwidth is
sufficient).

> Pranith