Re: [ceph-users] xattrs vs. omap with radosgw
> We've since merged something that stripes over several small xattrs so
> that we can keep things inline, but it hasn't been backported to hammer
> yet. See c6cdb4081e366f471b372102905a1192910ab2da.

Hi Sage: You wrote "yet" - should we earmark it for hammer backport?

Nathan
Re: [ceph-users] xattrs vs. omap with radosgw
On Wed, Jun 17, 2015 at 1:02 PM, Nathan Cutler ncut...@suse.cz wrote:
>> We've since merged something that stripes over several small xattrs so
>> that we can keep things inline, but it hasn't been backported to hammer
>> yet. See c6cdb4081e366f471b372102905a1192910ab2da.
>
> Hi Sage: You wrote "yet" - should we earmark it for hammer backport?

I'm guessing https://github.com/ceph/ceph/pull/4973 is the backport for
hammer (issue http://tracker.ceph.com/issues/11981).

Regards,
Abhishek
Re: [ceph-users] xattrs vs. omap with radosgw
On Wed, 17 Jun 2015, Nathan Cutler wrote:
>> We've since merged something that stripes over several small xattrs so
>> that we can keep things inline, but it hasn't been backported to hammer
>> yet. See c6cdb4081e366f471b372102905a1192910ab2da.
>
> Hi Sage: You wrote "yet" - should we earmark it for hammer backport?

Yes, please!

sage
[ceph-users] xattrs vs. omap with radosgw
Hi Cephers,

While looking at disk utilization on an OSD, I noticed the disk was
constantly busy with a large number of small writes. Further investigation
showed that radosgw uses xattrs to store metadata (e.g. etag, content-type,
etc.), which pushes the xattrs from local (inline) storage out into
extents and incurs extra I/O.

I would like to check if anybody has experience with offloading the
metadata to omap:

 1. Offload everything to omap? If so, should we make the inode size 512
    bytes (instead of 2k)?
 2. Partially offload the metadata to omap, e.g. only the rgw-specific
    metadata.

Any sharing is deeply appreciated. Thanks!

Thanks,
Guang
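For anyone who wants to confirm this on their own cluster, here is a minimal
sketch of how the xattr payload on a filestore object file could be inspected,
assuming direct access to the OSD's data directory (likely as root). The path
below is hypothetical, and the 255-byte figure is the XFS inline limit
discussed later in this thread:

    import os

    # Hypothetical path to one RGW object file under the OSD's filestore
    # data directory; the real path depends on pool, PG, and object name.
    obj = "/var/lib/ceph/osd/ceph-0/current/11.7f_head/myobject__head_ABCDEF12__b"

    total = 0
    for name in os.listxattr(obj):
        size = len(os.getxattr(obj, name))
        total += size
        print("%-40s %6d bytes" % (name, size))
    print("total xattr payload: %d bytes" % total)

    # On XFS, any single value larger than ~255 bytes cannot stay in the
    # inode's literal area and spills into a separate extent (extra I/O).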
Re: [ceph-users] xattrs vs. omap with radosgw
Guang,

Try to play around with the following conf attributes, especially
filestore_max_inline_xattr_size and filestore_max_inline_xattrs:

    // Use omap for xattrs for attrs over
    // filestore_max_inline_xattr_size or
    OPTION(filestore_max_inline_xattr_size, OPT_U32, 0)  //Override
    OPTION(filestore_max_inline_xattr_size_xfs, OPT_U32, 65536)
    OPTION(filestore_max_inline_xattr_size_btrfs, OPT_U32, 2048)
    OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)

    // for more than filestore_max_inline_xattrs attrs
    OPTION(filestore_max_inline_xattrs, OPT_U32, 0)  //Override
    OPTION(filestore_max_inline_xattrs_xfs, OPT_U32, 10)
    OPTION(filestore_max_inline_xattrs_btrfs, OPT_U32, 10)
    OPTION(filestore_max_inline_xattrs_other, OPT_U32, 2)

I think the behavior for XFS is that if there are more than 10 xattrs, it
will use omap.

Thanks,
Regards
Somnath
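If one did want to push more of the rgw metadata out to omap (Guang's
option 2) rather than keep it inline, lowering those thresholds in
ceph.conf is the knob Somnath is pointing at. A minimal sketch of an [osd]
section; the values are purely illustrative, not recommendations:

    [osd]
        # single xattr values larger than this many bytes go to omap
        # (0 means "use the per-filesystem default quoted above")
        filestore_max_inline_xattr_size = 256
        # objects with more xattrs than this spill the extras to omap
        filestore_max_inline_xattrs = 6

Worth testing on a non-production OSD first, since whether omap (leveldb)
or spilled xattr extents is cheaper depends on the workload - which is
exactly Guang's question.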
Re: [ceph-users] xattrs vs. omap with radosgw
On Wed, 17 Jun 2015, Zhou, Yuan wrote:
> FWIW, there was some discussion in OpenStack Swift and their performance
> tests showed 255 is not the best in recent XFS. They decided to use a
> large xattr boundary size (65535).
>
> https://gist.github.com/smerritt/5e7e650abaa20599ff34

If I read this correctly, the total metadata they are setting is pretty big:

    PILE_O_METADATA = pickle.dumps(dict(
        ('attribute%d' % i, hashlib.sha512('thingy %d' % i).hexdigest())
        for i in range(200)))

So lots of small attrs won't really help since they'll have to spill out
into other extents eventually no matter what. In our case, we have big
(2k) inodes and can easily fit everything in there... as long as it is in
255 byte pieces.

sage

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, June 17, 2015 3:43 AM
To: GuangYang
Cc: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: Re: xattrs vs. omap with radosgw

Hi Guang,

Is this hammer or firefly? With hammer the size of object_info_t crossed
the 255 byte boundary, which is the max xattr value that XFS can inline.
We've since merged something that stripes over several small xattrs so
that we can keep things inline, but it hasn't been backported to hammer
yet. See c6cdb4081e366f471b372102905a1192910ab2da. Perhaps this is what
you're seeing?

I think we're still better off with larger XFS inodes and inline xattrs
if it means we avoid leveldb at all for most objects.

sage
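For readers who haven't looked at the commit: the idea in
c6cdb4081e366f471b372102905a1192910ab2da is to stripe one logical attribute
across several small xattrs so that every on-disk value stays under the
inline limit. A simplified sketch of the concept in Python - the 255-byte
piece size and the "@<n>" continuation-name suffix are illustrative
assumptions here, not necessarily the exact on-disk naming Ceph uses:

    import os

    MAX_PIECE = 255  # largest value XFS keeps in the inode's literal area

    def set_striped_xattr(path, name, value):
        # Split one logical attribute into <= MAX_PIECE-byte chunks so each
        # stored value can stay inline; continuation pieces get an "@<n>"
        # suffix (illustrative naming).
        pieces = [value[i:i + MAX_PIECE] for i in range(0, len(value), MAX_PIECE)] or [b""]
        os.setxattr(path, name, pieces[0])
        for n, piece in enumerate(pieces[1:], start=1):
            os.setxattr(path, "%s@%d" % (name, n), piece)

    def get_striped_xattr(path, name):
        # Reassemble the pieces in order until a continuation is missing.
        value = os.getxattr(path, name)
        n = 1
        while True:
            try:
                value += os.getxattr(path, "%s@%d" % (name, n))
            except OSError:
                return value
            n += 1

For example, set_striped_xattr("/tmp/obj", "user.rgw.manifest", manifest)
with a 381-byte manifest would store a 255-byte piece plus a 126-byte piece,
both of which can stay inline in a 2k inode.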
Re: [ceph-users] xattrs vs. omap with radosgw
After back-porting Sage's patch to Giant, with radosgw the xattrs can be
inlined. I haven't run extensive testing yet; I will update once I have
some performance data to share.

Thanks,
Guang

> Date: Tue, 16 Jun 2015 15:51:44 -0500
> From: mnel...@redhat.com
> To: yguan...@outlook.com; s...@newdream.net
> CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> Subject: Re: xattrs vs. omap with radosgw
>
> On 06/16/2015 03:48 PM, GuangYang wrote:
>> Thanks Sage for the quick response. It is on Firefly v0.80.4.
>>
>> While trying to put with *rados* directly, the xattrs can be inlined.
>> The problem comes to light when using radosgw, since we have a bunch of
>> metadata to keep via xattrs, including:
>>
>>   rgw.idtag    : 15 bytes
>>   rgw.manifest : 381 bytes
>
> Ah, that manifest will push us over the limit afaik, resulting in every
> inode getting a new extent.
>
>>   rgw.acl      : 121 bytes
>>   rgw.etag     : 33 bytes
>>
>> Given the background, it looks like the problem is that rgw.manifest is
>> too large, so XFS pushes it into extents. If I understand correctly, if
>> we port the change to Firefly, we should be able to keep everything
>> inline in the inode, since the accumulated size is still less than 2K
>> (please correct me if I am wrong here).
>
> I think you are correct so long as the patch breaks that manifest down
> into 254 byte or smaller chunks.
>
>> Thanks,
>> Guang
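A quick back-of-the-envelope check with the sizes Guang lists, using the
255-byte per-value inline limit and the 2048-byte inode size quoted
elsewhere in the thread (a sketch; object_info_t and other OSD-internal
xattrs add some more on top of these):

    # xattr sizes as reported above
    rgw_xattrs = {
        "rgw.idtag":    15,
        "rgw.manifest": 381,
        "rgw.acl":      121,
        "rgw.etag":     33,
    }
    PER_VALUE_INLINE_LIMIT = 255   # per-value XFS inline limit (per the thread)
    INODE_SIZE = 2048              # XFS inode size discussed (-i size=2048)

    total = sum(rgw_xattrs.values())
    print("total rgw xattr payload: %d bytes" % total)   # 550

    # Without striping, rgw.manifest (381 bytes) is the only value over the
    # limit, so it alone forces an out-of-inode extent.
    spilled = [n for n, s in rgw_xattrs.items() if s > PER_VALUE_INLINE_LIMIT]
    print("values that spill without striping:", spilled)

    # With the striping patch every piece is <= 255 bytes, and the whole
    # payload fits comfortably inside a 2k inode.
    print("fits in a %d-byte inode: %s" % (INODE_SIZE, total < INODE_SIZE))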
Re: [ceph-users] xattrs vs. omap with radosgw
Hi Yuan,

Thanks for sharing the link, it is interesting to read. My understanding
of the test results is that, with a fixed total size of xattrs, using a
smaller stripe size incurs larger read latency, which kind of makes sense
since there are more k-v pairs to fetch, and at that total size the data
has to go to extents anyway. Correct me if I am wrong here...

Thanks,
Guang

> From: yuan.z...@intel.com
> To: s...@newdream.net; yguan...@outlook.com
> CC: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
> Subject: RE: xattrs vs. omap with radosgw
> Date: Wed, 17 Jun 2015 01:32:35
>
> FWIW, there was some discussion in OpenStack Swift and their performance
> tests showed 255 is not the best in recent XFS. They decided to use a
> large xattr boundary size (65535).
>
> https://gist.github.com/smerritt/5e7e650abaa20599ff34
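To put rough numbers on that intuition, one can re-create the metadata pile
from the quoted gist and count how many xattr pieces it would take at the
two boundary sizes being compared (a sketch; the gist is Python 2, so the
hashing input is encoded here):

    import hashlib, math, pickle

    # Approximate re-creation of the payload from the Swift test gist.
    pile = pickle.dumps(dict(
        ("attribute%d" % i, hashlib.sha512(("thingy %d" % i).encode()).hexdigest())
        for i in range(200)))

    total = len(pile)
    for stripe in (255, 65535):
        pieces = math.ceil(total / stripe)
        print("%d bytes -> %d xattr piece(s) at a %d-byte boundary" % (total, pieces, stripe))

    # More pieces means more keys to store and more getxattr calls on read,
    # which lines up with the higher read latency seen for small stripes.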