Re: consequences of XFS_IOC_FSSETXATTR on non-empty file?
On Mon, Jul 14, 2014 at 2:55 AM, Dave Chinner da...@fromorbit.com wrote:
> On Sun, Jul 13, 2014 at 09:01:13PM +0400, Ilya Dryomov wrote:
>> On Sun, Jul 13, 2014 at 5:48 AM, Samuel Just sam.j...@inktank.com wrote:
>>> Actually, on this ubuntu kernel (3.13.0-24-generic), it doesn't seem to
>>> give an error. I'll attach my test case for that. We don't yet have a
>>> way of reproducing the corruption -- the ext_size change in the osd
>>> simply seemed like a promising lead.
>>> -Sam
>>>
>>> On Sat, Jul 12, 2014 at 6:26 PM, Dave Chinner da...@fromorbit.com wrote:
>>>> On Sat, Jul 12, 2014 at 06:16:54PM -0700, Samuel Just wrote:
>>>>> Hi, We are seeing reports of ceph-osd stores on xfs of files with
>>>>> some garbage data (possibly misplaced from data elsewhere in the
>>>>> filesystem). There was a bug for a while where the ceph-osd process
>>>>> would set a value for fsx_extsize on a non-empty (possibly sparse)
>>>>> file using XFS_IOC_FSSETXATTR. Could that plausibly result in a file
>>>>> with garbage data?
>>>> No, setting an extent size on a non-empty file will simply fail with
>>>> EINVAL.
>> AFAIR it checks whether or not any extents are actually allocated, not
>> whether the file is empty or not.
> FWIW, that's an implementation detail, not the definition of the intended
> behaviour of the ioctl. Indeed, the man page says:
>
>     The fsx_xflags realtime file bit and the file's extent size may be
>     changed only when the file is empty, ...
>
> For most people, "[non-]empty file" is much more easily understood than "a
> file without real extents, but which might have been written to and so
> have dirty, in-memory delayed allocation data whose asynchronous flushing
> may or may not affect the behaviour of a call to XFS_IOC_FSSETXATTR".
> i.e. the intended application behaviour is that they should only be able
> to change the extent size hint *before* any data is written to the file.
>> I think if you call fsync() or even fdatasync() before close(fd), it
>> will fail as expected.
> Only if you are trying to change the extent size immediately after the
> first write you do to an empty file. Which is, as per the above, not the
> recommended or intended use of the ioctl.

That's understood, but that is exactly what Sam's test program happens to try to do, so I had to point the "file w/o real extents" thing out.

Thanks,

                Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
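For reference, the ordering constraint discussed above can be sketched in a few lines: set the extent size hint on a freshly created, still-empty file, before the first write ever reaches it. This is a hedged illustration and not ceph-osd's actual code; the struct fsxattr layout and the _IOW encoding are taken from my reading of the Linux headers (an assumption, verify against linux/fs.h), and the ioctl itself will of course only succeed on an XFS filesystem.

```python
import fcntl
import os
import struct

# struct fsxattr (from linux/fs.h, assumed layout): __u32 fsx_xflags,
# fsx_extsize, fsx_nextents, fsx_projid, fsx_cowextsize, then 8 pad bytes.
FSXATTR_FMT = 'IIIII8s'  # 28 bytes total

def _IOW(type_char, nr, size):
    # Linux ioctl number encoding: direction(write) | size | type | nr
    return (1 << 30) | (size << 16) | (ord(type_char) << 8) | nr

# FS_IOC_FSSETXATTR is _IOW('X', 32, struct fsxattr)
FS_IOC_FSSETXATTR = _IOW('X', 32, struct.calcsize(FSXATTR_FMT))
XFS_XFLAG_EXTSIZE = 0x00000800

def create_with_extsize(path, extsize):
    """Create a file and set its extent size hint *before* writing any
    data, which is the intended usage per the thread above. Requires XFS."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        buf = struct.pack(FSXATTR_FMT, XFS_XFLAG_EXTSIZE, extsize,
                          0, 0, 0, b'\0' * 8)
        # On a file with real extents already allocated (e.g. written and
        # fsync'ed), this is the call that fails with EINVAL.
        fcntl.ioctl(fd, FS_IOC_FSSETXATTR, buf)
        os.write(fd, b'data written after the hint is in place')
    finally:
        os.close(fd)
```

Doing the write-then-set order instead, with an fsync() in between, is what makes the EINVAL deterministic rather than dependent on delayed-allocation flushing.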
Re: EC pool - empty files in OSD from radosgw
Hi Yehuda and Sam,

Any suggestions on the issue below?

Thanks,
Guang

On Jul 12, 2014, at 12:43 AM, Guang Yang yguan...@outlook.com wrote:
> Hi Loic,
> I opened an issue for a change brought along with EC pools plus radosgw
> (http://tracker.ceph.com/issues/8625). In our test cluster we observed a
> large number of empty files in the OSDs. The root cause is that for a head
> object from radosgw, a couple of transactions come together, including
> create 0~0, setxattr and writefull. As EC pools bring in the concept of
> object generations, the create transaction first creates an object, and
> the following writefull transaction is taken as an update: it renames the
> original empty file to a generation and creates/writes a new file. As a
> result, we observed quite a few empty files.
>
> There is a bug tracking the effort to remove those files with generations,
> pending backport to firefly; that would definitely help our use case.
> However, I am also wondering if there is any room for improvement so that
> those empty files are not generated in the first place (the change might
> be on the radosgw side). Any suggestion is welcome.
>
> Thanks,
> Guang
Re: v0.80.2?
Hi,

> Ideally the thing to do here is run s3-tests on your end and confirm that
> the tests are failing with the patch and figure out why. Or, if it passes
> for you, we can figure out what is different between your environment and
> QA. And then, ideally, we can extend s3-tests to reproduce the failure
> that you are seeing with the patch reverted so that we can confirm it is
> fixed (and stays fixed).

I just tried some tests on various revisions. It's not s3-tests, because I couldn't figure out how to run just the relevant test (trying to select only the multipart test yields "0 tests run", and the full suite fails because I run lighttpd and some error codes are different; it also takes forever).

So I have my own "create bucket - upload multipart - download it back - delete bucket" test, written in Python with just boto. It's part of the test suite I use internally to validate S3 before deploying to prod (it tests all the features we use, including some weird stuff).

I know the test fails on 0.80.1. I also know it works fine on our own internal package of firefly (based on an intermediate version of the firefly branch between 0.80.1 and 0.80.2, plus some patches I backported myself).

* 8589c108dd5f00471986deae72d88811ccf0627f (which should be 0.80.3): fails to do the multipart upload.
* Re-applying b300318113b162522759d4794b1cfa3d1d9398e4 on top of it: actually crashes / segfaults ... so yeah, really not good.

So I tried going back to d43e7113dd501aea1db33fdae30d56e96e9c3897, which is the revision I based my packages on, and applied b300318113b162522759d4794b1cfa3d1d9398e4: multipart upload works and no crash!

So something between d43e7113dd501aea1db33fdae30d56e96e9c3897 and 8589c108dd5f00471986deae72d88811ccf0627f is causing the fix to misbehave ...

Cheers,
Sylvain
Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
Hi all,

I have watched the discussion video from Ceph CDS (and sorry for my absence, something urgent came up). It seems we have two ways to implement this, one lightweight and one complex. I like the simple one, which prefers invalidating the cache and letting librbd reload/lazy-load object state. The most important piece is implementing a performance-optimized index (ObjectMap). Has there been any progress, Josh? I think we could push further based on the discussion. Or did I miss something?

On Wed, Jun 11, 2014 at 12:01 PM, Gregory Farnum g...@inktank.com wrote:
> We discussed a great deal of this during the initial format 2 work as
> well, when we were thinking about having bitmaps of allocated space.
> (Although we also have interval sets, which might be a better fit?) I
> think there was more thought behind it than is in the copy-on-read
> blueprint; do you know if we have it written down anywhere, Josh?
> -Greg
>
> On Tue, Jun 10, 2014 at 12:38 PM, Josh Durgin josh.dur...@inktank.com wrote:
>> On Tue, 10 Jun 2014 14:52:54 +0800 Haomai Wang haomaiw...@gmail.com wrote:
>>> Thanks, Josh! Your points are really helpful. Maybe we can schedule
>>> this bp for the next CDS? I hope the implementation can have a great
>>> performance effect on librbd.
>> It'd be great to discuss it more at CDS. Could you add a blueprint for
>> it on the wiki: https://wiki.ceph.com/Planning/Blueprints/Submissions
>> Josh
>>
>>> On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin josh.dur...@inktank.com wrote:
>>>> On 06/05/2014 12:01 AM, Haomai Wang wrote:
>>>>> Hi, Previously I sent a mail about the difficulty of rbd snapshot
>>>>> size statistics. The main solution is to use an object map to store
>>>>> the changes. The problem is that we can't handle concurrent
>>>>> modification by multiple clients. The lack of an object map (like the
>>>>> pointer map in qcow2) causes many problems in librbd, such as clone
>>>>> depth: a deep clone chain causes remarkable latency, and usually each
>>>>> level of cloning doubles it.
>>>>>
>>>>> I am considering a tradeoff between multi-client and single-client
>>>>> support in librbd. In practice, most volumes/images are used by a VM,
>>>>> where only one client will access/modify the image. We shouldn't make
>>>>> most use cases worse just to keep shared images possible. So we can
>>>>> add a new flag called "shared" when creating an image. If "shared" is
>>>>> false, librbd will maintain an object map for each image. The object
>>>>> map is meant to be durable: each image_close call stores the map into
>>>>> rados. If a client crashes and fails to dump the object map, the next
>>>>> client to open the image will consider the object map out of date and
>>>>> reset it.
>>>>>
>>>>> The advantages of this feature are easy to see:
>>>>> 1. Avoid the clone performance problem.
>>>>> 2. Make snapshot statistics possible.
>>>>> 3. Improve librbd operation performance, including read and
>>>>>    copy-on-write operations.
>>>>>
>>>>> What do you think of the above? More feedback is appreciated!
>>>> I think it's a great idea! We discussed this a little at the last CDS
>>>> [1]. I like the idea of the shared flag on an image. Since the vastly
>>>> more common case is single-client, I'd go further and suggest that we
>>>> treat images as if shared is false by default if the flag is not
>>>> present (perhaps with a config option to change this default
>>>> behavior). That way existing images can benefit from the feature
>>>> without extra configuration. There can be an rbd command to toggle the
>>>> shared flag as well, so users of ocfs2 or gfs2 or other
>>>> multi-client-writing systems can upgrade and set shared to true before
>>>> restarting their clients.
>>>>
>>>> Another thing to consider is the granularity of the object map. The
>>>> coarse granularity of a bitmap of object existence would be simplest,
>>>> and most useful for in-memory comparison for clones. For statistics it
>>>> might be desirable in the future to have a finer-grained index of data
>>>> existence in the image. To make that easy to handle, the on-disk
>>>> format could be a list of extents (byte ranges).
>>>>
>>>> Another potential use case would be a mode in which the index is
>>>> treated as authoritative. This could make discard very fast, for
>>>> example. I'm not sure it could be done safely with only binary
>>>> 'exists/does not exist' information though - a third 'unknown' state
>>>> might be needed for some cases. If this kind of index is actually
>>>> useful (I'm not sure there are cases where the performance penalty
>>>> would be worth it), we could add a new index format if we need it.
>>>>
>>>> Back to the currently proposed design, to be safe with live migration
>>>> we'd need to make sure the index is consistent in the destination
>>>> process. Using rados_notify() after we set the clean flag on the index
>>>> can make the destination vm re-read the index before any I/O happens.
>>>> This might be a good time to introduce a data payload to the notify as
>>>> well, so we can only re-read the index, instead of all the header
>>>> metadata. Rereading the index after cache invalidation and wiring that
>>>> up through qemu's
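To make the coarse bitmap granularity discussed in the thread concrete, here is a minimal sketch of a per-image object-existence map: one bit per backing RADOS object. This is an illustration of the data structure only; the names (ObjectMap, mark_written, and so on) are hypothetical and do not reflect librbd's eventual on-disk format.

```python
class ObjectMap:
    """Hypothetical per-image bitmap: one bit per RADOS object backing
    the image (default 4MB objects)."""

    def __init__(self, image_size, object_size=4 << 20):
        self.object_size = object_size
        self.num_objects = (image_size + object_size - 1) // object_size
        self.bits = bytearray((self.num_objects + 7) // 8)

    def mark_written(self, offset, length):
        """Set the existence bit for every object a write touches."""
        first = offset // self.object_size
        last = (offset + length - 1) // self.object_size
        for i in range(first, last + 1):
            self.bits[i >> 3] |= 1 << (i & 7)

    def exists(self, obj_index):
        return bool(self.bits[obj_index >> 3] & (1 << (obj_index & 7)))

    def used_bytes(self):
        """Upper bound on provisioned size: the snapshot-statistics use
        case from the proposal."""
        ones = sum(bin(b).count('1') for b in self.bits)
        return ones * self.object_size

# A cloned image could consult its map and skip the parent round trip for
# objects that exist locally, which is where the clone-depth latency win
# would come from.
m = ObjectMap(image_size=64 << 20)   # 16 objects of 4MB
m.mark_written(0, 10 << 20)          # a 10MB write covers objects 0..2
```

A finer-grained extent list, as Josh suggests for statistics, would replace the bitmap with a list of (offset, length) ranges at the cost of more complex merging on writes.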
ceph branch status
-- All Branches --

Alfredo Deza alfredo.d...@inktank.com
  2013-09-27 10:33:52 -0400 wip-5900
  2014-07-08 13:58:35 -0400 wip-8679
Dan Mick dan.m...@inktank.com
  2013-07-16 23:00:06 -0700 wip-5634
David Zafman david.zaf...@inktank.com
  2014-06-05 00:22:34 -0700 wip-8231
Greg Farnum g...@inktank.com
  2013-02-13 14:46:38 -0800 wip-mds-snap-fix
  2013-02-22 19:57:53 -0800 wip-4248-snapid-journaling
  2013-09-30 14:37:49 -0700 wip-filestore-test
  2013-10-09 13:31:38 -0700 cuttlefish-4832
  2013-11-15 14:41:51 -0800 wip-librados-command2
  2013-12-09 16:21:41 -0800 wip-hitset-snapshots
  2014-01-29 08:44:01 -0800 wip-filestore-fast-lookup
  2014-04-28 14:51:59 -0700 wip-messenger-locking
  2014-05-20 13:36:10 -0700 wip-xattr-spillout-basic
  2014-05-29 14:54:29 -0700 wip-client-fast-dispatch
  2014-06-16 14:57:41 -0700 wip-8519-osd-unblocking
  2014-07-01 15:19:21 -0700 dumpling
Ilya Dryomov ilya.dryo...@inktank.com
  2014-06-27 14:51:39 +0400 wip-dis-tmp
James Page james.p...@ubuntu.com
  2013-02-27 22:50:38 + wip-debhelper-8
Jenkins jenk...@inktank.com
  2014-07-09 19:32:26 + last
Joao Eduardo Luis joao.l...@inktank.com
  2013-04-18 00:01:24 +0100 wip-4521-tool
  2013-04-22 15:14:28 +0100 wip-4748
  2013-04-24 16:42:11 +0100 wip-4521
  2013-04-30 18:45:22 +0100 wip-mon-compact-dbg
  2013-05-21 01:46:13 +0100 wip-monstoretool-foo
  2013-05-31 16:26:02 +0100 wip-mon-cache-first-last-committed
  2013-05-31 21:00:28 +0100 wip-mon-trim-b
  2013-07-20 04:30:59 +0100 wip-mon-caps-test
  2013-07-23 16:21:46 +0100 wip-5704-cuttlefish
  2013-07-23 17:35:59 +0100 wip-5704
  2013-08-02 22:54:42 +0100 wip-5648
  2013-08-12 11:21:29 -0700 wip-store-tool.cuttlefish
  2013-09-25 22:08:24 +0100 wip-6378
  2013-10-10 14:06:59 +0100 wip-mon-set-pspool
  2013-12-09 16:39:19 + wip-mon-mdsmap-trim.dumpling
  2013-12-18 22:17:09 + wip-monstoretool-genmdsmaps
  2014-01-17 17:11:59 -0800 wip-fix-pipe-comment-for-fhaas
  2014-02-02 14:10:39 + wip-7277.for-loic
  2014-04-04 22:32:41 +0100 wip-mon-fix
  2014-04-21 15:55:28 +0100 wip-7514
  2014-04-22 17:58:58 +0100 wip-8165-joao
  2014-06-24 23:16:17 +0100 wip-8624-with-amazing-foo
  2014-07-11 16:06:02 +0100 wip-8696.with-test-mdsfixes
John Spray jcsp...@gmail.com
  2014-03-03 13:10:05 + wip-mds-stop-rank-0
John Spray john.sp...@inktank.com
  2014-03-06 13:01:25 + wip-mds-debug
  2014-04-01 15:06:19 +0100 wip-5382
John Spray john.sp...@redhat.com
  2014-06-25 22:54:13 -0400 wip-mds-sessions
  2014-07-09 12:43:04 +0100 wip-vstart-test
  2014-07-09 17:08:55 +0100 wip-mds-session-asok
  2014-07-10 17:15:46 +0100 wip-set_layout
  2014-07-14 14:15:27 +0100 wip-mds-session-asok-squash
John Spray jsp...@redhat.com
  2014-07-08 20:03:21 +0100 wip-layout-test
John Wilkins john.wilk...@inktank.com
  2013-07-31 18:00:50 -0700 wip-doc-rados-python-api
  2014-07-03 07:31:14 -0700 wip-doc-rgw-federated
Josh Durgin josh.dur...@inktank.com
  2013-03-01 14:45:23 -0800 wip-rbd-workunit-debug
  2013-07-25 18:44:10 -0700 wip-5488-2
  2013-08-14 15:51:04 -0700 wip-5970
  2013-08-27 12:03:08 -0700 wip-krbd-workunits
  2013-11-22 15:17:08 -0800 wip-zero-copy-bufferlist
  2013-11-25 13:59:29 -0800 wip-init-highlander
  2013-12-17 08:16:59 -0800 wip-rbd-deadlock-lockdep
  2013-12-18 12:28:39 -0800 wip-rbd-deadlock-lockdep-dumpling
  2013-12-26 18:06:39 -0800 emperor-5426
  2013-12-26 18:07:13 -0800 dumpling-5426
  2014-02-06 20:31:43 -0800 wip-librados-obj-ops
  2014-02-06 20:31:47 -0800 wip-librados-op-rvals
  2014-03-03 14:27:39 -0800 wip-object-cacher-memory
  2014-04-08 15:13:44 -0700 wip-6480-0.67.7
  2014-04-27 14:13:07 -0700 wip-rbd-header-dump
  2014-06-06 15:40:12 -0700 wip-libradosstriper
  2014-07-08 14:23:05 -0700 wip-rgw-firefly-3
  2014-07-08 17:26:30 -0700 wip-rocksdb
Loic Dachary l...@dachary.org
  2014-07-02 21:09:59 +0200 wip-8475
  2014-07-03 12:46:15 +0200 wip-7238-lrc
Matt Benjamin m...@linuxbox.com
  2013-10-08 16:49:23 -0400 wip-libcephfs-emp-rb
Mohammad Salehe salehe+...@gmail.com
  2014-05-27 18:58:53 +0430 wip-pybind-fixes
Noah Watkins noah.watk...@inktank.com
  2013-01-05 11:58:38 -0800 wip-localized-read-tests
  2013-10-18 15:42:50 -0700 cls-lua
  2013-11-05 07:30:19 -0800 port/old
  2013-11-06 08:39:57 -0800 wip-6636
  2013-11-26 08:26:24 -0800 wip-boost-uuid
  2013-12-30 09:47:40 -0800
Re: v0.80.2?
Here's the culprit IMHO:

    ea68b9372319fd0bab40856db26528d36359102e
    rgw: don't allow multiple writers to same multiobject part

    Fixes: #8269
    Backport: firefly, dumpling

    A client might need to retry a multipart part write. The original
    thread might race with the new one, trying to clean up after it,
    clobbering the part's data. The fix is to detect whether an original
    part already existed, and if so use a different part name for it.

    Signed-off-by: Yehuda Sadeh yeh...@inktank.com
    (cherry picked from commit bd8e026f88b812cc70caf6232c247844df5d99bf)

This is the first commit that, if you apply b300318113b162522759d4794b1cfa3d1d9398e4 over it, will cause a segfault.

Cheers,
Sylvain
Re: consequences of XFS_IOC_FSSETXATTR on non-empty file?
On Mon, Jul 14, 2014 at 11:24:05AM +0400, Ilya Dryomov wrote:
> On Mon, Jul 14, 2014 at 2:55 AM, Dave Chinner da...@fromorbit.com wrote:
>> On Sun, Jul 13, 2014 at 09:01:13PM +0400, Ilya Dryomov wrote:
>>> On Sun, Jul 13, 2014 at 5:48 AM, Samuel Just sam.j...@inktank.com wrote:
>>> I think if you call fsync() or even fdatasync() before close(fd), it
>>> will fail as expected.
>> Only if you are trying to change the extent size immediately after the
>> first write you do to an empty file. Which is, as per the above, not the
>> recommended or intended use of the ioctl.
> That's understood, but that is exactly what Sam's test program happens to
> try to do, so I had to point the "file w/o real extents" thing out.

Oh, I missed that there was a test program attached. That's what happens when people top post a reply then attach text at the bottom...

Moral of the story: don't top post ;)

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
rpm dependencies
Hey Luis,

I pushed wip-dencoder, which moves ceph-dencoder from ceph-common to ceph. This avoids the dependency for ceph-common. For the librados2 package, though, there is still a leveldb dependency that makes no sense. See the packages at http://gitbuilder.ceph.com/ceph-rpm-rhel6_5-x86_64-basic/ref/wip-dencoder/x86_64/

$ wget http://gitbuilder.ceph.com/ceph-rpm-rhel6_5-x86_64-basic/ref/wip-dencoder/x86_64/librados2-0.82-595.g95f5a44.el6.x86_64.rpm
$ rpm -qpR librados2-0.82-595.g95f5a44.el6.x86_64.rpm | grep level
warning: librados2-0.82-595.g95f5a44.el6.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 03c3951a: NOKEY
libleveldb.so.1()(64bit)
$ rpm2cpio librados2-0.82-595.g95f5a44.el6.x86_64.rpm | cpio -i --make-directories
...
$ ldd usr/lib/librados2.so.2 | grep -c leveldb
1

Our Makefile.am doesn't specify that we link against leveldb for librados, although strangely it is listed in the librados.la file anyway (along with every other lib anything links against). Despite this, ldd .libs/librados.so on my (debian) box doesn't list it when building locally. Makefile-env.am has --as-needed, which is supposed to make us record only the dependencies we actually use. Not sure what is going on with the RPM build that is putting this dependency in place... Any idea what is going on?

:/
sage
Read from clones
Hi,

Following the below set of steps, we are seeing data loss while reading from clones.

1) Create an image with image format 2 (in this case with a size of 1024MB):
   rbd create image1 --size 1024 --image-format 2
2) Map the image and write 1024MB worth of data to it.
3) Create a snapshot of the image created in step 1):
   rbd snap create image1@snap1
4) Create a clone of the snapshot created in step 3):
   rbd clone image1@snap1 clone1
5) Create a snapshot of the clone created in step 4):
   rbd snap create clone1@snap2
6) Create a clone of the snapshot created in step 5):
   rbd clone clone1@snap2 clone2
7) Shrink the clone created in step 4), in this case to half of its size:
   rbd resize -s 512 --allow-shrink clone1
8) Map the clone created in step 6) and try reading 1024MB worth of data.
9) Our observation is that only the first 512MB worth of data is intact; the rest is not copied over. (In fact, it is only clone1's parent-overlap worth of data that is always copied over!)

After the above steps, the parent overlap for clone2 is 1024MB, whereas the parent overlap for clone1 is 512MB. Our understanding is that, since clone2's parent snapshot was taken before the shrink was performed on clone1, any reads within clone2's parent overlap should be serviced from its parent (at least as long as there are no overwrites on clone2, which is the case here), and we are not finding that to be true.

To support our theory: if the parent image (a base RBD image, not a clone) is shrunk, reads on clones created before the shrink behave as we expect.

Wanted to check if this is indeed a bug or if we are missing something here. The tests were run on ceph version 0.80.

Thanks,
M.Laks...
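As a sanity check of the semantics described in the report, here is a toy model (plain Python, not librbd) of how a clone read is expected to fall through the parent chain based on each child's own recorded parent overlap. Sizes are in MB, image names mirror the reproducer, and the model encodes the reporter's expectation: snap2 was taken before the shrink, so clone2 should still be able to read the full 1024MB through the chain. Whether librbd actually behaves this way is exactly the open question in the thread.

```python
def read_mb(images, name, offset_mb):
    """Return the byte value stored at offset_mb (one 'byte' per MB in
    this toy model), following the parent chain. Reads with no local data
    and past the recorded parent overlap return 0, like an unallocated
    block."""
    img = images[name]
    if offset_mb in img['data']:
        return img['data'][offset_mb]
    parent = img.get('parent')
    if parent is not None and offset_mb < img['overlap_mb']:
        return read_mb(images, parent, offset_mb)
    return 0

images = {
    # base image: 1024MB fully written with a recognizable pattern
    'image1@snap1': {'data': {mb: 0xAB for mb in range(1024)},
                     'parent': None},
    # clone1 as captured by snap2, i.e. *before* the shrink to 512MB,
    # so its recorded overlap with image1@snap1 is still 1024MB
    'clone1@snap2': {'data': {}, 'parent': 'image1@snap1',
                     'overlap_mb': 1024},
    # clone2, created from clone1@snap2, with no overwrites
    'clone2': {'data': {}, 'parent': 'clone1@snap2',
               'overlap_mb': 1024},
}
```

Under this model, read_mb(images, 'clone2', 700) follows clone2 -> clone1@snap2 -> image1@snap1 and returns the pattern byte, which is what the reporter expected at step 8) and did not observe beyond 512MB.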