Re: consequences of XFS_IOC_FSSETXATTR on non-empty file?

2014-07-14 Thread Ilya Dryomov
On Mon, Jul 14, 2014 at 2:55 AM, Dave Chinner da...@fromorbit.com wrote:
 On Sun, Jul 13, 2014 at 09:01:13PM +0400, Ilya Dryomov wrote:
 On Sun, Jul 13, 2014 at 5:48 AM, Samuel Just sam.j...@inktank.com wrote:
  Actually, on this ubuntu kernel (3.13.0-24-generic), it doesn't seem
  to give an error.  I'll attach my test case for that.  We don't yet
  have a way of reproducing the corruption -- the ext_size change in the
  osd simply seemed like a promising lead.
  -Sam
 
  On Sat, Jul 12, 2014 at 6:26 PM, Dave Chinner da...@fromorbit.com wrote:
  On Sat, Jul 12, 2014 at 06:16:54PM -0700, Samuel Just wrote:
  Hi,
 
  We are seeing reports of ceph-osd stores on xfs of files with some
  garbage data (possibly misplaced from data elsewhere in the
  filesystem).  There was a bug for a while where the ceph-osd process
  would set a value for fsx_extsize on a non-empty (possibly sparse)
  file using XFS_IOC_FSSETXATTR.  Could that plausibly result in a file
  with garbage data?
 
  No, setting an extent size on a non-empty file will simply fail
  with EINVAL.

 AFAIR it checks whether any extents are actually allocated, not
 whether the file is empty.

 FWIW, that's an implementation detail, not the definition of the
 intended behaviour of the ioctl.  Indeed, the man page says:

 "The fsx_xflags realtime file bit and the file's extent size may be
 changed only when the file is empty, ..."

 For most people, "[non-]empty file" is much more easily understood
 than "a file without real extents, but which might have been written to
 and so have dirty, in-memory delayed allocation data whose
 asynchronous flushing may or may not affect the behaviour of a call
 to XFS_IOC_FSSETXATTR".

 i.e. the intended application behaviour is that they should only be
 able to change the extent size hint *before* any data is written to
 the file.

 I think if you call fsync() or even
 fdatasync() before close(fd), it will fail as expected.

 Only if you are trying to change the extent size immediately after
 the first write you do to an empty file. Which is, as per the above,
 not the recommended or intended use of the ioctl.

That's understood, but that is exactly what Sam's test program happens
to try to do, so I had to point out the "file w/o real extents" detail.

Thanks,

Ilya
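
[Editor's note: for readers unfamiliar with the ioctl under discussion, below is
a minimal Python sketch of the intended call pattern: set the extent size hint
on a freshly created, still-empty file, before any data is written. It assumes
the x86-64 _IOC() encoding and the 28-byte struct fsxattr layout of kernels
from this era, and it is only an illustration, not Sam's attached test case.]

import fcntl, os, struct

# Assumed struct fsxattr layout (28 bytes):
#   __u32 fsx_xflags; __u32 fsx_extsize; __u32 fsx_nextents;
#   __u32 fsx_projid; unsigned char fsx_pad[12];
FSXATTR_FMT = "IIII12x"
FSXATTR_SIZE = struct.calcsize(FSXATTR_FMT)

def _ioc(direction, ioc_type, nr, size):
    # Linux _IOC() encoding on x86-64: dir<<30 | size<<16 | type<<8 | nr
    return (direction << 30) | (size << 16) | (ord(ioc_type) << 8) | nr

XFS_IOC_FSGETXATTR = _ioc(2, 'X', 31, FSXATTR_SIZE)  # _IOR('X', 31, struct fsxattr)
XFS_IOC_FSSETXATTR = _ioc(1, 'X', 32, FSXATTR_SIZE)  # _IOW('X', 32, struct fsxattr)
XFS_XFLAG_EXTSIZE = 0x00000800

def create_with_extsize_hint(path, extsize_bytes):
    """Create an empty file and set its extent size hint before any write."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        cur = fcntl.ioctl(fd, XFS_IOC_FSGETXATTR, b"\0" * FSXATTR_SIZE)
        xflags, _extsize, nextents, projid = struct.unpack(FSXATTR_FMT, cur)
        newattr = struct.pack(FSXATTR_FMT, xflags | XFS_XFLAG_EXTSIZE,
                              extsize_bytes, nextents, projid)
        fcntl.ioctl(fd, XFS_IOC_FSSETXATTR, newattr)
        return fd  # the caller writes data only after the hint is in place
    except Exception:
        os.close(fd)
        raise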


Re: EC pool - empty files in OSD from radosgw

2014-07-14 Thread Guang Yang
Hi Yehuda and Sam,
Any suggestions on the issue below?

Thanks,
Guang

On Jul 12, 2014, at 12:43 AM, Guang Yang yguan...@outlook.com wrote:

 Hi Loic,
 I opened an issue about a change that comes along with EC pools plus
 radosgw (http://tracker.ceph.com/issues/8625). In our test cluster we
 observed a large number of empty files in the OSDs. The root cause is that for
 the head object from radosgw, a couple of transactions arrive together,
 including create 0~0, setxattr and writefull. Since EC pools introduce the
 concept of object generations, the create transaction first creates an object,
 and the following writefull transaction is treated as an update, which renames
 the original empty file to a generation and creates/writes a new file. As a
 result, we observed quite a few empty files.
 
 There is a bug tracking the effort to remove those generation files, and the
 fix is pending a backport to firefly; that would definitely help our use
 case. However, I am also wondering if there is any room for improvement here
 so that those empty files are not generated in the first place
 (the change might be on the radosgw side).
 
 Any suggestions are welcome.
 
 Thanks,
 Guang
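
[Editor's note: a quick way to quantify the symptom described above is to count
zero-length files under an OSD's filestore directory. A minimal sketch, meant
to be run on an OSD host; the default data path is an assumption and may
differ on your deployment.]

import os, sys

root = sys.argv[1] if len(sys.argv) > 1 else '/var/lib/ceph/osd/ceph-0/current'
empty = total = 0
for dirpath, _dirs, files in os.walk(root):
    for name in files:
        total += 1
        if os.path.getsize(os.path.join(dirpath, name)) == 0:
            empty += 1
print("%d of %d files under %s are zero-length" % (empty, total, root))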



Re: v0.80.2?

2014-07-14 Thread Sylvain Munaut
Hi,

 Ideally the thing to do here is run s3-tests on your end and confirm that
 the tests are failing with the patch and figure out why.  Or, if it passes
 for you, we can figure out what is different between your environment and
 QA.  And then, ideally, we can extend s3-tests to reproduce the failure
 that you are seeing with the patch reverted so that we can confirm it is
 fixed (and stays fixed).

I just tried some tests on various revisions. It's not s3-tests,
because I couldn't figure out how to run just the relevant test
(trying to select only the multipart test yields 0 tests run ... and
the full suite fails because I run lighttpd and some error codes are
different, and it also takes forever).

So I have my own "create bucket / upload multipart / download it back /
delete bucket" test with just boto in python.
It's part of the test I use internally to validate S3 before deploying
it to prod (it tests all the stuff we use, including some weird stuff).
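
[Editor's note: a stripped-down sketch of that kind of boto test is below; the
endpoint and credentials are placeholders, and the real internal test checks
considerably more than this.]

import io
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',            # placeholders
    aws_secret_access_key='SECRET_KEY',
    host='rgw.example.com', is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

bucket = conn.create_bucket('mp-test')
part = b'x' * (5 * 1024 * 1024)                # parts must be >= 5MB except the last
mp = bucket.initiate_multipart_upload('bigobj')
mp.upload_part_from_file(io.BytesIO(part), part_num=1)
mp.upload_part_from_file(io.BytesIO(part), part_num=2)
mp.complete_upload()

key = bucket.get_key('bigobj')
assert key.size == 2 * len(part)
assert key.get_contents_as_string() == part * 2

bucket.delete_key('bigobj')
conn.delete_bucket('mp-test')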

I know the test fails on 0.80.1. I also know it works fine on our own
internal package of firefly (which is based on some intermediate
version of the firefly branch between 0.80.1 and 0.80.2, plus some patches I
backported myself).


* 8589c108dd5f00471986deae72d88811ccf0627f  (which should be 0.80.3)
   - Fails to do the multipart upload

* Re-applying b300318113b162522759d4794b1cfa3d1d9398e4 over it
   - Actually crashes / segfault ... so yeah, really not good.

So I tried going back to d43e7113dd501aea1db33fdae30d56e96e9c3897
which is the revision I based my packages on. And I applied
b300318113b162522759d4794b1cfa3d1d9398e4.
   - Multipart upload works and no crash!


So something between d43e7113dd501aea1db33fdae30d56e96e9c3897 and
8589c108dd5f00471986deae72d88811ccf0627f is causing the fix to
misbehave ...


Cheers,

Sylvain


Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose

2014-07-14 Thread Haomai Wang
Hi all,

I have watched the discussion video from the Ceph CDS. By the way, sorry for
my absence; something urgent came up.

It seems we have two ways to implement this: one lightweight, one more
complex. I like the simple one, which prefers invalidating the cache and
letting librbd reload/lazy-load the object state. The most important piece is
implementing a performance-optimized index (ObjectMap).

Has there been any progress, Josh? I think we could push this further based on
the discussion. Or did I miss something?
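
[Editor's note: to make the coarse object-existence index discussed in this
thread concrete, here is a toy sketch with one bit per RADOS object. It only
illustrates the bookkeeping; it is not librbd's actual design or on-disk
format.]

class ObjectExistenceMap(object):
    """Toy bitmap: bit i is set once object i of the image may contain data."""

    def __init__(self, image_size, order=22):  # order 22 => 4 MiB objects
        self.object_size = 1 << order
        nobjects = (image_size + self.object_size - 1) // self.object_size
        self.bits = bytearray((nobjects + 7) // 8)

    def _index(self, offset):
        return offset // self.object_size

    def mark_written(self, offset, length):
        # Set the bit for every object the write [offset, offset+length) touches.
        for i in range(self._index(offset), self._index(offset + length - 1) + 1):
            self.bits[i // 8] |= 1 << (i % 8)

    def may_exist(self, offset):
        # With a clean (durably flushed) map, a clear bit means the object was
        # never written, so a clone read can go straight to the parent (or
        # return zeros) without an extra lookup.
        i = self._index(offset)
        return bool(self.bits[i // 8] & (1 << (i % 8)))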



On Wed, Jun 11, 2014 at 12:01 PM, Gregory Farnum g...@inktank.com wrote:
 We discussed a great deal of this during the initial format 2 work as
 well, when we were thinking about having bitmaps of allocated space.
 (Although we also have interval sets which might be a better fit?) I
 think there was more thought behind it than is in the copy-on-read
 blueprint; do you know if we have it written down anywhere, Josh?
 -Greg

 On Tue, Jun 10, 2014 at 12:38 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On Tue, 10 Jun 2014 14:52:54 +0800
 Haomai Wang haomaiw...@gmail.com wrote:

 Thanks, Josh!

 Your points are really helpful. Maybe we can schedule this blueprint for the
 upcoming CDS? I hope the implementation can have a big performance impact
 on librbd.

 It'd be great to discuss it more at CDS. Could you add a blueprint for
 it on the wiki:

 https://wiki.ceph.com/Planning/Blueprints/Submissions

 Josh

 On Tue, Jun 10, 2014 at 9:16 AM, Josh Durgin
 josh.dur...@inktank.com wrote:
  On 06/05/2014 12:01 AM, Haomai Wang wrote:
  Hi,
  Previously I sent a mail about the difficulty of rbd snapshot size
  statistics. The main solution is using an object map to store the
  changes. The problem is that we can't handle concurrent modification
  by multiple clients.
 
  The lack of an object map (like the pointer map in qcow2) causes many
  problems in librbd, such as clone depth: a deep clone chain causes
  remarkable latency. Usually each clone layer roughly doubles the
  latency.
 
  I am considering a tradeoff between multi-client support and
  single-client support for librbd. In practice, most volumes/images are
  used by VMs, where only one client accesses/modifies the image. We
  shouldn't make shared images possible at the cost of making the most
  common use cases worse. So we can add a new flag called shared when
  creating an image. If shared is false, librbd will maintain an object
  map for each image. The object map is considered durable: each
  image_close call will store the map into rados. If the client crashes
  and fails to dump the object map, the next client to open the image
  will treat the object map as out of date and reset it.
 
  We can easily see the advantages of this feature:
  1. Avoid the clone performance problem
  2. Make snapshot statistics possible
  3. Improve librbd operation performance, including read and
  copy-on-write operations.
 
  What do you think of the above? More feedback is appreciated!
 
  I think it's a great idea! We discussed this a little at the last
  cds [1]. I like the idea of the shared flag on an image. Since the
  vastly more common case is single-client, I'd go further and
  suggest that we treat images as if shared is false by default if
  the flag is not present (perhaps with a config option to change
  this default behavior).
 
  That way existing images can benefit from the feature without extra
  configuration. There can be an rbd command to toggle the shared
  flag as well, so users of ocfs2 or gfs2 or other
  multi-client-writing systems can upgrade and set shared to true
  before restarting their clients.
 
  Another thing to consider is the granularity of the object map. The
  coarse granularity of a bitmap of object existence would be
  simplest, and most useful for in-memory comparison for clones. For
  statistics it might be desirable in the future to have a
  finer-grained index of data existence in the image. To make that
  easy to handle, the on-disk format could be a list of extents (byte
  ranges).
 
  Another potential use case would be a mode in which the index is
  treated as authoritative. This could make discard very fast, for
  example. I'm not sure it could be done safely with only binary
  'exists/does not exist' information though - a third 'unknown' state
  might be needed for some cases. If this kind of index is actually
  useful (I'm not sure there are cases where the performance penalty
  would be worth it), we could add a new index format if we need it.
 
  Back to the currently proposed design, to be safe with live
  migration we'd need to make sure the index is consistent in the
  destination process. Using rados_notify() after we set the clean
  flag on the index can make the destination vm re-read the index
  before any I/O happens. This might be a good time to introduce a
  data payload to the notify as well, so we can only re-read the
  index, instead of all the header metadata. Rereading the index
  after cache invalidation and wiring that up through qemu's
  

ceph branch status

2014-07-14 Thread ceph branch robot
-- All Branches --

Alfredo Deza alfredo.d...@inktank.com
2013-09-27 10:33:52 -0400   wip-5900
2014-07-08 13:58:35 -0400   wip-8679

Dan Mick dan.m...@inktank.com
2013-07-16 23:00:06 -0700   wip-5634

David Zafman david.zaf...@inktank.com
2014-06-05 00:22:34 -0700   wip-8231

Greg Farnum g...@inktank.com
2013-02-13 14:46:38 -0800   wip-mds-snap-fix
2013-02-22 19:57:53 -0800   wip-4248-snapid-journaling
2013-09-30 14:37:49 -0700   wip-filestore-test
2013-10-09 13:31:38 -0700   cuttlefish-4832
2013-11-15 14:41:51 -0800   wip-librados-command2
2013-12-09 16:21:41 -0800   wip-hitset-snapshots
2014-01-29 08:44:01 -0800   wip-filestore-fast-lookup
2014-04-28 14:51:59 -0700   wip-messenger-locking
2014-05-20 13:36:10 -0700   wip-xattr-spillout-basic
2014-05-29 14:54:29 -0700   wip-client-fast-dispatch
2014-06-16 14:57:41 -0700   wip-8519-osd-unblocking
2014-07-01 15:19:21 -0700   dumpling

Ilya Dryomov ilya.dryo...@inktank.com
2014-06-27 14:51:39 +0400   wip-dis-tmp

James Page james.p...@ubuntu.com
2013-02-27 22:50:38 +   wip-debhelper-8

Jenkins jenk...@inktank.com
2014-07-09 19:32:26 +   last

Joao Eduardo Luis joao.l...@inktank.com
2013-04-18 00:01:24 +0100   wip-4521-tool
2013-04-22 15:14:28 +0100   wip-4748
2013-04-24 16:42:11 +0100   wip-4521
2013-04-30 18:45:22 +0100   wip-mon-compact-dbg
2013-05-21 01:46:13 +0100   wip-monstoretool-foo
2013-05-31 16:26:02 +0100   wip-mon-cache-first-last-committed
2013-05-31 21:00:28 +0100   wip-mon-trim-b
2013-07-20 04:30:59 +0100   wip-mon-caps-test
2013-07-23 16:21:46 +0100   wip-5704-cuttlefish
2013-07-23 17:35:59 +0100   wip-5704
2013-08-02 22:54:42 +0100   wip-5648
2013-08-12 11:21:29 -0700   wip-store-tool.cuttlefish
2013-09-25 22:08:24 +0100   wip-6378
2013-10-10 14:06:59 +0100   wip-mon-set-pspool
2013-12-09 16:39:19 +   wip-mon-mdsmap-trim.dumpling
2013-12-18 22:17:09 +   wip-monstoretool-genmdsmaps
2014-01-17 17:11:59 -0800   wip-fix-pipe-comment-for-fhaas
2014-02-02 14:10:39 +   wip-7277.for-loic
2014-04-04 22:32:41 +0100   wip-mon-fix
2014-04-21 15:55:28 +0100   wip-7514
2014-04-22 17:58:58 +0100   wip-8165-joao
2014-06-24 23:16:17 +0100   wip-8624-with-amazing-foo
2014-07-11 16:06:02 +0100   wip-8696.with-test-mdsfixes

John Spray jcsp...@gmail.com
2014-03-03 13:10:05 +   wip-mds-stop-rank-0

John Spray john.sp...@inktank.com
2014-03-06 13:01:25 +   wip-mds-debug
2014-04-01 15:06:19 +0100   wip-5382

John Spray john.sp...@redhat.com
2014-06-25 22:54:13 -0400   wip-mds-sessions
2014-07-09 12:43:04 +0100   wip-vstart-test
2014-07-09 17:08:55 +0100   wip-mds-session-asok
2014-07-10 17:15:46 +0100   wip-set_layout
2014-07-14 14:15:27 +0100   wip-mds-session-asok-squash

John Spray jsp...@redhat.com
2014-07-08 20:03:21 +0100   wip-layout-test

John Wilkins john.wilk...@inktank.com
2013-07-31 18:00:50 -0700   wip-doc-rados-python-api
2014-07-03 07:31:14 -0700   wip-doc-rgw-federated

Josh Durgin josh.dur...@inktank.com
2013-03-01 14:45:23 -0800   wip-rbd-workunit-debug
2013-07-25 18:44:10 -0700   wip-5488-2
2013-08-14 15:51:04 -0700   wip-5970
2013-08-27 12:03:08 -0700   wip-krbd-workunits
2013-11-22 15:17:08 -0800   wip-zero-copy-bufferlist
2013-11-25 13:59:29 -0800   wip-init-highlander
2013-12-17 08:16:59 -0800   wip-rbd-deadlock-lockdep
2013-12-18 12:28:39 -0800   wip-rbd-deadlock-lockdep-dumpling
2013-12-26 18:06:39 -0800   emperor-5426
2013-12-26 18:07:13 -0800   dumpling-5426
2014-02-06 20:31:43 -0800   wip-librados-obj-ops
2014-02-06 20:31:47 -0800   wip-librados-op-rvals
2014-03-03 14:27:39 -0800   wip-object-cacher-memory
2014-04-08 15:13:44 -0700   wip-6480-0.67.7
2014-04-27 14:13:07 -0700   wip-rbd-header-dump
2014-06-06 15:40:12 -0700   wip-libradosstriper
2014-07-08 14:23:05 -0700   wip-rgw-firefly-3
2014-07-08 17:26:30 -0700   wip-rocksdb

Loic Dachary l...@dachary.org
2014-07-02 21:09:59 +0200   wip-8475
2014-07-03 12:46:15 +0200   wip-7238-lrc

Matt Benjamin m...@linuxbox.com
2013-10-08 16:49:23 -0400   wip-libcephfs-emp-rb

Mohammad Salehe salehe+...@gmail.com
2014-05-27 18:58:53 +0430   wip-pybind-fixes

Noah Watkins noah.watk...@inktank.com
2013-01-05 11:58:38 -0800   wip-localized-read-tests
2013-10-18 15:42:50 -0700   cls-lua
2013-11-05 07:30:19 -0800   port/old
2013-11-06 08:39:57 -0800   wip-6636
2013-11-26 08:26:24 -0800   wip-boost-uuid
2013-12-30 09:47:40 -0800   

Re: v0.80.2?

2014-07-14 Thread Sylvain Munaut
Here's the culprit IMHO:


ea68b9372319fd0bab40856db26528d36359102e


rgw: don't allow multiple writers to same multiobject part

Fixes: #8269
Backport: firefly, dumpling

A client might need to retry a multipart part write. The original thread
might race with the new one, trying to clean up after it, clobbering the
part's data.
The fix is to detect whether an original part already existed, and if so
use a different part name for it.

Signed-off-by: Yehuda Sadeh yeh...@inktank.com
(cherry picked from commit bd8e026f88b812cc70caf6232c247844df5d99bf)


This is the first commit that, if you apply b300318113b162522759d4794b1cfa3d1d9398e4
over it, will cause a segfault.


Cheers,

   Sylvain


Re: consequences of XFS_IOC_FSSETXATTR on non-empty file?

2014-07-14 Thread Dave Chinner
On Mon, Jul 14, 2014 at 11:24:05AM +0400, Ilya Dryomov wrote:
 On Mon, Jul 14, 2014 at 2:55 AM, Dave Chinner da...@fromorbit.com wrote:
  On Sun, Jul 13, 2014 at 09:01:13PM +0400, Ilya Dryomov wrote:
  On Sun, Jul 13, 2014 at 5:48 AM, Samuel Just sam.j...@inktank.com wrote:
  I think if you call fsync() or even
  fdatasync() before close(fd), it will fail as expected.
 
  Only if you are trying to change the extent size immediately after
  the first write you do to an empty file. Which is, as per the above,
  not the recommended or intended use of the ioctl.
 
 That's understood, but that is exactly what Sam's test program happens
 to try to do, so I had to point out the "file w/o real extents" detail.

Oh, I missed that there was a test program attached. That's what
happens when people top-post a reply and then attach text at the
bottom...

Moral of the story: don't top-post ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


rpm dependencies

2014-07-14 Thread Sage Weil
Hey Luis,

I pushed wip-dencoder, which moves ceph-dencoder from ceph-common into ceph.
This avoids the dependency for ceph-common.

For the librados2 package, though, there is still a leveldb dependency
that makes no sense.  See the packages at

 http://gitbuilder.ceph.com/ceph-rpm-rhel6_5-x86_64-basic/ref/wip-dencoder/x86_64/

 $ wget http://gitbuilder.ceph.com/ceph-rpm-rhel6_5-x86_64-basic/ref/wip-dencoder/x86_64/librados2-0.82-595.g95f5a44.el6.x86_64.rpm
 $ rpm -qpR librados2-0.82-595.g95f5a44.el6.x86_64.rpm | grep level
 warning: librados2-0.82-595.g95f5a44.el6.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 03c3951a: NOKEY
 libleveldb.so.1()(64bit)
 $ rpm2cpio librados2-0.82-595.g95f5a44.el6.x86_64.rpm | cpio -i --make-directories
 ...
 $ ldd usr/lib/librados2.so.2 | grep -c leveldb
 1

Our Makefile.am doesn't specify that we link against leveldb for librados, 
although strangely it is listed in the librados.la file anyway (along with 
every other lib anything links against).  Despite this, ldd 
.libs/librados.so on my (debian) box doesn't list it when building 
locally.

The Makefile-env.am has --as-needed, which is supposed to make us link only 
against libraries we actually use.

Not sure what is going on with the RPM build that is putting this 
dependency in place...

Any idea what is going on?  :/

sage
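
[Editor's note: one way to tell a direct DT_NEEDED entry apart from something
ldd merely resolves transitively is to inspect the library's dynamic section.
A rough sketch below; the rpm name and extracted library path are taken from
the transcript above and may differ.]

import subprocess

rpm = 'librados2-0.82-595.g95f5a44.el6.x86_64.rpm'
lib = 'usr/lib/librados2.so.2'   # path as used in the ldd command above

# What the package metadata declares:
declared = subprocess.check_output(['rpm', '-qpR', rpm])
print([l for l in declared.splitlines() if b'leveldb' in l])

# What the shared object itself directly requires (DT_NEEDED entries only;
# ldd also shows transitive dependencies):
dynamic = subprocess.check_output(['readelf', '-d', lib])
print([l for l in dynamic.splitlines() if b'NEEDED' in l])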


Read from clones

2014-07-14 Thread Lakshminarayana Mavunduri
Hi,

Following the set of steps below, we are seeing data loss while reading from
clones.

1)Create an image with image format 2 (in this case we made the size 1024MB).
rbd create image1 --size 1024 --image-format 2
2)Map the image and write 1024MB worth of data to it.
3)Create a snapshot for the image created in step 1)
rbd snap create image1@snap1
4)Create a clone for the snapshot created in step 3)
rbd clone image1@snap1 clone1
5)Create a snapshot for the clone created in step 4)
rbd snap create clone1@snap2
6)Create a clone for the snapshot created in step 5)
rbd clone clone1@snap2 clone2
7)Shrink the size of the clone created in step 4) (in this case we made it half 
of its size)
rbd resize -s 512 --allow-shrink clone1
8)Map the clone created in step 6) and try reading 1024MB worth of data.
9)Our observation is that only the first 512MB worth of data is intact; the rest 
is not copied over. (In fact, it is only clone1's parent-overlap worth of data 
that is ever copied over!) A scripted version of these steps is sketched below.
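
[Editor's note: a scripted version of the steps above, using the rbd python
bindings instead of mapping the images, is sketched below. The pool name,
conffile path and fill pattern are assumptions; error handling and snapshot
cleanup are omitted.]

import rados, rbd

MB = 1024 * 1024
CHUNK = 4 * MB
PATTERN = b'\xab' * CHUNK

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
r = rbd.RBD()

# Steps 1-3: format-2 image, fill it with a known pattern, snapshot it.
r.create(ioctx, 'image1', 1024 * MB, old_format=False,
         features=rbd.RBD_FEATURE_LAYERING)
img = rbd.Image(ioctx, 'image1')
for off in range(0, 1024 * MB, CHUNK):
    img.write(PATTERN, off)
img.create_snap('snap1')
img.protect_snap('snap1')        # cloning requires a protected snapshot
img.close()

# Steps 4-6: clone1 from image1@snap1, snapshot it, clone2 from clone1@snap2.
r.clone(ioctx, 'image1', 'snap1', ioctx, 'clone1',
        features=rbd.RBD_FEATURE_LAYERING)
c1 = rbd.Image(ioctx, 'clone1')
c1.create_snap('snap2')
c1.protect_snap('snap2')
c1.close()
r.clone(ioctx, 'clone1', 'snap2', ioctx, 'clone2',
        features=rbd.RBD_FEATURE_LAYERING)

# Step 7: shrink clone1 after clone2's parent snapshot was taken.
c1 = rbd.Image(ioctx, 'clone1')
c1.resize(512 * MB)
c1.close()

# Steps 8-9: read clone2 back and verify the full 1024MB.
c2 = rbd.Image(ioctx, 'clone2')
for off in range(0, 1024 * MB, CHUNK):
    if c2.read(off, CHUNK) != PATTERN:
        print('mismatch at offset %d MB' % (off // MB))
        break
c2.close()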

After the above set of steps, the parent overlap for clone2 would be 1024MB, 
whereas the parent overlap for clone1 would be 512MB. Our understanding is that, 
since clone2's parent snapshot was taken before the shrink was performed on 
clone1, any reads within clone2's parent overlap should be serviced from its 
parent (at least as long as there are no overwrites on clone2, which is the 
case here), and we are not finding that to be true.

To support our theory: if the parent image (a base RBD image, not a clone) is 
shrunk, reads on clones that were created before the shrink behave as we 
expect.

We wanted to check whether this is indeed a bug or whether we are missing 
something here. The tests were run against ceph version 0.80.

Thanks,
M.Laks...




