[radosgw] Race condition corrupting data on COPY ?

2013-03-18 Thread Sylvain Munaut
Hi,


I've just noticed something rather worrying on our cluster.

Some files are apparently truncated. From the first look I had at it,
it happened on files where there was a metadata update right after the
file was stored. The exact sequence was:

 - PUT to store the file
 - GET to get the file (which at that point is still correct and has
the proper length)
 - PUT using a 'copy source' over itself to update the metadata

all of these happening sequentially in the same second, very quickly.

Then subsequent GETs return a truncated file.
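For reference, that sequence in Boto 2 looks roughly like the sketch below
(bucket/key names, credentials and metadata are made up, not the actual
import code):

from boto.s3.connection import S3Connection, OrdinaryCallingFormat

conn = S3Connection('ACCESS_KEY', 'SECRET_KEY', host='s3.svc',
                    is_secure=False, calling_format=OrdinaryCallingFormat())
bucket = conn.get_bucket('rb')
payload = 'x' * 622080                       # stand-in for the real file

key = bucket.new_key('some-object')
key.set_contents_from_string(payload)        # 1) PUT to store the file

# 2) GET right away: the length is still correct at this point
assert len(key.get_contents_as_string()) == len(payload)

# 3) PUT with 'x-amz-copy-source' pointing at the object itself;
#    passing metadata makes boto ask for a metadata REPLACE on the copy
key.copy(bucket.name, key.name, metadata={'purpose': 'example'},
         preserve_acl=True)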


I'm looking into it to narrow down the issue but I wanted to know if
anyone had seen something similar?


Cheers,

 Sylvain


Re: [radosgw] Race condition corrupting data on COPY ?

2013-03-18 Thread Yehuda Sadeh
On Mon, Mar 18, 2013 at 2:50 AM, Sylvain Munaut
s.mun...@whatever-company.com wrote:
 Hi,


 I've just noticed something rather worrying on our cluster.

 Some files are apparently truncated. From the first look I had at it,
 it happened on files where there was a metadata update right after the
 file was stored. The exact sequence was:

  - PUT to store the file
  - GET to get the file (which at that point is still correct and has
 the proper length)
  - PUT using a 'copy source' over itself to update the metadata

 all of these happening sequentially in the same second, very quickly.

 Then subsequent GETs return a truncated file.


 I'm looking into it to narrow down the issue but I wanted to know if
 anyone had seen something similar?


What version are you using? Do you have logs?

Thanks,
Yehuda


Re: CephFS Space Accounting and Quotas

2013-03-18 Thread Jim Schutt
On 03/15/2013 05:17 PM, Greg Farnum wrote:
 [Putting list back on cc]
 
 On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:
 
 On 03/15/2013 04:23 PM, Greg Farnum wrote:
 As I come back and look at these again, I'm not sure what the context
 for these logs is. Which test did they come from, and which behavior
 (slow or not slow, etc) did you see? :) -Greg



 They come from a test where I had debug mds = 20 and debug ms = 1
 on the MDS while writing files from 198 clients. It turns out that 
 for some reason I need debug mds = 20 during writing to reproduce
 the slow stat behavior later.

 strace.find.dirs.txt.bz2 contains the log of running 
 strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls 
 -lhd {} \;

 From that output, I believe that the stat of at least these files is slow:
 zero0.rc11
 zero0.rc30
 zero0.rc46
 zero0.rc8
 zero0.tc103
 zero0.tc105
 zero0.tc106
 I believe that log shows slow stats on more files, but those are the first 
 few.

 mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
 find command started, until just after the fifth or sixth slow stat from
 the list above.

 I haven't yet tried to find other ways of reproducing this, but so far
 it appears that something happens during the writing of the files that
 ends up causing the condition that results in slow stat commands.

 I have the full MDS log from the writing of the files, as well, but it's
 big

 Is that what you were after?

 Thanks for taking a look!

 -- Jim
 
 I just was coming back to these to see what new information was
 available, but I realized we'd discussed several tests and I wasn't
 sure what these ones came from. That information is enough, yes.
 
 If in fact you believe you've only seen this with high-level MDS
 debugging, I believe the cause is as I mentioned last time: the MDS
 is flapping a bit and so some files get marked as needsrecover, but
 they aren't getting recovered asynchronously, and the first thing
 that pokes them into doing a recover is the stat.

OK, that makes sense.

 That's definitely not the behavior we want and so I'll be poking
 around the code a bit and generating bugs, but given that explanation
 it's a bit less scary than random slow stats are so it's not such a
 high priority. :) Do let me know if you come across it without the
 MDS and clients having had connection issues!

No problem - thanks!

-- Jim


 -Greg
 
 Software Engineer #42 @ http://inktank.com | http://ceph.com
 
 
 




Re: [radosgw] Race condition corrupting data on COPY ?

2013-03-18 Thread Sylvain Munaut
Hi,


 What version are you using? Do you have logs?

I'm running a custom build of 0.56.3 + some patches (basically up
to 7889c5412 + fixes for #4150 and #4177).

I don't have any radosgw logs (the debug level is set to 0 and it didn't
output anything).
I have the HTTP logs:

10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] PUT
/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34
HTTP/1.1 200 0 - Boto/2.6.0 (linux2)
10.0.0.74 s3.svc - [14/Mar/2013:09:23:14 +0000] GET
/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363256594&AWSAccessKeyId=XXX
HTTP/1.1 200 622080 - python-requests
10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] PUT
/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34
HTTP/1.1 200 146 - Boto/2.6.0 (linux2)
10.0.0.74 s3.svc - [14/Mar/2013:10:14:53 +0000] GET
/rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363258236&AWSAccessKeyId=XXX
HTTP/1.1 200 461220 - python-requests


Cheers,

   Sylvain


Re: Ceph availability test recovering question

2013-03-18 Thread Andrey Korolyov
Hello,

I'm experiencing the same long-standing problem - during recovery ops, some
percentage of read I/O remains in-flight for seconds, rendering the
upper-level filesystem on the qemu client very slow and almost
unusable. Different striping has almost no effect on the visible delays,
and even reads that are not intensive at all are still very slow.

Here are some fio results for randread with small blocks, so it is not
affected by readahead the way a linear read would be:

Intensive reads during recovery:
lat (msec) : 2=0.01%, 4=0.08%, 10=1.87%, 20=4.17%, 50=8.34%
lat (msec) : 100=13.93%, 250=2.77%, 500=1.19%, 750=25.13%, 1000=0.41%
lat (msec) : 2000=15.45%, >=2000=26.66%

same on healthy cluster:
lat (msec) : 20=0.33%, 50=9.17%, 100=23.35%, 250=25.47%, 750=6.53%
lat (msec) : 1000=0.42%, 2000=34.17%, >=2000=0.56%
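The exact fio job isn't shown here; something along these lines (a guess at
the parameters, with --filename pointing at the RBD-backed disk inside the
guest) produces that kind of 4k randread latency histogram:

fio --name=randread --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
    --ioengine=libaio --direct=1 --runtime=120 --group_reporting \
    --filename=/dev/vdb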


On Sun, Mar 17, 2013 at 8:18 AM,  kelvin_hu...@wiwynn.com wrote:
 Hi, all

 I have some problems after an availability test

 Setup:
 Linux kernel: 3.2.0
 OS: Ubuntu 12.04
 Storage server : 11 HDD (each storage server has 11 osd, 7200 rpm, 1T) + 
 10GbE NIC
 RAID card: LSI MegaRAID SAS 9260-4i  For every HDD: RAID0, Write Policy: 
 Write Back with BBU, Read Policy: ReadAhead, IO Policy: Direct
 Storage server number : 2

 Ceph version : 0.48.2
 Replicas : 2
 Monitor number:3


 We have two storage servers as a cluster, then use a ceph client to create a
 1T RBD image for testing; the client also
 has a 10GbE NIC, Linux kernel 3.2.0, Ubuntu 12.04

 We also use fio to produce the workload

 fio command:
 [Sequential Read]
 fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=read
 --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10
 --thinktime=10

 [Sequential Write]
 fio --iodepth=32 --numjobs=1 --runtime=120 --bs=65536 --rw=write
 --ioengine=libaio --group_reporting --direct=1 --eta=always --ramp_time=10
 --thinktime=10


 Now I want to observe the ceph state when one storage server crashes, so I
 turn off the networking of one storage server.
 We expect that read and write operations can quickly resume, or not be
 suspended at all, while ceph recovers, but the experimental results show
 that reads and writes will pause for about 20~30 seconds during
 ceph recovery.

 My question is:
 1. Is the I/O pause normal while ceph is recovering?
 2. Can the I/O pause be avoided while ceph is recovering?
 3. How can the I/O pause time be reduced?


 Thanks!!


Re: [radosgw] Race condition corrupting data on COPY ?

2013-03-18 Thread Yehuda Sadeh
On Mon, Mar 18, 2013 at 7:40 AM, Sylvain Munaut
s.mun...@whatever-company.com wrote:
 Hi,


 What version are you using? Do you have logs?

 I'm running a custom build of 0.56.3 + some patches (basically up
 to 7889c5412 + fixes for #4150 and #4177).

 I don't have any radosgw logs (the debug level is set to 0 and it didn't
 output anything).
 I have the HTTP logs:

 10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] PUT
 /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34
 HTTP/1.1 200 0 - Boto/2.6.0 (linux2)
 10.0.0.74 s3.svc - [14/Mar/2013:09:23:14 +0000] GET
 /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363256594&AWSAccessKeyId=XXX
 HTTP/1.1 200 622080 - python-requests
 10.0.0.253 s3.svc - [14/Mar/2013:09:23:14 +0000] PUT
 /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34
 HTTP/1.1 200 146 - Boto/2.6.0 (linux2)
 10.0.0.74 s3.svc - [14/Mar/2013:10:14:53 +0000] GET
 /rb/138e6898a8039db16df2146398626f0303ae3e97427fdad33c95b6034f690b34?Signature=XXX%3D&Expires=1363258236&AWSAccessKeyId=XXX
 HTTP/1.1 200 461220 - python-requests


Can't make much out of it, will probably need rgw logs (and preferably
also with 'debug ms = 1') for this issue.
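For reference, that kind of gateway logging is usually enabled with
something like this in ceph.conf (the section name below is only the common
convention for a radosgw instance; use whatever the gateway is actually
called):

[client.radosgw.gateway]
    debug rgw = 20
    debug ms = 1
    log file = /var/log/ceph/radosgw.log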

Yehuda


Re: [radosgw] Race condition corrupting data on COPY ?

2013-03-18 Thread Sylvain Munaut
Hi,

 Can't make much out of it, will probably need rgw logs (and preferably
 with also 'debug ms = 1') for this issue.

Well, the problem is that I can't make it happen again ... it happened
4 times during an import of ~3000 files ... I'm trying to reproduce
this on a test cluster but so far, no luck. I'll give it another shot
tomorrow.
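The reproduction attempt is basically a loop like the following against the
test cluster (a sketch; bucket and object names are made up, Boto 2 as in
the logs above):

from boto.s3.connection import S3Connection, OrdinaryCallingFormat

conn = S3Connection('ACCESS_KEY', 'SECRET_KEY', host='s3.svc',
                    is_secure=False, calling_format=OrdinaryCallingFormat())
bucket = conn.get_bucket('copy-race-test')
payload = 'x' * 622080

for i in range(3000):
    name = 'obj-%05d' % i
    key = bucket.new_key(name)
    key.set_contents_from_string(payload)                 # PUT
    key.get_contents_as_string()                          # GET straight away
    key.copy(bucket.name, name, metadata={'i': str(i)})   # COPY over itself
    size = bucket.get_key(name).size                      # re-stat the object
    if size != len(payload):
        print('truncated: %s (%d != %d)' % (name, size, len(payload)))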

And I can't enable debug on prod for long periods; the space for logs
is limited and would be filled in minutes with all the requests. I've
also disabled the use of copy in production anyway, because I can't
have it corrupting random customer files.


Cheers,

Sylvain


corruption of active mmapped files in btrfs snapshots

2013-03-18 Thread Alexandre Oliva
For quite a while, I've experienced oddities with snapshotted Firefox
_CACHE_00?_ files, whose checksums (and contents) would change after the
btrfs snapshot was taken, and would even change depending on how the
file was brought to memory (e.g., rsyncing it to backup storage vs
checking its md5sum before or after the rsync).  This only affected
these cache files, so I didn't give it too much attention.

A similar problem seems to affect the leveldb databases maintained by
ceph within the periodic snapshots it takes of its object storage
volumes.  I'm told others using ceph on filesystems other than btrfs are
not observing this problem, which makes me think it's not memory
corruption within ceph itself.  I've looked into this for a bit, and I'm
now inclined to believe it has to do with some bad interaction of mmap
and snapshots; I'm not sure the fact that the filesystem has compression
enabled has any effect, but that's certainly a possibility.

leveldb does not modify file contents once they're initialized, it only
appends to files, ftruncate()ing them to about a MB early on, mmap()ping
that in and memcpy()ing blocks of various sizes to the end of the output
buffer, occasionally msync()ing the maps, or running fdatasync if it
didn't msync a map before munmap()ping it.  If it runs out of space in a
map, it munmap()s the previously mapped range, truncates the file to a
larger size, then maps in the new tail of the file, starting at the page
it should append to next.
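A rough Python rendering of that access pattern (just the shape described
above with made-up sizes, not leveldb's actual code):

import mmap, os

PAGE = mmap.PAGESIZE
INITIAL = 1 << 20                        # ~1 MB, as described above

fd = os.open('dbfile', os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
os.ftruncate(fd, INITIAL)
map_off = 0                              # file offset where the current map starts
buf = mmap.mmap(fd, INITIAL, mmap.MAP_SHARED,
                mmap.PROT_READ | mmap.PROT_WRITE)
pos = 0                                  # append position inside the current map

def append(block):
    global buf, map_off, pos
    if pos + len(block) > len(buf):      # out of room in the current map
        buf.flush()                      # msync the old map
        buf.close()                      # munmap it
        end = map_off + pos              # absolute end-of-data offset in the file
        map_off = end - end % PAGE       # new map starts at the page we append to
        size = max(end + len(block), map_off + INITIAL)
        os.ftruncate(fd, size)           # grow the file
        buf = mmap.mmap(fd, size - map_off, mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE, offset=map_off)
        pos = end - map_off
    buf[pos:pos + len(block)] = block    # memcpy into the shared mapping
    pos += len(block)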

What I'm observing is that some btrfs snapshots taken by ceph osds,
containing the leveldb database, are corrupted, causing crashes during
the use of the database.

I've scripted regular checks of osd snapshots, saving the
last-known-good database along with the first one that displays the
corruption.  Studying about two dozen failures over the weekend, that
took place on all of 13 btrfs-based osds on 3 servers running btrfs as
in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
similar pattern: a stream of NULs of varying sizes at the end of a page,
starting at a block boundary (leveldb doesn't do page-sized blocking, so
blocks can start anywhere in a page), and ending close to the beginning
of the next page, although not exactly at the page boundary; 20 bytes
past the page boundary seemed to be the most common size, but the
occasional presence of NULs in the database contents makes it harder to
tell for sure.
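The scripted check boils down to something like this (a simplified sketch,
not the actual script): flag runs of NULs that cross a page boundary and
stop shortly after it.

import re, sys

PAGE = 4096

def suspicious_nul_runs(path, min_run=64, slack=64):
    data = open(path, 'rb').read()
    for m in re.finditer(b'\x00{%d,}' % min_run, data):
        start, end = m.span()
        boundary = (end // PAGE) * PAGE          # last page boundary before the run ends
        if start < boundary and 0 < end - boundary <= slack:
            print('%s: NULs %d..%d end %d bytes past a page boundary'
                  % (path, start, end, end - boundary))

for p in sys.argv[1:]:
    suspicious_nul_runs(p)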

The stream of NULs ended in the middle of a database block (meaning it
was not the beginning of a subsequent database block written later; the
beginning of the database block was partially replaced with NULs).
Furthermore, the checksum fails to match on this one partially-NULed
block.  Since the checksum is computed just before the block and the
checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty
that the block was copied entirely to the right place at some point, and
if part of it became zeros, it's either because the modification was
partially lost, or because the mmapped buffer was partially overwritten.
The fact that all instances of corruption I looked at were correct right
to the end of one block boundary, and then all zeros instead of the
beginning of the subsequent block to the end of that page, makes a
failure to write that modified page seem more likely in my mind (more so
given the Firefox _CACHE_ file oddities in snapshots); intense memory
pressure at the time of the corruption also seems to favor this
possibility.

Now, it could be that btrfs requires those who modify SHARED mmap()ed
files so as to make sure that data makes it to a subsequent snapshot,
along the lines of msync MS_ASYNC, and leveldb does not take this sort
of precaution.  However, I noticed that the unexpected stream of zeros
after a prior block and before the rest of the subsequent block
*remains* in subsequent snapshots, which to me indicates the page update
is effectively lost.  This explains why even the running osd, that
operates on the “current” subvolumes from which snapshots for recovery
are taken, occasionally crashes because of database corruption, and will
later fail to restart from an earlier snapshot due to that same
corruption.


Does this problem sound familiar to anyone else?

Should mmaped-file writers in general do more than umount or msync to
ensure changes make it to subsequent snapshots that are supposed to be
consistent?

Any tips on where to start looking so as to fix the problem, or even to
confirm that the problem is indeed in btrfs?


TIA,

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Re: Direct IO on CephFS for blocks larger than 8MB

2013-03-18 Thread Greg Farnum
On Saturday, March 16, 2013 at 5:38 AM, Henry C Chang wrote:
 The following patch should fix the problem.
 
 -Henry
 
 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index e51558f..4bcbcb6 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -608,7 +608,7 @@ out:
 pos += len;
 written += len;
 left -= len;
 - data += written;
 + data += len;
 if (left)
 goto more;

This looks good to me. If you'd like to submit it as a proper patch with a 
sign-off I'll pull it into our tree. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com




Re: Direct IO on CephFS for blocks larger than 8MB

2013-03-18 Thread Sage Weil
On Mon, 18 Mar 2013, Greg Farnum wrote:
 On Saturday, March 16, 2013 at 5:38 AM, Henry C Chang wrote:
  The following patch should fix the problem.
  
  -Henry
  
  diff --git a/fs/ceph/file.c b/fs/ceph/file.c
  index e51558f..4bcbcb6 100644
  --- a/fs/ceph/file.c
  +++ b/fs/ceph/file.c
  @@ -608,7 +608,7 @@ out:
  pos += len;
  written += len;
  left -= len;
  - data += written;
  + data += len;
  if (left)
  goto more;
 
 This looks good to me. If you'd like to submit it as a proper patch with a 
 sign-off I'll pull it into our tree. :)
 -Greg

I just added a quick test and it fixes it up.  :)

sage


Re: corruption of active mmapped files in btrfs snapshots

2013-03-18 Thread Alexandre Oliva
While I wrote the previous email, a smoking gun formed in one of my
servers: a snapshot that had passed a database consistency check turned
out to be corrupted when I tried to roll back to it!  Since the snapshot
was not modified in any way between the initial scripted check and the
later manual check, the problem must be in btrfs.

On Mar 18, 2013, Alexandre Oliva ol...@gnu.org wrote:

 I've scripted regular checks of osd snapshots, saving the
 last-known-good database along with the first one that displays the
 corruption.  Studying about two dozen failures over the weekend, that
 took place on all of 13 btrfs-based osds on 3 servers running btrfs as
 in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
 similar pattern: a stream of NULs of varying sizes at the end of a page,
 starting at a block boundary (leveldb doesn't do page-sized blocking, so
 blocks can start anywhere in a page), and ending close to the beginning
 of the next page, although not exactly at the page boundary; 20 bytes
 past the page boundary seemed to be the most common size, but the
 occasional presence of NULs in the database contents makes it harder to
 tell for sure.

Additional corrupted snapshots collected today have confirmed this
pattern, except that today I got several corrupted files with non-NULs
right at the beginning of the page following the one that marked the
beginning of the corrupted database block.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer


Re: corruption of active mmapped files in btrfs snapshots

2013-03-18 Thread Chris Mason
A few questions.  Does leveldb use O_DIRECT and mmap together? (the
source of a write being pages that are mmap'd from somewhere else)

That's the most likely place for this kind of problem.  Also, you
mention crc errors.  Are those reported by btrfs, or are they
application-level crcs?

Thanks for all the time you spent tracking it down this far.

-chris

Quoting Alexandre Oliva (2013-03-18 17:14:41)
 For quite a while, I've experienced oddities with snapshotted Firefox
 _CACHE_00?_ files, whose checksums (and contents) would change after the
 btrfs snapshot was taken, and would even change depending on how the
 file was brought to memory (e.g., rsyncing it to backup storage vs
 checking its md5sum before or after the rsync).  This only affected
 these cache files, so I didn't give it too much attention.
 
 A similar problem seems to affect the leveldb databases maintained by
 ceph within the periodic snapshots it takes of its object storage
 volumes.  I'm told others using ceph on filesystems other than btrfs are
 not observing this problem, which makes me think it's not memory
 corruption within ceph itself.  I've looked into this for a bit, and I'm
 now inclined to believe it has to do with some bad interaction of mmap
 and snapshots; I'm not sure the fact that the filesystem has compression
 enabled has any effect, but that's certainly a possibility.
 
 leveldb does not modify file contents once they're initialized, it only
 appends to files, ftruncate()ing them to about a MB early on, mmap()ping
 that in and memcpy()ing blocks of various sizes to the end of the output
 buffer, occasionally msync()ing the maps, or running fdatasync if it
 didn't msync a map before munmap()ping it.  If it runs out of space in a
 map, it munmap()s the previously mapped range, truncates the file to a
 larger size, then maps in the new tail of the file, starting at the page
 it should append to next.
 
 What I'm observing is that some btrfs snapshots taken by ceph osds,
 containing the leveldb database, are corrupted, causing crashes during
 the use of the database.
 
 I've scripted regular checks of osd snapshots, saving the
 last-known-good database along with the first one that displays the
 corruption.  Studying about two dozen failures over the weekend, that
 took place on all of 13 btrfs-based osds on 3 servers running btrfs as
 in 3.8.3(-gnu), I noticed that all of the corrupted databases had a
 similar pattern: a stream of NULs of varying sizes at the end of a page,
 starting at a block boundary (leveldb doesn't do page-sized blocking, so
 blocks can start anywhere in a page), and ending close to the beginning
 of the next page, although not exactly at the page boundary; 20 bytes
 past the page boundary seemed to be the most common size, but the
 occasional presence of NULs in the database contents makes it harder to
 tell for sure.
 
 The stream of NULs ended in the middle of a database block (meaning it
 was not the beginning of a subsequent database block written later; the
 beginning of the database block was partially replaced with NULs).
 Furthermore, the checksum fails to match on this one partially-NULed
 block.  Since the checksum is computed just before the block and the
 checksum trailer are memcpy()ed to the mmap()ed area, it is a certainty
 that the block was copied entirely to the right place at some point, and
 if part of it became zeros, it's either because the modification was
 partially lost, or because the mmapped buffer was partially overwritten.
 The fact that all instances of corruption I looked at were correct right
 to the end of one block boundary, and then all zeros instead of the
 beginning of the subsequent block to the end of that page, makes a
 failure to write that modified page seem more likely in my mind (more so
 given the Firefox _CACHE_ file oddities in snapshots); intense memory
 pressure at the time of the corruption also seems to favor this
 possibility.
 
 Now, it could be that btrfs requires those who modify SHARED mmap()ed
 files so as to make sure that data makes it to a subsequent snapshot,
 along the lines of msync MS_ASYNC, and leveldb does not take this sort
 of precaution.  However, I noticed that the unexpected stream of zeros
 after a prior block and before the rest of the subsequent block
 *remains* in subsequent snapshots, which to me indicates the page update
 is effectively lost.  This explains why even the running osd, that
 operates on the “current” subvolumes from which snapshots for recovery
 are taken, occasionally crashes because of database corruption, and will
 later fail to restart from an earlier snapshot due to that same
 corruption.
 
 
 Does this problem sound familiar to anyone else?
 
 Should mmaped-file writers in general do more than umount or msync to
 ensure changes make it to subsequent snapshots that are supposed to be
 consistent?
 
 Any tips on where to start looking so as to fix the problem, or even to
 confirm that the problem is indeed 

[PATCH] ceph: fix buffer pointer advance in ceph_sync_write

2013-03-18 Thread Henry C Chang
We should advance the user data pointer by _len_ instead of _written_.
_len_ is the data length written in each iteration, while _written_ is the
accumulated data length we have written out.

Signed-off-by: Henry C Chang henry.cy.ch...@gmail.com
---
 fs/ceph/file.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index e51558f..4bcbcb6 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -608,7 +608,7 @@ out:
pos += len;
written += len;
left -= len;
-   data += written;
+   data += len;
if (left)
goto more;
 
-- 
1.7.9.5
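To see why this only bites writes that span more than one chunk: on the
first pass written == len, so the buggy advance happens to be harmless;
from then on written is the running total, the pointer overshoots, and
later chunks are copied from the wrong part of the caller's buffer.  A toy
model of the loop (a Python sketch, not the kernel code):

def sync_write_model(data, chunk=8, buggy=False):
    out = b''
    written = 0
    left = len(data)
    i = 0                                   # models the 'data' pointer
    while left:
        length = min(chunk, left)
        out += data[i:i + length]           # what this pass actually writes
        written += length
        left -= length
        i += written if buggy else length   # the bug advanced by the running total
    return out

payload = b'abcdefgh' + b'ijklmnop' + b'qrstuvwx'        # three chunks
assert sync_write_model(payload) == payload
assert sync_write_model(payload, buggy=True) != payload  # data from the third chunk on is wrong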



Re: Direct IO on CephFS for blocks larger than 8MB

2013-03-18 Thread Henry C Chang
I just sent out the patch with sign-off. Thanks for testing.

2013/3/19 Sage Weil s...@inktank.com:
 On Mon, 18 Mar 2013, Greg Farnum wrote:
 On Saturday, March 16, 2013 at 5:38 AM, Henry C Chang wrote:
  The following patch should fix the problem.
 
  -Henry
 
  diff --git a/fs/ceph/file.c b/fs/ceph/file.c
  index e51558f..4bcbcb6 100644
  --- a/fs/ceph/file.c
  +++ b/fs/ceph/file.c
  @@ -608,7 +608,7 @@ out:
  pos += len;
  written += len;
  left -= len;
  - data += written;
  + data += len;
  if (left)
  goto more;

 This looks good to me. If you'd like to submit it as a proper patch with a 
 sign-off I'll pull it into our tree. :)
 -Greg

 I just added a quick test and it fixes it up.  :)

 sage


Re: Ceph availability test recovering question

2013-03-18 Thread Wolfgang Hennerbichler


On 03/17/2013 05:18 AM, kelvin_hu...@wiwynn.com wrote:
 Hi, all

Hi,
 ...
 My question is:
 1. Is the I/O pause normal while ceph is recovering?

I have experienced the same issue. This works as designed, and is
probably because of the heartbeat timeout: the 'osd heartbeat grace'
period is set to 20 secs by default - see:
http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/

 2. Can the I/O pause be avoided while ceph is recovering?

You can always lower the grace period and heartbeat time, though I don't
know if this is a wise idea. Short networking interruptions might mark
your OSD out very quickly then.

 3. How can the I/O pause time be reduced?

see the link above, or this link here:
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#monitor-osd-interaction
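To make that concrete, the relevant ceph.conf knobs look like this (the
values shown are just examples; grace defaults to 20s as noted above):

[osd]
    osd heartbeat grace = 20          # how long a silent OSD is tolerated before being reported down
    osd heartbeat interval = 6        # how often OSDs ping their peers
[mon]
    mon osd down out interval = 300   # how long a down OSD waits before being marked out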

 
 Thanks!!
 


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at


Re: corruption of active mmapped files in btrfs snapshots

2013-03-18 Thread Alexandre Oliva
On Mar 18, 2013, Chris Mason chris.ma...@fusionio.com wrote:

 A few questions.  Does leveldb use O_DIRECT and mmap together?

No, it doesn't use O_DIRECT at all.  Its I/O interface is very
simplified: it just opens each new file (database chunks limited to 2MB)
with O_CREAT|O_RDWR|O_TRUNC, and then uses ftruncate, mmap, msync,
munmap and fdatasync.  It doesn't seem to modify data once it's written;
it only appends.  Reading data back from it uses a completely different
class interface, using separate descriptors and using pread only.

 (the source of a write being pages that are mmap'd from somewhere
 else)

AFAICT the source of the memcpy()s that append to the file is
malloc()ed memory.

 That's the most likely place for this kind of problem.  Also, you
 mention crc errors.  Are those reported by btrfs or are they application
 level crcs.

These are CRCs leveldb computes and writes out after each db block.  No
btrfs CRC errors are reported in this process.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer