On Mon, 11 Jun 2012, Guido Winkelmann wrote:
> Am Freitag, 8. Juni 2012, 06:55:19 schrieb Sage Weil:
> > On Fri, 8 Jun 2012, Oliver Francke wrote:
> > > Hi Guido,
> > >
> > > yeah, there is something weird going on. I just started to establish some
> > > test-VM's. Freshly imported from running *.qcow2 images.
> > > Kernel panic with INIT, seg-faults and other "funny" stuff.
> > >
> > > Just added the rbd_cache=true in my config, voila. All is
> > > fast-n-up-n-running...
> > > All my testing was done with cache enabled... Since our errors all came
> > > from rbd_writeback from former ceph-versions...
> >
> > Are you guys able to reproduce the corruption with 'debug osd = 20' and
> > 'debug ms = 1'? Ideally we'd like to:
> >
> > - reproduce from a fresh vm, with osd logs
> > - identify the bad file
> > - map that file to a block offset (see
> > http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> > - use that to identify the badness in the log
> >
> > I suspect the cache is just masking the problem because it submits fewer
> > IOs...
>
> Okay, I added 'debug osd = 20' and 'debug ms = 1' under [global] and
> 'filestore fiemap = false' under [osd] and started a new VM. That worked
> nicely, and the iotester found no corruptions. Then I removed 'filestore
> fiemap = false' from the config, restarted all osds and ran the iotester
> again. Output is as follows:
>
> testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
> Mon Jun 11 17:34:44 CEST 2012
> Wrote 100 MiB of data in 1943 milliseconds
> Wrote 100 MiB of data in 1858 milliseconds
> Wrote 100 MiB of data in 2213 milliseconds
> Wrote 100 MiB of data in 3441 milliseconds
> Wrote 100 MiB of data in 2705 milliseconds
> Wrote 100 MiB of data in 1778 milliseconds
> Wrote 100 MiB of data in 1974 milliseconds
> Wrote 100 MiB of data in 2780 milliseconds
> Wrote 100 MiB of data in 1961 milliseconds
> Wrote 100 MiB of data in 2366 milliseconds
> Wrote 100 MiB of data in 1886 milliseconds
> Wrote 100 MiB of data in 3589 milliseconds
> Wrote 100 MiB of data in 1973 milliseconds
> Wrote 100 MiB of data in 2506 milliseconds
> Wrote 100 MiB of data in 1937 milliseconds
> Wrote 100 MiB of data in 3404 milliseconds
> Wrote 100 MiB of data in 1990 milliseconds
> Wrote 100 MiB of data in 3713 milliseconds
> Read 100 MiB of data in 4856 milliseconds
> Digest wrong for file "/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa"
> Mon Jun 11 17:35:34 CEST 2012
> testserver-rbd11 iotester # ~/fiemap
> /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa
> File /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa has 1 extents:
> # Logical Physical Length Flags
> 0: 0000000000000000 00000000a8200000 0000000000100000 000
>
> I looked into the file in question, and it started with zero-bytes from the
> start until position 0xbff, even though it was supposed to all random data.
>
> I have included timestamps in the hopes they might make it easier to find the
> related entries in the logs.
>
> So what do I do now? The logs are very large and complex, and I don't
> understand most of what's in there. I don't even know which OSD served that
> particular block/object.
If you can reproduce it with 'debug filestore = 20' too, that will be
better, as it will tell us what the FIEMAP ioctl is returning. Also, if
you can attach/post the contents of the object itself (rados -p rbd get
rb.0.1.0000000002a0 /tmp/foo) we can make sure the object has the right
data (and the sparse-read operation that librbd is doing is the culprit).
As for the log:
First, map the offset to an rbd block. For example, taking the 'Physical'
value of 00000000a8200000 from above:
$ printf "%012x\n" $((0x00000000a8200000 / (4096*1024) ))
0000000002a0
Then figure out what the object name prefix is:
$ rbd info <imagename> | grep prefix
block_name_prefix: rb.0.1
Then add the block number, 0000000002a0 to that, e.g. rb.0.1.0000000002a0.
Then map that back to an osd with
$ ceph osd map rbd rb.0.1.0000000002a0
osdmap e19 pool 'rbd' (2) object 'rb.0.1.0000000002a0' -> pg 2.a2e06f65
(2.5) -> up [0,2] acting [0,2]
You'll see the osd ids listed in brackets after 'active'. We want the
first one, 0 in my example. The log from that OSD is what we need.
Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html