On Wed, 7 Jan 2015, Ma, Jianpeng wrote:
> > ---------- Forwarded message ----------
> > From: Paweł Sadowski <[email protected]>
> > Date: 2014-12-30 21:40 GMT+08:00
> > Subject: Re: Ceph data consistency
> > To: Vijayendra Shamanna <[email protected]>,
> > "[email protected]" <[email protected]>
> >
> > On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
> > > Hi,
> > >
> > > There is a sync thread (sync_entry in FileStore.cc) which triggers
> > > periodically and executes sync_filesystem() to ensure that the data is
> > > consistent. The journal entries are trimmed only after a successful
> > > sync_filesystem() call.
> >
> > sync_filesystem() always returns zero, so the journal will be trimmed
> > regardless. Executing sync()/syncfs() with dirty data in the disk
> > buffers will result in data loss ("lost page write due to I/O error").
>
> Hi Sage,
>
> From the git log, I see that sync_filesystem() originally returned the
> result of syncfs(). But commit 808c644248e486f44 changed that:
>
>     Improve use of syncfs.
>     Test syncfs return value and fallback to btrfs sync and then sync.
>
> The author's hope was that if syncfs() hit an error, sync() could recover
> from it. Because sync() doesn't return a result, sync_filesystem() now
> always returns zero. But which errors can actually be handled this way?
> AFAIK, none. I suggest it directly return the result of syncfs().
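For concreteness, the control flow being discussed looks roughly like the
sketch below. This is an illustration, not the actual FileStore code: the
real commit also tries a btrfs-specific sync before falling back to sync(),
and the function names here are hypothetical.

#include <unistd.h>   // syncfs(), sync() (glibc, _GNU_SOURCE)
#include <cerrno>

// Current behavior (roughly): if syncfs() fails, fall back to a global
// sync().  sync() returns nothing, so the error is swallowed and the
// caller -- the journal trim -- always sees success.
int sync_filesystem_current(int fd)
{
  if (::syncfs(fd) == 0)
    return 0;
  ::sync();
  return 0;   // error lost; journal gets trimmed anyway
}

// Suggested behavior: propagate the syncfs() error so the caller can
// refuse to trim the journal when the sync did not really happen.
int sync_filesystem_suggested(int fd)
{
  if (::syncfs(fd) == 0)
    return 0;
  return -errno;
}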
Yeah, that sounds right!
sage

> Jianpeng Ma
> Thanks!
>
> > I was doing some experiments simulating disk errors using the Device
> > Mapper "error" target. In this setup the OSD kept writing to the broken
> > disk without crashing. Every 5 seconds (filestore_max_sync_interval)
> > the kernel logged that some data had been discarded due to an IO error.
> >
> > > Thanks
> > > Viju
> > >
> > >> -----Original Message-----
> > >> From: [email protected]
> > >> [mailto:[email protected]] On Behalf Of Pawel Sadowski
> > >> Sent: Tuesday, December 30, 2014 1:52 PM
> > >> To: [email protected]
> > >> Subject: Ceph data consistency
> > >>
> > >> Hi,
> > >>
> > >> On our Ceph cluster we get some inconsistent PGs from time to time
> > >> (after deep-scrub). We have issues with disks/SATA cables/the LSI
> > >> controller causing occasional IO errors (but that's not the point in
> > >> this case).
> > >>
> > >> When an IO error occurs on the OSD journal partition, everything
> > >> works as it should -> the OSD crashes, and that's ok - Ceph will
> > >> handle it.
> > >>
> > >> But when an IO error occurs on the OSD data partition during journal
> > >> flush, the OSD continues to work. After calling *writev* (in
> > >> buffer::list::write_fd) the OSD does check the return code from this
> > >> call but does NOT verify that the write actually reached the disk
> > >> (the data is still only in memory and there is no fsync). That way
> > >> the OSD thinks the data has been stored on disk, but it might be
> > >> discarded (during sync the dirty page will be reclaimed and you'll
> > >> see "lost page write due to I/O error" in dmesg).
> > >>
> > >> Since there is no checksumming of data, I just wanted to make sure
> > >> that this is by design. Maybe there is a way to tell the OSD to call
> > >> fsync after write and have the data consistent?
> >
> > --
> > PS
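The missing step Pawel points out would look something like the sketch
below -- a hypothetical write path, not Ceph's actual
buffer::list::write_fd. A successful writev() only means the data reached
the page cache; without an fsync() the eventual writeback error never
reaches the caller.

#include <sys/uio.h>  // writev()
#include <unistd.h>   // fsync()
#include <cerrno>

// Hypothetical durable write path (short writes ignored for brevity).
int durable_writev(int fd, const struct iovec *iov, int iovcnt)
{
  ssize_t r = ::writev(fd, iov, iovcnt);
  if (r < 0)
    return -errno;   // the write into the page cache itself failed

  // Without this fsync, an I/O error during writeback only shows up as
  // "lost page write due to I/O error" in dmesg, while the caller
  // believes the data is safely on disk.
  if (::fsync(fd) < 0)
    return -errno;   // writeback to disk failed; report it
  return 0;
}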
