On Wed, 7 Jan 2015, Ma, Jianpeng wrote:
> > ---------- Forwarded message ----------
> > From: Paweł Sadowski <[email protected]>
> > Date: 2014-12-30 21:40 GMT+08:00
> > Subject: Re: Ceph data consistency
> > To: Vijayendra Shamanna <[email protected]>,
> > "[email protected]" <[email protected]>
> >
> > On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
> > > Hi,
> > >
> > > There is a sync thread (sync_entry in FileStore.cc) which triggers
> > > periodically and executes sync_filesystem() to ensure that the data is
> > > consistent. The journal entries are trimmed only after a successful
> > > sync_filesystem() call.
> >
> > sync_filesystem() always returns zero, so the journal will be trimmed
> > regardless. Executing sync()/syncfs() with dirty data in the disk
> > buffers will result in data loss ("lost page write due to I/O error").
>
> Hi Sage,
>
> From the git log, I see that sync_filesystem() originally returned the
> result of syncfs(). But commit 808c644248e486f44 changed that:
>
>     Improve use of syncfs.
>     Test syncfs return value and fallback to btrfs sync and then sync.
>
> The author's hope was that if syncfs() hit an error, sync() could recover
> from it. Because sync() doesn't return a result, sync_filesystem() now
> always returns zero. But which errors can actually be handled this way?
> AFAIK, none. I suggest it directly return the result of syncfs().
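For concreteness, the control flow being discussed looks roughly like the
sketch below. This is an illustration, not the actual FileStore code: the
real commit also tries a btrfs-specific sync before falling back to sync(),
and the function names here are hypothetical.

#include <unistd.h>   // syncfs(), sync() (glibc, _GNU_SOURCE)
#include <cerrno>

// Current behavior (roughly): if syncfs() fails, fall back to a global
// sync().  sync() returns nothing, so the error is swallowed and the
// caller -- the journal trim -- always sees success.
int sync_filesystem_current(int fd)
{
  if (::syncfs(fd) == 0)
    return 0;
  ::sync();
  return 0;   // error lost; journal gets trimmed anyway
}

// Suggested behavior: propagate the syncfs() error so the caller can
// refuse to trim the journal when the sync did not really happen.
int sync_filesystem_suggested(int fd)
{
  if (::syncfs(fd) == 0)
    return 0;
  return -errno;
}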
Yeah, that sounds right!
sage

> Jianpeng Ma
> Thanks!
>
> > I was doing some experiments simulating disk errors using the Device
> > Mapper "error" target. In this setup the OSD kept writing to the broken
> > disk without crashing. Every 5 seconds (filestore_max_sync_interval)
> > the kernel logged that some data had been discarded due to an IO error.
> >
> > > Thanks
> > > Viju
> > >
> > >> -----Original Message-----
> > >> From: [email protected]
> > >> [mailto:[email protected]] On Behalf Of Pawel Sadowski
> > >> Sent: Tuesday, December 30, 2014 1:52 PM
> > >> To: [email protected]
> > >> Subject: Ceph data consistency
> > >>
> > >> Hi,
> > >>
> > >> On our Ceph cluster we get some inconsistent PGs from time to time
> > >> (after deep-scrub). We have issues with disks/SATA cables/the LSI
> > >> controller causing occasional IO errors (but that's not the point in
> > >> this case).
> > >>
> > >> When an IO error occurs on the OSD journal partition, everything
> > >> works as it should -> the OSD crashes, and that's ok - Ceph will
> > >> handle it.
> > >>
> > >> But when an IO error occurs on the OSD data partition during journal
> > >> flush, the OSD continues to work. After calling *writev* (in
> > >> buffer::list::write_fd) the OSD does check the return code from this
> > >> call but does NOT verify that the write actually reached the disk
> > >> (the data is still only in memory and there is no fsync). That way
> > >> the OSD thinks the data has been stored on disk, but it might be
> > >> discarded (during sync the dirty page will be reclaimed and you'll
> > >> see "lost page write due to I/O error" in dmesg).
> > >>
> > >> Since there is no checksumming of data, I just wanted to make sure
> > >> that this is by design. Maybe there is a way to tell the OSD to call
> > >> fsync after write and have the data consistent?
> >
> > --
> > PS
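The missing step Pawel points out would look something like the sketch
below -- a hypothetical write path, not Ceph's actual
buffer::list::write_fd. A successful writev() only means the data reached
the page cache; without an fsync() the eventual writeback error never
reaches the caller.

#include <sys/uio.h>  // writev()
#include <unistd.h>   // fsync()
#include <cerrno>

// Hypothetical durable write path (short writes ignored for brevity).
int durable_writev(int fd, const struct iovec *iov, int iovcnt)
{
  ssize_t r = ::writev(fd, iov, iovcnt);
  if (r < 0)
    return -errno;   // the write into the page cache itself failed

  // Without this fsync, an I/O error during writeback only shows up as
  // "lost page write due to I/O error" in dmesg, while the caller
  // believes the data is safely on disk.
  if (::fsync(fd) < 0)
    return -errno;   // writeback to disk failed; report it
  return 0;
}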
