On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
> Hi,
>
> There is a sync thread (sync_entry in FileStore.cc) which runs 
> periodically and executes sync_filesystem() to ensure that the data is 
> consistent. The journal entries are trimmed only after a successful 
> sync_filesystem() call.

sync_filesystem() always returns zero, so the journal is always trimmed. 
Executing sync()/syncfs() while there is dirty data that can no longer be 
written back to a failing disk results in silent data loss ("lost page 
write due to I/O error").
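
To make the problem concrete, here is a simplified sketch (this is NOT the 
actual FileStore.cc code; Journal and committed_thru are made-up stand-ins) 
of why the trim can never be skipped:

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

static const unsigned filestore_max_sync_interval = 5;  // seconds

// Stand-in for sync_filesystem(): the syncfs() result is ignored and
// the function claims success unconditionally.
int sync_filesystem(int fd) {
  syncfs(fd);   // writeback failures are only visible in dmesg
  return 0;     // always "success"
}

// Hypothetical minimal journal interface, just for this sketch.
struct Journal {
  uint64_t applied_seq() const { return 42; }
  void committed_thru(uint64_t seq) {
    printf("trimming journal entries up to %llu\n",
           (unsigned long long)seq);
  }
};

// The sync thread: because sync_filesystem() can never fail, the
// journal is trimmed even when writeback just lost data.
void sync_entry(int basedir_fd, Journal &journal) {
  for (;;) {
    sleep(filestore_max_sync_interval);
    uint64_t seq = journal.applied_seq();
    if (sync_filesystem(basedir_fd) == 0)
      journal.committed_thru(seq);
  }
}

int main() {
  int fd = open(".", O_RDONLY);   // stand-in for the OSD data dir
  Journal j;
  sync_entry(fd, j);              // loops forever, like the sync thread
}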

I was doing some experiments simulating disk errors using the Device Mapper 
"error" target. In this setup the OSD kept writing to the broken disk without 
crashing. Every 5 seconds (filestore_max_sync_interval) the kernel logged 
that some data had been discarded due to an I/O error.
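
For reference, roughly how this can be reproduced (the path is made up; 
/mnt/baddisk is assumed to be a filesystem on a dm device whose table maps 
part of its range to the "error" target, set up with dmsetup):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  // Assumed path: a file whose blocks fall into the "error" region.
  int fd = open("/mnt/baddisk/testfile", O_WRONLY | O_CREAT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  char buf[4096];
  memset(buf, 'x', sizeof(buf));

  for (int i = 0; i < 3; ++i) {
    // Succeeds: the data only goes to the page cache.
    if (write(fd, buf, sizeof(buf)) < 0)
      perror("write");

    // Writeback fails against the error target, but on the kernels of
    // this era the failure is not propagated back to the caller.
    int rc = syncfs(fd);
    printf("pass %d: syncfs returned %d; check dmesg for "
           "\"lost page write due to I/O error\"\n", i, rc);

    sleep(5);  // mimic filestore_max_sync_interval
  }
  close(fd);
  return 0;
}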


> Thanks
> Viju
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Pawel Sadowski
>> Sent: Tuesday, December 30, 2014 1:52 PM
>> To: [email protected]
>> Subject: Ceph data consistency
>>
>> Hi,
>>
>> On our Ceph cluster we occasionally see inconsistent PGs (detected by 
>> deep-scrub). We have issues with disks/SATA cables/the LSI controller 
>> causing intermittent IO errors (but that's not the point in this case).
>>
>> When an IO error occurs on the OSD journal partition, everything works as 
>> it should -> the OSD crashes, and that's OK - Ceph will handle that.
>>
>> But when an IO error occurs on the OSD data partition during journal 
>> flush, the OSD continues to work. After calling *writev* (in 
>> buffer::list::write_fd) the OSD checks the return code from this call but 
>> does NOT verify that the write actually reached the disk (the data is 
>> still only in memory and there is no fsync). That way the OSD thinks the 
>> data has been stored on disk, but it might be discarded (during sync the 
>> dirty page will be reclaimed and you'll see "lost page write due to I/O 
>> error" in dmesg).
>>
>> Since there is no checksumming of data, I just wanted to make sure that 
>> this is by design. Maybe there is a way to tell the OSD to call fsync 
>> after each write and keep the data consistent?
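
To illustrate the gap described above (a simplified sketch, not the actual 
buffer::list::write_fd; the path is made up): writev() succeeding only means 
the kernel copied the bytes into the page cache, and it takes an explicit 
fsync() to force writeback and surface EIO:

#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <cstdio>
#include <cstring>
#include <cerrno>

// Sketch of a write_fd-style helper: the only check done is on the
// writev() return value, which says nothing about durability.
int write_fd(int fd, const char *data, size_t len) {
  struct iovec iov = { (void *)data, len };
  ssize_t r = writev(fd, &iov, 1);
  if (r < 0)
    return -errno;  // the only error path checked after writev
  return 0;         // "success": data is in the page cache, not on disk
}

int main() {
  // Made-up path standing in for an object file on the OSD data partition.
  int fd = open("/mnt/osd0/current/object", O_WRONLY | O_CREAT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  char buf[4096];
  memset(buf, 'x', sizeof(buf));
  if (write_fd(fd, buf, sizeof(buf)) == 0)
    printf("write_fd reports success (page cache only)\n");

  // The missing durability step: on a failing disk this returns EIO,
  // where the writev above happily reported success.
  if (fsync(fd) < 0)
    perror("fsync");
  close(fd);
  return 0;
}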

-- 
PS