Re: [ceph-users] Storage, File Systems and Data Scrubbing

Mike Lowe Wed, 21 Aug 2013 18:04:55 -0700

Let me make a simpler case.  To get ACID (https://en.wikipedia.org/wiki/ACID)
guarantees, which are all properties you want in a filesystem or a database,
you need a journal.  You need a journaled filesystem to make the object
store's file operations safe, and you need a journal in ceph to make sure the
object operations are safe.  Flipped bits are a separate problem that
journaling may help with, but the primary objective of a journal is to make
guarantees about concurrent operations and interrupted operations.  There
isn't a person on this list who hasn't had an OSD die.  Without a journal,
starting that OSD up again and getting it usable would be impractical.
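
To make the "interrupted operations" point concrete, here is a minimal
sketch of a write-ahead journal.  It is a toy, not how ceph or any real
filesystem implements journaling: the JournaledStore class, the JSON record
format and the demo() helper are all made up for illustration.  Each write is
appended to a journal and fsync'ed before it is applied, so after a crash the
state can be rebuilt by replaying the journal and discarding a torn final
record.

```python
import json
import os
import tempfile


class JournaledStore:
    """Toy key/value store that logs each write to a journal before applying it."""

    def __init__(self, directory):
        self.journal_path = os.path.join(directory, "journal.log")
        self.data = {}
        self.replay()

    def replay(self):
        """Rebuild in-memory state from the journal, stopping at a torn record."""
        self.data = {}
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, "r", encoding="utf-8") as fh:
            for line in fh:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # interrupted append: ignore the partial record
                self.data[record["key"]] = record["value"]

    def write(self, key, value):
        """Append the record to the journal, force it to disk, then apply it."""
        record = json.dumps({"key": key, "value": value})
        with open(self.journal_path, "a", encoding="utf-8") as fh:
            fh.write(record + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        self.data[key] = value


def demo():
    """Write two records, simulate a crash mid-append, and recover."""
    with tempfile.TemporaryDirectory() as directory:
        store = JournaledStore(directory)
        store.write("obj1", "size=4096")
        store.write("obj2", "size=8192")
        # Simulate a crash in the middle of a journal append (torn record).
        with open(store.journal_path, "a", encoding="utf-8") as fh:
            fh.write('{"key": "obj3", "val')
        recovered = JournaledStore(directory)
        return recovered.data
```

After the simulated crash, demo() recovers the two committed writes and drops
the half-written third one, which is the guarantee a journal is there to give.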


On Aug 21, 2013, at 8:00 PM, Johannes Klarenbeek <[email protected]> 
wrote:

>  
>  
>  
> I think you are missing the distinction between metadata journaling and data
> journaling.  In most cases a journaling filesystem is one that journals its
> own metadata, but your data is on its own.  Consider the case where you have
> a replication level of two, the OSD filesystems have journaling disabled, and
> you append a block to a file (which is an object in terms of ceph) but only
> one OSD commits the change in file size to disk.  Later you scrub and
> discover a discrepancy in object sizes.  With a replication level of 2 there
> is no way to authoritatively say which one is correct just based on what's in
> ceph.  This is a similar scenario to a btrfs bug that caused me to lose data
> with ceph.  Journaling your metadata is the absolute minimum level of
> assurance you need to make a transactional system like ceph work.
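
The two-replica problem above can be sketched in a few lines.  This is only
an illustration of the voting argument, not ceph's scrub logic: the
pick_authoritative function and the sizes are hypothetical.  With two
replicas that disagree there is no strict majority, so nothing can be chosen;
with three, a majority exists, although a majority is still not proof of
correctness.

```python
from collections import Counter


def pick_authoritative(replica_sizes):
    """Return the size reported by a strict majority of replicas, or None on a tie."""
    if not replica_sizes:
        return None
    size, votes = Counter(replica_sizes).most_common(1)[0]
    # A strict majority means more than half of all replicas agree.
    if votes * 2 > len(replica_sizes):
        return size
    return None


# Two replicas that disagree: no way to decide from the replicas alone.
two_way = pick_authoritative([4096, 8192])

# Three replicas: two agree, so a majority exists.
three_way = pick_authoritative([4096, 8192, 8192])
```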
>  
> Hey Mike :)
>  
> I get your point. However, isn’t it then possible to authoritatively say
> which one is the correct one in the case of 3 OSDs?
> Or is the replication level a configuration setting that tells the cluster
> that the object needs to be replicated 3 times?
> In both cases, data scrubbing chooses the majority among the identical
> replicated objects in order to know which one is authoritative.
>  
> But I also believe (!) that each object has a checksum, and each PG too, so
> that it should be easy to find the corrupted object on any of the OSDs.
> How else would scrubbing find corrupted sectors? Especially when I think
> about 2TB SATA disks being hit by cosmic rays that flip a bit somewhere.
> It happens more often with big cheap TB disks, but that doesn’t mean the
> corrupted sector is a bad sector (as in not usable anymore). Journaling is
> not going to help anyone with this.
> Therefore I believe (again) that the data scrubber must have a mechanism to
> detect these types of corruptions even in a 2 OSD setup by means of checksums
> (or better, with a hashed checksum id).
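
The checksum idea in the paragraph above can be sketched as follows.  This
only illustrates the poster's proposal and makes no claim about how ceph's
scrubber actually works: the checksum and scrub functions are hypothetical,
and SHA-256 is an arbitrary choice of hash.  If a digest is recorded when the
object is written, a scrub can tell which replica has a flipped bit even with
only two copies, because each copy is compared against the digest rather than
against the other copy.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Digest recorded at write time (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas, expected_digest):
    """Return the indices of replicas whose content no longer matches the digest."""
    return [
        index
        for index, data in enumerate(replicas)
        if checksum(data) != expected_digest
    ]


original = b"hello ceph"
digest = checksum(original)

# Simulate a single flipped bit in the first byte of the second replica.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

bad_replicas = scrub([original, flipped], digest)
```

Here bad_replicas names the second copy as the corrupted one, something a
plain comparison of two replicas cannot do.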
>  
> Also, aren’t there 2 types of transactions: one for writing and one for
> replicating?
>  
> On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek 
> <[email protected]> wrote:
>  
> 
> Dear ceph-users,
>  
> I read a lot of documentation today about ceph architecture and linux file
> system benchmarks in particular, and I could not help noticing something that
> I would like to clear up for myself. Take into account that it has been a
> while since I actually touched linux, but I did some programming on php2b12
> and apache back in the day, so I’m not a complete newbie. The real question
> is below if you do not like reading the rest ;)
>  
> What I have come to understand about file systems for OSDs is that in theory
> btrfs is the file system of choice. However, due to its young age it’s not
> considered stable yet. Therefore EXT4, but preferably XFS, is used in most
> cases. It seems that most people choose these because of their journaling
> feature, and XFS for its extended attribute storage, which has a 64kb limit
> that should be sufficient for most operations.
>  
> But when you look at file system benchmarks btrfs is really, really slow.
> Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput
> results. On journaling systems (like XFS, EXT4 and btrfs) disabling
> journaling actually helps throughput as well, sometimes by more than a
> factor of 2 for write operations.
>  
> The preferred configuration for OSDs is one OSD per disk. Each object is
> striped among all Object Storage Daemons in a cluster. So if I took one
> disk from the cluster and checked its data, chances are slim that I would
> find a complete object there (a non-striped, full object I mean).
>  
> When a client issues an object write (I assume a full object/file write in
> this case) it is the client’s responsibility to stripe it among the object
> storage daemons. When a stripe is successfully stored by the daemon an ACK
> signal is sent to (?) the client and all participating OSDs. When all
> participating OSDs for the object have completed, the client assumes all is
> well and returns control to the application.
>  
> If I’m not mistaken, journaling is meant for the rare occasions when a
> hardware failure occurs and the data is corrupted. Ceph does this too, in
> another way of course. But ceph should be able to notice whether a
> block/stripe is correct or not. On the rare occasion that a node fails while
> doing a write, an ACK signal is not sent to the caller and therefore the
> client can resend the block/stripe to another OSD. Therefore I fail to see
> the purpose of this extra journaling feature.
>  
> Also ceph schedules a data scrubbing process every day (or however it is 
> configured) that should be able to tackle bad sectors or other errors on the 
> file system and accordingly repair them on the same daemon or flag the whole 
> block as bad. Since everything is replicated the block is still in the 
> storage cluster so no harm is done.
>  
> In a normal/single file system I truly see the value of journaling and the
> potential for btrfs (although it’s still very slow). However, in a system
> like ceph, journaling seems to me more like a paranoid super fail-safe.
>  
> Has anyone experimented with file systems that have journaling disabled, and
> how did they perform?
>  
> Regards,
> Johannes
>  
>  
>  
>  
> 
> 
> __________ Information from ESET Endpoint Antivirus, virus signature
> database version 8713 (20130821) __________
> 
> The message was checked by ESET Endpoint Antivirus.
> 
> http://www.eset.com
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  
> 
> 

