Let me make a simpler case. To get ACID (https://en.wikipedia.org/wiki/ACID) semantics, which are all properties you want in a filesystem or a database, you need a journal. You need a journaled filesystem to make the object store's file operations safe, and you need a journal in ceph to make the object operations safe. Flipped bits are a separate problem that journaling may help with, but the primary objective of a journal is to make guarantees about concurrent operations and interrupted operations. There isn't a person on this list who hasn't had an osd die. Without a journal, starting that osd up again and getting it usable would be impractical.
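To make the "interrupted operations" point concrete, here is a toy write-ahead journal in Python. It is only a sketch under my own assumptions: the JournaledStore class, its methods, and the crash_before_apply flag are made up for illustration, and this is not how ceph's OSD journal is actually implemented. The idea is simply that the intent is made durable before the object is touched, so a restart can finish whatever was interrupted.

```python
import json
import os
import tempfile


class JournaledStore:
    """Toy object store (hypothetical): each append is journaled, then applied."""

    def __init__(self, root):
        self.root = root
        self.journal_path = os.path.join(root, "journal.log")
        os.makedirs(root, exist_ok=True)
        self.replay()  # finish anything a previous run left half-done

    def _object_path(self, name):
        return os.path.join(self.root, "obj_" + name)

    def _apply(self, entry):
        # Idempotent: writing the same bytes at the recorded offset gives
        # the same object whether this runs once or several times.
        path = self._object_path(entry["name"])
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:
            f.seek(entry["offset"])
            f.write(entry["data"].encode())
            f.truncate()
            f.flush()
            os.fsync(f.fileno())

    def _trim(self):
        # Everything in the journal has been applied, so empty it.
        open(self.journal_path, "w").close()

    def append(self, name, data, crash_before_apply=False):
        path = self._object_path(name)
        offset = os.path.getsize(path) if os.path.exists(path) else 0
        entry = {"name": name, "offset": offset, "data": data}
        # Step 1: make the intent durable before touching the object.
        with open(self.journal_path, "a") as j:
            j.write(json.dumps(entry) + "\n")
            j.flush()
            os.fsync(j.fileno())
        if crash_before_apply:
            raise RuntimeError("simulated crash after journal commit")
        # Step 2: apply to the object. Step 3: trim the journal.
        self._apply(entry)
        self._trim()

    def replay(self):
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path) as j:
            for line in j:
                try:
                    entry = json.loads(line)
                except ValueError:
                    break  # torn record: it was never committed, so drop it
                self._apply(entry)
        self._trim()

    def read(self, name):
        with open(self._object_path(name), "rb") as f:
            return f.read().decode()


root = tempfile.mkdtemp()
store = JournaledStore(root)
store.append("a", "hello ")
try:
    store.append("a", "world", crash_before_apply=True)
except RuntimeError:
    pass  # the "osd" died with the append journaled but not applied

restarted = JournaledStore(root)  # startup replays the journal
result = restarted.read("a")
print(result)
```

Without the journal, the restarted store would have no record that the second append was ever requested, which is the situation you end up in when an osd dies mid-write on an unjournaled filesystem.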
On Aug 21, 2013, at 8:00 PM, Johannes Klarenbeek <[email protected]> wrote:

> I think you are missing the distinction between metadata journaling and
> data journaling. In most cases a journaling filesystem is one that
> journals its own metadata, but your data is on its own. Consider the case
> where you have a replication level of two, the osd filesystems have
> journaling disabled, and you append a block to a file (which is an object
> in terms of ceph) but only one OSD commits the change in file size to
> disk. Later you scrub and discover a discrepancy in object sizes. With a
> replication level of 2 there is no way to authoritatively say which one is
> correct just based on what's in ceph. This is a similar scenario to a
> btrfs bug that caused me to lose data with ceph. Journaling your metadata
> is the absolute minimum level of assurance you need to make a
> transactional system like ceph work.
>
> Hey Mike J
>
> I get your point. However, isn’t it then possible to authoritatively say
> which one is the correct one in case of 3 OSDs?
> Or is the replication level a configuration setting that tells the cluster
> that the object needs to be replicated 3 times?
> In both cases, data scrubbing chooses the majority of the same-same
> replicated objects in order to know which one is authoritative.
>
> But I also believe (!) that each object has a checksum, and each PG too, so
> that it should be easy to find the corrupted object on any of the OSDs.
> How else would scrubbing find corrupted sectors? Especially when I think
> about 2TB SATA disks being hit by cosmic rays that flip a bit somewhere.
> It happens more often with big cheap TB disks, but that doesn’t mean the
> corrupted sector is a bad sector (as in not usable anymore). Journaling is
> not going to help anyone with this.
> Therefore I believe (again) that the data scrubber must have a mechanism to
> detect these types of corruption even in a 2 OSD setup by means of
> checksums (or better, with a hashed checksum id).
>
> Also, aren’t there 2 types of transactions: one for writing and one for
> replicating?
>
> On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek
> <[email protected]> wrote:
>
> Dear ceph-users,
>
> I read a lot of documentation today about ceph architecture, and linux file
> system benchmarks in particular, and I could not help noticing something
> that I would like to clear up for myself. Take into account that it has
> been a while since I actually touched linux, but I did some programming on
> php2b12 and apache back in the day, so I’m not a complete newbie. The real
> question is below if you do not like reading the rest ;)
>
> What I have come to understand about file systems for OSDs is that in
> theory btrfs is the file system of choice. However, due to its young age
> it’s not considered stable yet. Therefore EXT4, but preferably XFS, is used
> in most cases. It seems that most people choose these systems because of
> their journaling feature, and XFS for its additional attribute storage,
> which has a 64kb limit that should be sufficient for most operations.
>
> But when you look at file system benchmarks, btrfs is really, really slow.
> Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput
> results. On journaling systems (like XFS, EXT4 and btrfs) disabling
> journaling actually helps throughput as well, sometimes by more than 2
> times for write actions.
>
> The preferred configuration for OSDs is one OSD per disk. Each object is
> striped among all Object Storage Daemons in a cluster. So if I would take
> one disk from the cluster and check its data, chances are slim that I will
> find a complete object there (a non-striped, full object I mean).
>
> When a client issues an object write (I assume a full object/file write in
> this case) it is the client’s responsibility to stripe it among the object
> storage daemons. When a stripe is successfully stored by the daemon, an ACK
> signal is sent to (?) the client and all participating OSDs. When all
> participating OSDs for the object have completed, the client assumes all is
> well and returns control to the application.
>
> If I’m not mistaken, journaling is meant for the rare occasions when a
> hardware failure occurs and the data is corrupted. Ceph does this too, in
> another way of course. But ceph should be able to notice whether a
> block/stripe is correct or not. In the rare occasion that a node fails
> while doing a write, an ACK signal is not sent to the caller and therefore
> the client can resend the block/stripe to another OSD. Therefore I fail to
> see the purpose of this extra journaling feature.
>
> Also, ceph schedules a data scrubbing process every day (or however it is
> configured) that should be able to tackle bad sectors or other errors on
> the file system and accordingly repair them on the same daemon or flag the
> whole block as bad. Since everything is replicated, the block is still in
> the storage cluster, so no harm is done.
>
> In a normal/single file system I truly see the value of journaling and the
> potential for btrfs (although it’s still very slow). However, in a system
> like ceph, journaling seems to me more like a paranoid super fail-safe.
>
> Did anyone experiment with file systems that had journaling disabled, and
> how did they perform?
>
> Regards,
> Johannes
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
