I think you are missing the distinction between metadata journaling and data journaling. In most cases a journaling filesystem is one that journals its own metadata, but your data is on its own. Consider the case where you have a replication level of two, the OSD filesystems have journaling disabled, and you append a block to a file (which is an object in ceph terms), but only one OSD commits the change in file size to disk. Later you scrub and discover a discrepancy in object sizes. With a replication level of 2 there is no way to say authoritatively which copy is correct based only on what's in ceph. This is similar to a btrfs bug that caused me to lose data with ceph. Journaling your metadata is the absolute minimum level of assurance you need to make a transactional system like ceph work.
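To make the two-replica problem concrete, here is a minimal sketch. It is a toy majority vote over reported object sizes, not ceph's actual scrub or repair logic; the function name `pick_authoritative` and the OSD names and sizes are made up for illustration.

```python
from collections import Counter

def pick_authoritative(replica_sizes):
    """Toy model (hypothetical, not ceph code): return the object size
    reported by a strict majority of replicas, or None if there is none."""
    counts = Counter(replica_sizes.values())
    size, votes = counts.most_common(1)[0]
    # A strict majority needs more than half of all replicas to agree.
    if votes * 2 > len(replica_sizes):
        return size
    return None

# Replication level 2: one OSD committed the new size, the other did not.
two_replicas = {"osd.0": 8192, "osd.1": 4096}
print(pick_authoritative(two_replicas))    # None: nothing breaks the tie

# With a third copy, the same disagreement can at least be outvoted.
three_replicas = {"osd.0": 8192, "osd.1": 4096, "osd.2": 8192}
print(pick_authoritative(three_replicas))  # 8192
```

With only two copies and no trustworthy metadata on either side, the comparison alone cannot tell the stale copy from the good one, which is the situation described above.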

On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek <johannes.klarenb...@rigo.nl> wrote:

> Dear ceph-users,
>
> I read a lot of documentation today about ceph architecture and linux file
> system benchmarks in particular, and I could not help noticing something that
> I would like to clear up for myself. Take into account that it has been a
> while since I actually touched linux, but I did some programming on php2b12
> and apache back in the day, so I’m not a complete newbie. The real question
> is below if you do not like reading the rest ;)
>
> What I have come to understand about file systems for OSDs is that in theory
> btrfs is the file system of choice. However, due to its young age it’s not
> considered stable yet. Therefore EXT4, but preferably XFS, is used in most
> cases. It seems that most people choose this system because of its journaling
> feature, and XFS for its additional attribute storage, which has a 64kb limit
> that should be sufficient for most operations.
>
> But when you look at file system benchmarks, btrfs is really, really slow.
> Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput
> results. On journaling systems (like XFS, EXT4 and btrfs) disabling
> journaling actually helps throughput as well, sometimes by more than 2 times
> for write actions.
>
> The preferred configuration for OSDs is one OSD per disk. Each object is
> striped among all Object Storage Daemons in a cluster. So if I would take one
> disk from the cluster and check its data, chances are slim that I will find a
> complete object there (a non-striped, full object I mean).
>
> When a client issues an object write (I assume a full object/file write in
> this case) it is the client’s responsibility to stripe it among the object
> storage daemons. When a stripe is successfully stored by the daemon, an ACK
> signal is sent to (?) the client and all participating OSDs. When all
> participating OSDs for the object have completed, the client assumes all is
> well and returns control to the application.
>
> If I’m not mistaken, journaling is meant for the rare occasions when a
> hardware failure occurs and the data is corrupted. Ceph does this too, in
> another way of course. But ceph should be able to notice whether a
> block/stripe is correct or not. On the rare occasion that a node fails while
> doing a write, an ACK signal is not sent to the caller and therefore the
> client can resend the block/stripe to another OSD. Therefore I fail to see
> the purpose of this extra journaling feature.
>
> Also, ceph schedules a data scrubbing process every day (or however it is
> configured) that should be able to tackle bad sectors or other errors on the
> file system and accordingly repair them on the same daemon or flag the whole
> block as bad. Since everything is replicated, the block is still in the
> storage cluster, so no harm is done.
>
> In a normal/single file system I truly see the value of journaling and the
> potential for btrfs (although it’s still very slow). However, in a system
> like ceph, journaling seems to me more like a paranoid super fail-safe.
>
> Did anyone experiment with file systems that had journaling disabled, and how
> did it perform?
>
> Regards,
> Johannes
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com