I think you are missing the distinction between metadata journaling and data 
journaling.  In most cases a journaling filesystem is one that journal's it's 
own metadata but your data is on its own.  Consider the case where you have a 
replication level of two, the osd filesystems have journaling disabled and you 
append a block to a file (which is an object in terms of ceph) but only one 
commits the change in file size to disk.  Later you scrub and discover a 
discrepancy in object sizes, with a replication level of 2 there is no way to 
authoritatively say which one is correct just based on what's in ceph.  This is 
a similar scenario to a btrfs bug that caused me to lose data with ceph.  
Journaling your metadata is the absolute minimum level of assurance you need to 
make a transactional system like ceph work.

On Aug 21, 2013, at 4:23 PM, Johannes Klarenbeek <johannes.klarenb...@rigo.nl> 
wrote:

> Dear ceph-users,
>  
> I read a lot of documentation today about ceph architecture and linux file 
> system benchmarks in particular and I could not help notice something that I 
> like to clear up for myself. Take into account that it has been a while that 
> I actually touched linux, but I did some programming on php2b12 and apache 
> back in the days so I’m not a complete newbie. The real question is below if 
> you do not like reading the rest ;)
>  
> What I have come to understand about file systems for OSD’s is that in theory 
> btrfs is the file system of choice. However, due to its young age it’s not 
> considered stable yet. Therefore EXT4 but preferably XFS is used in most 
> cases. It seems that most people choose this system because of its journaling 
> feature and XFS for its additional attribute storage which has a 64kb limit 
> which should be sufficient for most operations.
>  
> But when you look at file system benchmarks btrfs is really, really slow. 
> Then comes XFS, then EXT4, but EXT2 really dwarfs all other throughput 
> results. On journaling systems (like XFS, EXT4 and btrfs) disabling 
> journaling actually helps throughput as well. Sometimes more then 2 times for 
> write actions.
>  
> The preferred configuration for OSD’s is one OSD per disk. Each object is 
> striped among all Object Storage Daemons in a cluster. So if I would take one 
> disk for the cluster and check its data, chances are slim that I will find a 
> complete object there (a non-striped, full object I mean).
>  
> When a client issues an object write (I assume a full object/file write in 
> this case) it is the client’s responsibility to stripe it among the object 
> storage daemons. When a stripe is successfully stored by the daemon an ACK 
> signal is send to (?) the client and all participating OSD’s. When all 
> participating OSD’s for the object have completed the client assumes all is 
> well and returns control to the application
>  
> If I’m not mistaken, then journaling is meant for the rare occasions that a 
> hardware failure will occur and the data is corrupted. Ceph does this too in 
> another way of course. But ceph should be able to notice when a block/stripe 
> is correct or not. In the rare occasion that a node is failing while doing a 
> write; an ACK signal is not send to the caller and therefor the client can 
> resend the block/stripe to another OSD. Therefor I fail to see the purpose of 
> this extra journaling feature.
>  
> Also ceph schedules a data scrubbing process every day (or however it is 
> configured) that should be able to tackle bad sectors or other errors on the 
> file system and accordingly repair them on the same daemon or flag the whole 
> block as bad. Since everything is replicated the block is still in the 
> storage cluster so no harm is done.
>  
> In a normal/single file system I truly see the value of journaling and the 
> potential for btrfs (although it’s still very slow). However in a system like 
> ceph, journaling seems to me more like a paranoid super fail save.
>  
> Did anyone experiment with file systems that disabled journaling and how did 
> it perform?
>  
> Regards,
> Johannes
>  
>  
>  
>  
> 
> 
> __________ Informatie van ESET Endpoint Antivirus, versie van database 
> viruskenmerken 8713 (20130821) __________
> 
> Het bericht is gecontroleerd door ESET Endpoint Antivirus.
> 
> http://www.eset.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to