On 28/01/2016 22:32, Jan Schermer wrote:
> P.S. I feel very strongly that this whole concept is broken
> fundamentally. We already have a journal for the filesystem which is
> time proven, well behaved and above all fast. Instead there's this
> reinvented wheel which supposedly does it better in userspace while
> not really avoiding the filesystem journal either. It would maybe make
> sense if OSD was storing the data on a block device directly, avoiding
> the filesystem altogether. But it would still do the same bloody thing
> and (no disrespect) ext4 does this better than Ceph ever will.
>

Hmm, I've seen this discussed before, but I'm not sure the fs journal
could be used as a Ceph journal.

First, BTRFS doesn't have a journal per se, so with it you can't use the
xfs/ext4 approach of putting the journal on another device with a
data=journal setup to make write bursts/random writes fast. And I won't
go back to XFS or test ext4: I've detected too much silent hardware
corruption with BTRFS to trust our data to any filesystem that doesn't
check CRCs on reads (in our particular case, compression and speed are
additional bonuses).
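
As a rough illustration of why a checksum verified on every read catches
silent corruption, here is a minimal Python sketch. It is not how BTRFS
is implemented: `write_block` and `read_block` are hypothetical helpers,
and `zlib.crc32` stands in for whatever checksum the filesystem really
uses.

```python
import zlib


def write_block(data: bytes):
    # Store the data together with a checksum computed at write time.
    return data, zlib.crc32(data)


def read_block(data: bytes, stored_crc: int) -> bytes:
    # Recompute the checksum on read and refuse to return bad data.
    if zlib.crc32(data) != stored_crc:
        raise OSError("checksum mismatch: silent corruption detected")
    return data


block, crc = write_block(b"payload written by the OSD")

# Simulate the hardware flipping a single bit behind the filesystem's back.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]
```

Reading `block` back succeeds, while `read_block(corrupted, crc)` raises
instead of handing the damaged data back to the caller.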

Second, I'm not familiar with Ceph internals, but OSDs must make sure
that their PGs are in sync, so I was under the impression that the
content an OSD holds for a PG on its filesystem should always be
guaranteed to be on all the other active OSDs *or* in their journals (so
you wouldn't apply journal content unless the other journals have
already committed the same content). If you remove the journals, there
is no intermediate on-disk "buffer" that can be used to guarantee this:
one OSD will always have data that isn't guaranteed to be on disk on the
others. As I understand it, you could call this a form of 2-phase commit.
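
The ordering described above can be sketched in a few lines of Python.
This is only a toy model of that mental picture, not actual Ceph code:
`ToyOSD`, `commit_to_journal`, `apply_from_journal` and
`replicated_write` are hypothetical names, and plain dicts stand in for
the on-disk journal and filestore.

```python
class ToyOSD:
    """Hypothetical stand-in for an OSD: a journal plus a filestore."""

    def __init__(self, name):
        self.name = name
        self.journal = {}    # txid -> (key, value): committed, not yet applied
        self.filestore = {}  # key -> value: the applied state

    def commit_to_journal(self, txid, key, value):
        # Phase 1: record the write in the journal only.
        self.journal[txid] = (key, value)
        return True

    def apply_from_journal(self, txid):
        # Phase 2: move the journaled write into the filestore.
        key, value = self.journal.pop(txid)
        self.filestore[key] = value


def replicated_write(osds, txid, key, value):
    """Apply a write only once every replica has journaled it."""
    acks = [osd.commit_to_journal(txid, key, value) for osd in osds]
    if not all(acks):
        # Some journal did not commit: nobody applies anything.
        return False
    for osd in osds:
        osd.apply_from_journal(txid)
    return True


osds = [ToyOSD("osd.0"), ToyOSD("osd.1"), ToyOSD("osd.2")]
ok = replicated_write(osds, txid=1, key="obj", value=b"data")
```

In this model no filestore ever holds content that is missing from
another replica's journal, which is the guarantee that would be lost if
the journals were removed.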

I may be mistaken: there are structures in the filestore that *may* take
on this role, but I'm not sure what their exact use is: the
<pg_num>_TEMP dirs and the omap and meta dirs. My guess is that they
serve other purposes: it would make sense to use the journals for this
because the data is already there and the commit/apply coherency
barriers seem both trivial and efficient to use.

That's not to say that the journals are the only way to maintain the
needed coherency, just that they might be used for it because, once
they are there, this is a trivial extension of their use.

Lionel
