Thanks for all your hard work on CephFS. This progress is very exciting to
hear about. I am constantly amazed at the amount of work that gets done in
Ceph in so short an amount of time.

On Mon, Apr 20, 2015 at 6:26 PM, Gregory Farnum <[email protected]> wrote:

> We’ve been hard at work on CephFS over the last year since Firefly was
> released, and with Hammer coming out it seemed like a good time to go over
> some of the big developments users will find interesting. Much of this is
> cribbed from John’s Linux Vault talk (
> http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf),
> in addition to the release notes (
> http://ceph.com/docs/master/release-notes/).
> ===========================================================================
> New Filesystem features & improvements:
>
> ceph-fuse has gained support for fcntl and flock locking. (Yan, Zheng)
> This has been in the kernel for a while but nobody had done the work to
> implement tracking structures and wire it up in userspace.
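>
> A minimal illustration of the BSD-style flock() locking described above, using the flock(1) wrapper on an ordinary file (the lock-file path is a placeholder; on a real system you would point it at a file under a ceph-fuse mount):

```shell
# Take and release an exclusive flock(2) lock via flock(1); the
# lock-file path is a placeholder, not from the original post.
exec 9>/tmp/cephfs-demo.lock   # open fd 9 on the lock file
flock -x 9                     # acquire exclusive lock (blocks if held)
echo "lock held"
flock -u 9                     # release explicitly
exec 9>&-                      # close the descriptor
```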
>
> ceph-fuse has gained support for soft quotas, enforced on the client side.
> (Yunchuan Wen) The Ubuntu Kylin guys worked on this for quite a while and
> we thank them for their work and their patience. You can now specify soft
> quotas on a directory and ceph-fuse will behave as you’d expect from that.
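>
> The quotas are driven by CephFS virtual xattrs, so setting one might look like this on a ceph-fuse mount (the mount point and limit values are placeholders, not from the original post):

```shell
# Set a client-enforced quota on a directory of a ceph-fuse mount via
# CephFS virtual xattrs; the path and limits are placeholders.
setfattr -n ceph.quota.max_bytes -v 10000000000 /mnt/cephfs/project
setfattr -n ceph.quota.max_files -v 100000 /mnt/cephfs/project
getfattr -n ceph.quota.max_bytes /mnt/cephfs/project   # read it back
```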
>
> Hadoop support has been generally improved and updated. (Noah Watkins,
> Huamin Chen) It now works against the 2.0 API, the tests we run in our lab
> are more sophisticated, and it’s a lot friendlier to install with Maven and
> other Java tools. Noah’s still doing work on this to make it as turnkey as
> possible, but soon you’ll just need to drop a single JAR on the system
> (this will include the libcephfs stuff, so you don’t even need to worry
> about those packages and compatibility!) and change a few config options.
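>
> For the curious, the "few config options" are Hadoop core-site.xml entries along these lines (a sketch only — property names follow the cephfs-hadoop bindings, and the monitor host/port are placeholders):

```xml
<!-- Sketch of core-site.xml entries for the CephFS Hadoop bindings;
     "mon-host:6789" is a placeholder monitor address. -->
<property>
  <name>fs.defaultFS</name>
  <value>ceph://mon-host:6789/</value>
</property>
<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>
```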
>
> ceph-fuse and CephFS as a whole now have much-improved full-space
> handling. If you run out of space at the RADOS layer you will get ENOSPC
> errors in the client (instead of it retrying indefinitely), and these
> errors (and others) are now propagated out to fsync() and close() calls.
>
> We are now much more consistent in our handling of timestamps. Previously
> we attempted to take the time from whichever process was responsible for
> making a change, which could be either a client or the MDS. But this was
> troublesome if their times weren’t synced — made worse by trying not to let
> the time move backwards — and some applications which relied on sharing
> mtime and ctime values as versions (Hadoop and rsync both did this in
> certain configurations) were unhappy. We now use a timestamp provided by
> the client for all operations, which has been more stable.
>
> Certain internal data structures are now much more scalable on a
> per-client level. We had issues when certain “MDSTables” got too large, but
> John Spray sorted them out.
>
> The reconnect phase, when an MDS is restarted or dies and the clients have
> to connect to a different daemon, has been made much faster in the typical
> case. (Yan, Zheng)
>
> ===========================================================================
> Administrator features & improvements:
>
> The MDS has gained an OpTracker, with functionality similar to that in the
> OSD. You can dump in-flight requests and notably slow ones from the recent
> past. The changes to enable this also made working with many code paths a
> lot easier.
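>
> On a running system those dumps are available over the MDS admin socket, mirroring the OSD's equivalents — for example (the daemon name "mds.a" is a placeholder):

```shell
# Dump in-flight and recent notably-slow MDS requests via the admin
# socket; "mds.a" is a placeholder daemon name.
ceph daemon mds.a dump_ops_in_flight
ceph daemon mds.a dump_historic_ops
```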
>
> We’ve changed how you create and manage CephFS file systems in a cluster.
> (John Spray) The “data” and “metadata” pools are no longer created by
> default, and the management is done via monitor commands that start with
> “ceph fs” (e.g., “ceph fs new”). These have been designed with future
> extensions in mind, but for now they mostly replicate existing features
> with more consistency and improved repeatability/idempotency.
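>
> A minimal sketch of the new workflow (pool names and pg_num are illustrative, not from the original post):

```shell
# Create the data and metadata pools yourself (they are no longer made
# by default), then tie them into a filesystem with "ceph fs new".
ceph osd pool create cephfs_metadata 64
ceph osd pool create cephfs_data 64
ceph fs new cephfs cephfs_metadata cephfs_data
ceph fs ls    # confirm the new filesystem exists
```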
>
> The MDS now reports on a variety of health metrics to the monitor, joining
> the existing OSD and monitor health reports. These include information on
> misbehaving clients and MDS data structures. (John Spray)
>
> The MDS admin socket now includes a bunch of new commands. You can examine
> and evict client sessions, plus do things around filesystem repair (see
> below).
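>
> For example (the daemon name and client id below are placeholders):

```shell
# Inspect and evict client sessions over the MDS admin socket;
# "mds.a" and client id 4305 are placeholders.
ceph daemon mds.a session ls
ceph daemon mds.a session evict 4305
```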
>
> The MDS now gathers metadata from the clients about who they are and
> shares that with users via a variety of helpful interfaces and warning
> messages. (John Spray)
>
> ===========================================================================
> Recovery tools
>
> We have a new MDS journal format and a new cephfs-journal-tool. (John
> Spray) This eliminates the days of needing to hexedit a journal dump in
> order to let your MDS start back up — you can inspect the journal state
> (human-readable or json, great for our testing!) and make changes on a
> per-event level. It also includes the ability to scan through hopelessly
> broken journals and parse out whatever data is available for flushing to
> the backing RADOS objects.
> Similarly, there’s a cephfs-table-tool for working with the SessionTable,
> InoTable, and SnapTable. (John Spray)
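>
> A sketch of how those tools get used in practice — export the journal before any surgery; subcommand names follow the Hammer-era docs, and the file names are placeholders:

```shell
# Inspect, back up, and salvage an MDS journal with the new tools.
cephfs-journal-tool journal inspect                    # integrity check
cephfs-journal-tool journal export backup.bin          # dump before surgery
cephfs-journal-tool event get json --path events.json  # human-readable dump
cephfs-journal-tool event recover_dentries summary     # salvage to RADOS
cephfs-table-tool all reset session                    # companion table tool
```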
>
> We’ve added new “scrub_path” and “flush_path” commands to the admin
> socket. These are fairly limited right now but will check that both
> directories and files are self-consistent. It’s a building block for the
> "forward scrub" and fsck features that I’ve been working on, and includes a
> lot of code-level work to enable those.
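>
> For example, against a placeholder daemon and path:

```shell
# Flush journaled metadata for a subtree to the backing objects, then
# scrub it for self-consistency; "mds.a" and the path are placeholders.
ceph daemon mds.a flush_path /some/dir
ceph daemon mds.a scrub_path /some/dir
```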
>
> ===========================================================================
> Performance improvements
>
> Both the kernel and userspace clients are a lot more efficient with some
> of their “capability” and directory content handling. This lets them serve
> a lot more out of local cache, a lot more often, than they were able to
> previously. This is particularly noticeable in workloads where a single
> client “owned” a directory but another client periodically peeked in on it.
> There are also a bunch of extra improvements in this area that have gone
> in since Hammer and will be released in Infernalis. ;)
>
> The code in the MDS that handles the journaling has been split into a
> separate thread. (Yan, Zheng) This has increased maximum throughput a fair
> bit and is the first major improvement enabled by John’s work to start
> breaking down the big MDS lock. (We still have a big MDS lock, but in
> addition to the journal it no longer covers the Objecter. Setting up the
> interfaces to make that manageable should make future lock sharding and
> changes a lot simpler than they would have been previously.)
>
> ===========================================================================
> Developer & test improvements
>
> In addition to a slightly expanded set of black-box tests, we are now
> testing FS behaviors to make sure everything behaves as expected in
> specific scenarios (failure and otherwise). This is largely thanks to John,
> but we’re doing more with it in general as we add features that can be
> tested this way.
>
> As alluded to in previous sections, we’ve done a lot of work that makes
> the MDS codebase a lot easier to work with. Interfaces, if not exactly
> bright and shining, are a lot cleaner than they used to be. Locking is a
> lot more explicit and easier to reason about in many places. There are
> fewer special paths for specific kinds of operations, and a lot more shared
> paths that everything goes through — which means we have more invariants we
> can assume on every operation.
>
> ===========================================================================
> Notable bug reductions
>
> Although we continue to leave snapshots disabled by default and don’t
> recommend multi-MDS systems, both of these have been *dramatically*
> improved by Zheng’s hard work. Our multimds suite now passes almost all of
> the existing tests, whereas it previously failed most of them (
> http://pulpito.ceph.com/?suite=multimds). Our snapshot tests now pass
> reliably, and using snapshots is no longer a shortcut to breaking your
> system.
>
>
> There’s a lot more I haven’t discussed above, like how the entire stack is
> a lot more tolerant of failures elsewhere than it used to be and so bugs
> are less likely to make your entire filesystem inaccessible. But those are
> some of the biggest features and improvements that users are likely to
> notice or might have been waiting on before they decided to test it out.
> It’s nice to reflect occasionally — I knew we were getting a lot done, but
> this list is much longer than I’d initially thought it would be!
> -Greg
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>