Thanks for all your hard work on CephFS. This progress is very exciting to hear about. I am constantly amazed at the amount of work that gets done in Ceph in so short an amount of time.
On Mon, Apr 20, 2015 at 6:26 PM, Gregory Farnum <[email protected]> wrote:
> We’ve been hard at work on CephFS over the last year since Firefly was released, and with Hammer coming out it seemed like a good time to go over some of the big developments users will find interesting. Much of this is cribbed from John’s Linux Vault talk (http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf), in addition to the release notes (http://ceph.com/docs/master/release-notes/).
>
> ===========================================================================
> New Filesystem features & improvements:
>
> ceph-fuse has gained support for fcntl and flock locking. (Yan, Zheng) This has been in the kernel for a while, but nobody had done the work to implement the tracking structures and wire it up in userspace.
>
> ceph-fuse has gained support for soft quotas, enforced on the client side. (Yunchuan Wen) The Ubuntu Kylin guys worked on this for quite a while, and we thank them for their work and their patience. You can now specify soft quotas on a directory and ceph-fuse will behave as you’d expect.
>
> Hadoop support has been generally improved and updated. (Noah Watkins, Huamin Chen) It now works against the 2.0 API, the tests we run in our lab are more sophisticated, and it’s a lot friendlier to install with Maven and other Java tools. Noah is still working to make it as turnkey as possible; soon you’ll just need to drop a single JAR on the system (this will include the libcephfs bits, so you won’t even need to worry about those packages and compatibility!) and change a few config options.
>
> ceph-fuse, and CephFS as a whole, now have much-improved full-space handling. If you run out of space at the RADOS layer you will get ENOSPC errors in the client (instead of it retrying indefinitely), and these errors (and others) are now propagated out to fsync and fclose calls.
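The flock support is great news. For anyone who wants to try it, nothing CephFS-specific is involved on the application side; both lock flavours come through the standard interfaces. A quick Python sketch (point the path at a file on a ceph-fuse mount to exercise the new support; a local file behaves identically for demonstration):

```python
import fcntl
import os
import tempfile

def exercise_locks(path):
    """Take and release both lock flavours ceph-fuse now supports."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # BSD-style flock() whole-file lock
        fcntl.flock(fd, fcntl.LOCK_UN)
        fcntl.lockf(fd, fcntl.LOCK_EX)   # POSIX fcntl()-style record lock
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)

# The path here is just a throwaway local file for illustration.
exercise_locks(os.path.join(tempfile.gettempdir(), "cephfs-lock-demo"))
```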
> We are now much more consistent in our handling of timestamps. Previously we attempted to take the time from whichever process was responsible for making a change, which could be either a client or the MDS. This was troublesome if their clocks weren’t synced (made worse by trying not to let time move backwards), and some applications that relied on sharing mtime and ctime values as versions (Hadoop and rsync both did this in certain configurations) were unhappy. We now use a timestamp provided by the client for all operations, which has proven more stable.
>
> Certain internal data structures are now much more scalable at a per-client level. We had issues when certain “MDSTables” got too large, but John Spray sorted them out.
>
> The reconnect phase, when an MDS restarts or dies and the clients have to connect to a different daemon, has been made much faster in the typical case. (Yan, Zheng)
>
> ===========================================================================
> Administrator features & improvements:
>
> The MDS has gained an OpTracker, with functionality similar to that in the OSD. You can dump in-flight requests, and notably slow ones from the recent past. The changes to enable this also made many code paths a lot easier to work with.
>
> We’ve changed how you create and manage CephFS file systems in a cluster. (John Spray) The “data” and “metadata” pools are no longer created by default, and management is done via monitor commands that start with “ceph fs” (e.g., “ceph fs new”). These have been designed with future extensions in mind, but for now they mostly replicate existing features with more consistency and improved repeatability/idempotency.
>
> The MDS now reports a variety of health metrics to the monitor, joining the existing OSD and monitor health reports. These include information on misbehaving clients and MDS data structures. (John Spray)
>
> The MDS admin socket now includes a bunch of new commands. You can examine and evict client sessions, plus do things around filesystem repair (see below).
>
> The MDS now gathers metadata from clients about who they are and shares it with users via a variety of helpful interfaces and warning messages. (John Spray)
>
> ===========================================================================
> Recovery tools
>
> We have a new MDS journal format and a new cephfs-journal-tool. (John Spray) This ends the days of needing to hex-edit a journal dump in order to let your MDS start back up: you can inspect the journal state (human-readable or JSON, great for our testing!) and make changes at a per-event level. It also includes the ability to scan through hopelessly broken journals and parse out whatever data is available for flushing to the backing RADOS objects.
>
> Similarly, there’s a cephfs-table-tool for working with the SessionTable, InoTable, and SnapTable. (John Spray)
>
> We’ve added new “scrub_path” and “flush_path” commands to the admin socket. These are fairly limited right now, but will check that both directories and files are self-consistent. They are building blocks for the “forward scrub” and fsck features I’ve been working on, and a lot of code-level work went in to enable those.
>
> ===========================================================================
> Performance improvements
>
> Both the kernel and userspace clients are a lot more efficient in some of their “capability” and directory-content handling. This lets them serve a lot more out of local cache, a lot more often, than they could previously. It is particularly noticeable in workloads where a single client “owns” a directory but another client periodically peeks in on it. There are also a bunch of further improvements in this area that have gone in since Hammer and will be released in Infernalis.
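The OpTracker sounds very useful for debugging. As a sketch of how one might consume it: the admin socket dump returns JSON you can filter for slow requests. The command name follows the OSD's `dump_ops_in_flight`, but the field names in the sample below are my assumption, so check the actual output on your version:

```python
import json

# Abridged, hypothetical output of `ceph daemon mds.<id> dump_ops_in_flight`;
# the real field names may differ by version, so check your own daemon's dump.
sample = json.loads("""
{
  "num_ops": 2,
  "ops": [
    {"description": "client_request(client.4211:1234 getattr)", "age": 0.42},
    {"description": "client_request(client.4211:1235 setattr)", "age": 31.7}
  ]
}
""")

def slow_ops(dump, threshold=30.0):
    """Return descriptions of in-flight ops older than `threshold` seconds."""
    return [op["description"]
            for op in dump.get("ops", [])
            if op.get("age", 0.0) > threshold]

print(slow_ops(sample))  # only the 31.7s request crosses the 30s threshold
```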
> ;)
>
> The code in the MDS that handles journaling has been split into a separate thread. (Yan, Zheng) This has increased maximum throughput a fair bit and is the first major improvement enabled by John’s work to start breaking down the big MDS lock. (We still have a big MDS lock, but, in addition to the journal, it no longer covers the Objecter. Setting up the interfaces to make that manageable should make future lock sharding and changes a lot simpler than they would have been previously.)
>
> ===========================================================================
> Developer & test improvements
>
> In addition to a slightly expanded set of black-box tests, we are now testing FS behaviors to make sure everything behaves as expected in specific scenarios (failure and otherwise). This is largely thanks to John, but we’re doing more with it in general as we add features that can be tested this way.
>
> As alluded to in previous sections, we’ve done a lot of work to make the MDS codebase easier to work with. Interfaces, if not exactly bright and shining, are a lot cleaner than they used to be. Locking is a lot more explicit and easier to reason about in many places. There are fewer special paths for specific kinds of operations, and a lot more shared paths that everything goes through, which means we have more invariants we can assume on every operation.
>
> ===========================================================================
> Notable bug reductions
>
> Although we continue to leave snapshots disabled by default and don’t recommend multi-MDS systems, both of these have been *dramatically* improved by Zheng’s hard work. Our multimds suite now passes almost all of the existing tests, whereas it previously failed most of them (http://pulpito.ceph.com/?suite=multimds). Our snapshot tests pass reliably, and using them is no longer a shortcut to breaking your system.
>
> There’s a lot more I haven’t discussed above, like how the entire stack is a lot more tolerant of failures elsewhere than it used to be, so bugs are less likely to make your entire filesystem inaccessible. But those are some of the biggest features and improvements that users are likely to notice or might have been waiting on before deciding to test it out. It’s nice to reflect occasionally: I knew we were getting a lot done, but this list is much longer than I’d initially thought it would be!
> -Greg
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
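One practical note on the ENOSPC change mentioned above: now that a full cluster surfaces errors at fsync/close instead of the client retrying forever, application code can and should actually check them. A minimal pattern (plain POSIX error handling, nothing CephFS-specific):

```python
import errno
import os
import tempfile

def flush_and_close(fd):
    """Flush and close fd, returning None on success or an error string.

    With Hammer, running out of space at the RADOS layer surfaces ENOSPC
    here rather than the client retrying indefinitely, so callers should
    check rather than assume these calls always succeed.
    """
    try:
        os.fsync(fd)
        os.close(fd)
        return None
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return "cluster out of space"
        return os.strerror(e.errno)

# Throwaway local file for illustration; on CephFS the fd would come from
# a file on a ceph-fuse mount.
fd = os.open(os.path.join(tempfile.gettempdir(), "cephfs-flush-demo"),
             os.O_RDWR | os.O_CREAT, 0o600)
flush_and_close(fd)
```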
