Ah, except for the snapmapper. We can split the snapmapper in the same way, though, as long as we are careful with the name. -Sam
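A rough sketch, not Ceph code, of the partitioning described in the quoted mail below: one rocksdb instance per partition of the pg space, with every key a pg touches (object data, pg log/info, and the snapmapper entries, whose names would need to embed the pg) hashing to the same shard. ShardedStore, shard_for_pg, and the snapmapper key format are illustrative assumptions only.

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Partition the object keyspace into N shards, one rocksdb instance per
// shard, so transactions from different sequencers never contend on the
// same KV store or allocator.
class ShardedStore {
  std::vector<std::unique_ptr<rocksdb::DB>> shards_;

 public:
  ShardedStore(const std::string& base, unsigned nshards) {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    for (unsigned i = 0; i < nshards; ++i) {
      rocksdb::DB* db = nullptr;
      rocksdb::Status s =
          rocksdb::DB::Open(opts, base + "/shard." + std::to_string(i), &db);
      assert(s.ok());
      shards_.emplace_back(db);
    }
  }

  // Shard selection is a pure function of the pg, so every key a pg owns
  // lands in the same instance.
  rocksdb::DB* shard_for_pg(uint64_t pgid) {
    return shards_[pgid % shards_.size()].get();
  }

  // A snapmapper key that carries the pg in its name ("careful with the
  // name"), so it shards together with the rest of the pg's keys.
  static std::string snapmapper_key(uint64_t pgid, uint64_t snap) {
    return "SNA_" + std::to_string(pgid) + "_" + std::to_string(snap);
  }

  // Submit one transaction; each shard syncs and allocates independently,
  // which is where the parallelism on fast ssds would come from.
  void submit(uint64_t pgid, rocksdb::WriteBatch& txn) {
    rocksdb::WriteOptions wo;
    wo.sync = true;
    rocksdb::Status s = shard_for_pg(pgid)->Write(wo, &txn);
    assert(s.ok());
  }
};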
On Thu, Oct 22, 2015 at 4:42 PM, Samuel Just <[email protected]> wrote:
> Since the changes which moved the pg log and the pg info into the pg object space, I think it's now the case that any transaction submitted to the objectstore updates a disjoint range of objects determined by the sequencer. It might be easier to exploit that parallelism if we control allocation and allocation-related metadata. We could split the store into N pieces which partition the pg space (plus one additional piece for the meta sequencer?) with one rocksdb instance for each. Space could then be parcelled out in large pieces (a small frequency of global allocation decisions) and managed more finely within each partition. The main challenge would be avoiding internal fragmentation of those pieces, but at least defragmentation can be managed on a per-partition basis. Such parallelism is probably necessary to exploit the full throughput of some ssds.
> -Sam
>
> On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI <[email protected]> wrote:
>> Hi Sage and other fellow cephers,
>> I truly share the pain with you all about filesystems while I am working on the objectstore to improve performance. As mentioned, there is nothing wrong with filesystems; it is just that Ceph, as one use case, needs more support than filesystems will provide in the near future, whatever the reasons.
>>
>> There are many new techniques emerging that can help improve OSD performance. User-space drivers (DPDK from Intel) are one of them. They give you not only a storage allocator but also thread-scheduling support, CPU affinity, NUMA friendliness, and polling, which might fundamentally change the performance of the objectstore. It should not be hard to improve CPU utilization 3x~5x, achieve higher IOPS, etc.
>> I totally agree that the goal of filestore is to give enough support for the filesystem approach with either the 1, 1b, or 2 solutions. In my humble opinion, the new design goal of the objectstore should focus on giving the best performance for the OSD with new techniques. These two goals are not going to conflict with each other; they serve different purposes, making Ceph not only more stable but also better.
>>
>> Scylla, mentioned by Orit, is a good example.
>>
>> Thanks all.
>>
>> Regards,
>> James
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Sage Weil
>> Sent: Thursday, October 22, 2015 5:50 AM
>> To: Ric Wheeler
>> Cc: Orit Wasserman; [email protected]
>> Subject: Re: newstore direction
>>
>> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>>> You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers have all migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. I don't know of any non-standard file systems in use and have only seen one account running on a raw block store in 8 years :)
>>>
>>> If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of the IOs sent to the device.
>>>
>>> If we are causing additional IOs, then we really need to spend some time talking to the local file system gurus about this in detail. I can help with that conversation.
>>
>> If the file is truly preallocated (that is, prewritten with zeros... fallocate doesn't help here because the extents are marked unwritten), then sure: there is very little change in the data path.
>>
>> But at that point, what is the point? This only works if you have one (or a few) huge files and the user-space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.). Do they just do this to ease administrative tasks like backup?
>>
>> This is the fundamental tradeoff:
>>
>> 1) We have a file per object. We fsync like crazy, and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.
>>
>> 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to doing: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, the setext ioctl, etc.
>>
>> 2) We preallocate huge files and write a user-space object system that lives within them (pretending the file is a block device). The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid). But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code.
>>
>> At the end of the day, 1 and 1b are always going to be slower than 2. And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2. On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower. Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive.
>>
>> Also note that every time we have strayed off the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file system bugs. And that's assuming we get everything we need upstream... which is probably a year's endeavour.
>>
>> Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph. Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense for a ton of different systems. But our situation is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend). And as you know, performance is a huge pain point. We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value.
>>
>> And I'm tired of half measures. I want the OSD to be as fast as we can make it given the architectural constraints (RADOS consistency and ordering semantics). This is truly low-hanging fruit: it's modular, self-contained, pluggable, and this will be my third time around this particular block.
>>
>> sage
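For reference, a minimal sketch (POSIX/Linux; an assumption about how option 2 would look, not actual NewStore code) of the "truly preallocated file + O_DIRECT" path discussed above: the file is prewritten with zeros once, since fallocate alone leaves the extents flagged unwritten, and steady-state writes are then aligned O_DIRECT writes into it. The file name and sizes are made up for illustration.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for O_DIRECT
#endif
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>

int main() {
  const size_t file_size = 1 << 30;  // 1 GiB backing file, illustrative
  const size_t block = 4096;         // alignment required by O_DIRECT

  // One-time preallocation: write real zeros so every extent is allocated
  // and marked written; fallocate() alone would leave them unwritten and
  // force extent conversion (metadata updates) on first write.
  int fd = open("objectstore.img", O_CREAT | O_WRONLY, 0600);
  assert(fd >= 0);
  void* buf = nullptr;
  int rc = posix_memalign(&buf, block, block);
  assert(rc == 0);
  memset(buf, 0, block);
  for (size_t off = 0; off < file_size; off += block) {
    ssize_t r = pwrite(fd, buf, block, off);
    assert(r == (ssize_t)block);
  }
  fsync(fd);
  close(fd);

  // Steady state: aligned O_DIRECT writes bypass the page cache and hit the
  // device with essentially no filesystem work left in the data path.
  fd = open("objectstore.img", O_WRONLY | O_DIRECT);
  assert(fd >= 0);
  memset(buf, 0xab, block);
  ssize_t r = pwrite(fd, buf, block, 8 * block);
  assert(r == (ssize_t)block);
  close(fd);
  free(buf);
  return 0;
}

Everything beyond that point (allocation, an internal journal, garbage collection) is exactly the user-space complexity that option 2 concedes.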
