On Thu, 2015-10-22 at 02:12 +0000, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is the way basic trade-offs in storage management software architecture are being affected. In the HDD world, CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance, which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered by the latest generation of NVMe SSDs, you rapidly see that the storage itself is generally no longer the bottleneck (speaking about BW, not latency, of course); rather, it's the system sitting in front of the storage that is the bottleneck. Generally it's the CPU cost of an IOP.
>
> When Sandisk first started working with Ceph (Dumpling), the design of librados and the OSD led to a situation where the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been re-written (more than once!) and some of the internal queueing in the OSD (and the associated context switches) has been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly.
>
> Because we didn't want to completely re-write the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code.
>
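To make the "thread per CPU core" model concrete, here is a minimal, hypothetical C++ sketch (not Ceph code, and far simpler than what Seastar actually does). Each worker thread is pinned to one core and runs requests from its own queue to completion, so a request never migrates between threads on the IO path. The Shard/Request names are illustrative only, and the mutex-guarded queue just keeps the sketch short; a real shard-per-core design would hand requests over with lock-free message passing.

// Hypothetical sketch, not Ceph or Seastar code: a "thread per CPU core" layout
// where each shard owns its requests end-to-end, so no IO-path work hops
// between threads. Build (Linux): g++ -O2 -pthread shard_sketch.cc
#include <pthread.h>
#include <sched.h>

#include <atomic>
#include <chrono>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Request { int client_fd; unsigned long offset; };  // illustrative only

class Shard {
public:
  // Hand a request to this shard; the owning core does all further work.
  void submit(Request r) {
    std::lock_guard<std::mutex> g(lock_);  // a real shard would use a lock-free SPSC ring
    queue_.push_back(r);
  }

  void run(int core) {
    pin_to_core(core);
    while (!stop_.load(std::memory_order_relaxed)) {
      Request r;
      {
        std::lock_guard<std::mutex> g(lock_);
        if (queue_.empty()) continue;      // a real loop would poll devices or park
        r = queue_.front();
        queue_.pop_front();
      }
      handle(r);                           // run to completion on this core
    }
  }

  void stop() { stop_.store(true); }

private:
  static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }

  static void handle(const Request& r) {
    // Placeholder for the actual IO submission (e.g. io_submit or a user-space queue pair).
    std::printf("core-local handling of offset %lu\n", r.offset);
  }

  std::mutex lock_;
  std::deque<Request> queue_;
  std::atomic<bool> stop_{false};
};

int main() {
  unsigned cores = std::thread::hardware_concurrency();
  if (cores == 0) cores = 1;

  std::vector<Shard> shards(cores);
  std::vector<std::thread> workers;
  for (unsigned c = 0; c < cores; ++c)
    workers.emplace_back([&shards, c] { shards[c].run(static_cast<int>(c)); });

  // Route by a stable key (here: the fd) so the same shard always services the
  // same client, keeping its state hot in that core's cache.
  for (int i = 0; i < 8; ++i)
    shards[i % cores].submit(Request{i, static_cast<unsigned long>(i) * 4096});

  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  for (auto& s : shards) s.stop();
  for (auto& w : workers) w.join();
  return 0;
}

Nothing above is how the OSD is structured today; it is just the shape of the "at least part-way" refactor described above, where routing by a stable key is what keeps a client's state hot in a single core's cache.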
+1. It's not just about reducing context switches, but also about removing contention and data copies, and getting better cache utilization. ScyllaDB just did this to Cassandra (using the Seastar library):
http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/

Orit

> I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future.
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> [email protected]
>
> -----Original Message-----
> From: Martin Millnert [mailto:[email protected]]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson <[email protected]>
> Cc: Ric Wheeler <[email protected]>; Allen Samuels <[email protected]>; Sage Weil <[email protected]>; [email protected]
> Subject: Re: newstore direction
>
> Adding 2c
>
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> > My thought is that there is some inflection point where the userland kvstore/block approach is going to be less work, for everyone I think, than trying to quickly discover, understand, fix, and push upstream patches that sometimes only really benefit us. I don't know if we've truly hit that point, but it's tough for me to find flaws with Sage's argument.
>
> Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread: in the networking world, there has been development of memory-mapped userland networking (multiple approaches exist), which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches, and of streamlining processor cache management. People have gone as far as removing CPU cores from the CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic.
>
> Granted, storage IO operations may be heavy enough in cycle counts that context switches never appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high-performance IO, *every* hurdle in the data path counts toward the total latency. (And really, high-performance random IO starts to approach the per-packet handling characteristics of networking.) Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case reduces dependency on code that works as well as possible for the general case, and allows for very purpose-built code to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples users from e.g. waiting for the next distro release before being able to take up the benefits of improvements to the storage code.
>
> A random google came up with related data on where "doing something way different" /can/ have significant benefits:
> http://phunq.net/pipermail/tux3/2015-April/002147.html
>
> I (FWIW) certainly agree there is merit to the idea. The scientific approach here could perhaps be to simply enumerate all the corner cases of a "generic FS" that actually cause the experienced issues, and assess the probability of them being solved (and if so, when). That *could* improve the chances of approaching consensus, which wouldn't hurt, I suppose?
>
> BR,
> Martin
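To make the "better control over the data path" point concrete at the syscall level, here is a second hypothetical C++ sketch (again an illustration, not something proposed above): it submits one O_DIRECT read through Linux native AIO (libaio) and polls for the completion instead of parking the thread in a blocking read(). The file path and sizes are arbitrary assumptions, the target filesystem must support O_DIRECT, and a real backend would fold the poll into its per-core event loop rather than spinning.

// Hypothetical illustration: one O_DIRECT read submitted via Linux native AIO,
// with the completion polled rather than waited for.
// Build (Linux): g++ -O2 aio_poll.cc -laio
#include <fcntl.h>
#include <libaio.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main() {
  const char* path = "/var/tmp/aio-demo.dat";  // assumed test file of at least 4 KiB
  const size_t block = 4096;                   // O_DIRECT requires aligned buffers and sizes

  int fd = open(path, O_RDONLY | O_DIRECT);
  if (fd < 0) { perror("open"); return 1; }

  void* buf = nullptr;
  if (posix_memalign(&buf, block, block) != 0) return 1;

  io_context_t ctx = 0;
  if (io_setup(8, &ctx) < 0) { perror("io_setup"); return 1; }

  struct iocb cb;
  struct iocb* cbs[1] = { &cb };
  io_prep_pread(&cb, fd, buf, block, 0);       // read the first block of the file
  if (io_submit(ctx, 1, cbs) != 1) { perror("io_submit"); return 1; }

  // Poll with a zero timeout so the kernel never puts this thread to sleep; a
  // purpose-built reactor would service other shard work between polls.
  struct io_event ev;
  struct timespec zero = {0, 0};
  long done = 0;
  while (done == 0)
    done = io_getevents(ctx, 0, 1, &ev, &zero);
  if (done < 0) { std::fprintf(stderr, "io_getevents: %ld\n", done); return 1; }

  std::printf("read %lu bytes without blocking in read()\n", (unsigned long)ev.res);

  io_destroy(ctx);
  free(buf);
  close(fd);
  return 0;
}

Even this still crosses the kernel on every submit and poll; the argument in the thread is that a backend which owns the whole data path gets to decide where such crossings are worth it, instead of inheriting the behaviour of a generic FS.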
