On Thu, 2015-10-22 at 02:12 +0000, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is 
> the way basic trade-offs in storage management software architecture are 
> being affected. In the HDD world, CPU time per IOP was relatively 
> inconsequential, i.e., it had little effect on overall performance, which 
> was limited by the physics of the hard drive. Flash is now inverting that 
> situation. When you look at the performance levels being delivered by the 
> latest generation of NVMe SSDs, you rapidly see that the storage itself is 
> generally no longer the bottleneck (speaking about bandwidth, not latency, 
> of course) but rather the system sitting in front of the storage. 
> Generally it's the CPU cost of an IOP.
> 
> When SanDisk first started working with Ceph (Dumpling), the design of 
> librados and the OSD led to a situation where the CPU cost of an IOP was 
> dominated by context switches and network socket handling. Over time, much 
> of that has been addressed: the socket handling code has been re-written 
> (more than once!) and some of the internal queueing in the OSD (and the 
> associated context switches) has been eliminated. As the CPU costs have 
> dropped, performance on flash has improved accordingly.
> 
> Because we didn't want to completely re-write the OSD (time-to-market and 
> stability drove that decision), we didn't move it from the current "thread 
> per IOP" model to a truly asynchronous "thread per CPU core" model that 
> essentially eliminates context switches in the IO path. But a fully 
> optimized OSD would go down that path (at least part-way). I believe it's 
> been proposed in the past. Perhaps a hybrid "fast-path" approach could get 
> most of the benefits while preserving much of the legacy code.
> 

+1
It's not just about reducing context switches, but also about removing
contention and data copies and getting better cache utilization.

ScyllaDB just did this for Cassandra (using the Seastar library):
http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/
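
For anyone who hasn't seen the pattern, below is a rough sketch of the
shard-per-core idea (hypothetical code, not Seastar's or Ceph's actual
API): one thread per core, each owning its own slice of the data and its
own mailbox, with cross-core work handed off as messages instead of
through shared locks.

  // shard_sketch.cc -- illustrative only; the names here are made up and
  // are not Seastar or Ceph APIs.  Each core gets one thread, its own
  // mailbox and its own slice of the key space, so the hot path never
  // contends with other cores for its data.  (CPU pinning omitted for brevity.)
  #include <algorithm>
  #include <condition_variable>
  #include <cstdio>
  #include <deque>
  #include <functional>
  #include <mutex>
  #include <string>
  #include <thread>
  #include <unordered_map>
  #include <vector>

  struct Shard {
    std::unordered_map<std::string, std::string> kv;  // owned by this shard only
    std::deque<std::function<void(Shard&)>> mailbox;  // work sent to this shard
    std::mutex mtx;               // guards the mailbox only, never the data
    std::condition_variable cv;
    bool stop = false;

    void run() {
      for (;;) {
        std::function<void(Shard&)> task;
        {
          std::unique_lock<std::mutex> lk(mtx);
          cv.wait(lk, [&] { return stop || !mailbox.empty(); });
          if (stop && mailbox.empty()) return;   // drain, then exit
          task = std::move(mailbox.front());
          mailbox.pop_front();
        }
        task(*this);  // exclusive access to this shard's kv, no locking needed
      }
    }

    void submit(std::function<void(Shard&)> task) {
      { std::lock_guard<std::mutex> lk(mtx); mailbox.push_back(std::move(task)); }
      cv.notify_one();
    }
  };

  int main() {
    const unsigned nshards = std::max(1u, std::thread::hardware_concurrency());
    std::vector<Shard> shards(nshards);
    std::vector<std::thread> threads;
    for (auto& s : shards) threads.emplace_back([&s] { s.run(); });

    // Route every request for a key to the shard that owns it; since one
    // thread services each shard in FIFO order, the write below is always
    // visible to the read that follows it, with no locks on the data itself.
    auto owner = [&](const std::string& key) -> Shard& {
      return shards[std::hash<std::string>{}(key) % nshards];
    };
    owner("object-1").submit([](Shard& s) { s.kv["object-1"] = "payload"; });
    owner("object-1").submit([](Shard& s) {
      std::printf("object-1 -> %s\n", s.kv["object-1"].c_str());
    });

    for (auto& s : shards) {
      { std::lock_guard<std::mutex> lk(s.mtx); s.stop = true; }
      s.cv.notify_one();
    }
    for (auto& t : threads) t.join();
    return 0;
  }

Because a shard's data is only ever touched by its own thread, the hot
path takes no locks on that data and stays warm in one core's caches. A
real implementation would use lock-free queues, futures and CPU pinning
rather than the mutex-guarded mailbox above, which is there only to keep
the sketch short.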

Orit

> I believe this trend toward thread-per-core software development will also 
> tend to support the "do it in user-space" trend. That's because most of the 
> kernel and file-system interface is architected around the blocking 
> "thread-per-IOP" model and is unlikely to change in the future.
> 
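To make the "blocking thread-per-IOP" point above concrete, here is a
minimal illustration (plain POSIX, not Ceph code; the file path is just a
placeholder): every in-flight read parks a whole thread in the kernel, so
keeping N IOs in flight costs N threads plus the context switches to wake
them.

  // blocking_io.cc -- illustration only, not Ceph code.  With the classic
  // POSIX interface every in-flight read parks a whole thread in the kernel,
  // which is the "thread per IOP" model: N concurrent IOs need N threads.
  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdio>
  #include <thread>
  #include <vector>

  int main(int argc, char** argv) {
    // Placeholder path; pass any readable file, ideally on the device under test.
    const char* path = argc > 1 ? argv[1] : "/etc/hostname";
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    const int inflight = 8;           // 8 concurrent reads -> 8 blocked threads
    std::vector<std::thread> workers;
    for (int i = 0; i < inflight; ++i) {
      workers.emplace_back([fd, i] {
        char buf[4096];
        // pread() blocks this thread until the data arrives; the thread does
        // nothing useful while it waits, and waking it costs a context switch.
        ssize_t n = ::pread(fd, buf, sizeof(buf), 0);
        std::printf("thread %d read %zd bytes\n", i, n);
      });
    }
    for (auto& t : workers) t.join();
    ::close(fd);
    return 0;
  }

An asynchronous submission interface (e.g. Linux AIO) would let a single
thread keep many IOs in flight instead, which is what a thread-per-core
OSD would want from the layer below it.
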
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> [email protected]
> 
> -----Original Message-----
> From: Martin Millnert [mailto:[email protected]]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson <[email protected]>
> Cc: Ric Wheeler <[email protected]>; Allen Samuels 
> <[email protected]>; Sage Weil <[email protected]>; 
> [email protected]
> Subject: Re: newstore direction
> 
> Adding 2c
> 
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> > My thought is that there is some inflection point where the userland
> > kvstore/block approach is going to be less work, for everyone I think,
> > than trying to quickly discover, understand, fix, and push upstream
> > patches that sometimes only really benefit us.  I don't know if we've
> > truly hit that point, but it's tough for me to find flaws with
> > Sage's argument.
> 
> Regarding the userland / kernel-land aspect of the topic, there are further 
> aspects AFAIK not yet addressed in the thread: 
> In the networking world there has been development of memory-mapped 
> userland networking (multiple approaches exist), which for packet handling 
> has the benefit - for very, very specific networking applications - of 
> avoiding e.g. per-packet context switches and of streamlining processor 
> cache behaviour. People have gone as far as removing CPU cores from the 
> CPU scheduler to dedicate them completely to the networking task at hand 
> (cache optimizations). There are various latency/throughput (batching) 
> optimizations applicable, but at the end of the day it's about keeping the 
> CPU bus busy with "revenue" traffic.
> 
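As a concrete sketch of the "dedicate a core and poll" pattern described
above (hypothetical code, not DPDK or any other framework's actual API):
the hot loop is pinned to one core, which would also be isolated from the
general scheduler, e.g. via the isolcpus= boot parameter, and busy-polls
its queue instead of blocking.

  // poll_core.cc -- sketch of the "dedicate a core to the hot loop" pattern;
  // hypothetical code, not DPDK or any specific framework.  Linux-only:
  // pthread_setaffinity_np and CPU_SET are glibc extensions.
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif
  #include <pthread.h>
  #include <sched.h>
  #include <atomic>
  #include <cstdio>
  #include <thread>

  std::atomic<int>  work_queue{0};  // stand-in for a real lock-free ring buffer
  std::atomic<bool> running{true};

  void poll_loop(int core) {
    // Pin this thread to `core` so its working set stays in that core's caches.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    long handled = 0;
    while (running.load(std::memory_order_relaxed)) {
      // Busy-poll: grab whatever has arrived and process it.  The loop never
      // sleeps, so there are no per-item wakeups or context switches.
      handled += work_queue.exchange(0, std::memory_order_acquire);
    }
    handled += work_queue.exchange(0, std::memory_order_acquire);  // final drain
    std::printf("handled %ld items on core %d\n", handled, core);
  }

  int main() {
    std::thread poller(poll_loop, 0);        // core 0 is dedicated to polling
    for (int i = 0; i < 100000; ++i)         // producer side: enqueue work
      work_queue.fetch_add(1, std::memory_order_release);
    running.store(false);
    poller.join();
    return 0;
  }

The trade is one burned core in exchange for removing per-packet (or
per-IO) wakeups and the cache pollution that comes with them, which is
exactly the "revenue" traffic argument above.
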
> Granted, storage IO operations may be so much heavier in cycle count that 
> context switches never appear as a problem in themselves, certainly for 
> slower SSDs and HDDs. However, when going for truly high-performance IO, 
> *every* hurdle in the data path counts toward the total latency. 
> (And really, the characteristics of high-performance random IO approach 
> those of per-packet handling in networking.) Now, I'm not really 
> suggesting memory-mapping a storage device into user space, not at all, 
> but having better control over the data path for a very specific use case 
> reduces dependency on code that has to work as well as possible for the 
> general case, and allows for very purpose-built code that addresses a 
> narrow set of requirements. ("Ceph storage cluster backend" isn't a 
> typical FS use case.) It also removes the dependency on users, i.e. 
> waiting for the next distro release before being able to take up the 
> benefits of improvements to the storage code.
> 
> A random Google search came up with related data on where "doing something 
> way different" /can/ have significant benefits:
> http://phunq.net/pipermail/tux3/2015-April/002147.html
> 
> I (FWIW) certainly agree there is merit to the idea.
> The scientific approach here could perhaps be simply to enumerate all the 
> corner cases of a "generic FS" that are actually the cause of the issues 
> experienced, and to assess the probability of them being solved (and if 
> so, when). That *could* improve the chances of approaching consensus, 
> which wouldn't hurt, I suppose?
> 
> BR,
> Martin
> 
> 

