On 03/12/2017 12:19 AM, Kant Kodali wrote:
My response is inline.

On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity <a...@scylladb.com <mailto:a...@scylladb.com>> wrote:

    There are several issues at play here.

    First, a database runs a large number of concurrent operations,
    each of which only consumes a small amount of CPU. The high
    concurrency is needed to hide latency: disk latency, or the latency
    of contacting a remote node.

*OK, so you are talking about hiding I/O latency. If all of this I/O uses non-blocking system calls, then a thread per core plus a callback mechanism should suffice, shouldn't it?*

Scylla uses a mix of user-level threads and callbacks. Most of the code uses callbacks (fronted by a future/promise API). SSTable writers (memtable flush, compaction) use a user-level thread (internally implemented using callbacks). The important bit is multiplexing many concurrent operations onto a single kernel thread.
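
As a rough illustration of that multiplexing idea -- plain standard C++, not the actual Seastar future/promise API -- here is a sketch of many logical operations sharing one kernel thread by chaining small callbacks through a run-to-completion task queue:

// Hypothetical sketch: multiplex many logical operations onto one kernel
// thread. Each operation advances in small non-blocking steps and re-queues
// its continuation instead of blocking a thread.
#include <cstdio>
#include <deque>
#include <functional>

int main() {
    std::deque<std::function<void()>> ready;   // run-to-completion tasks

    // One "step" of an operation: do a little work, then schedule the next step.
    std::function<void(int, int)> step = [&](int op, int remaining) {
        std::printf("op %d: step done, %d steps left\n", op, remaining);
        if (remaining > 0)
            ready.push_back([&step, op, remaining] { step(op, remaining - 1); });
    };

    for (int op = 0; op < 4; ++op)             // start 4 concurrent operations
        ready.push_back([&step, op] { step(op, 2); });

    // The single-threaded event loop: operations interleave with no kernel
    // context switches between them.
    while (!ready.empty()) {
        auto task = std::move(ready.front());
        ready.pop_front();
        task();
    }
}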


    This means that the scheduler will need to switch contexts very
    often. A kernel thread scheduler knows very little about the
    application, so it has to save and restore a lot of context on
    each switch.  A user-level scheduler is tightly bound to the
    application, so it can perform the switching faster.


*Sure, but this applies in the other direction as well: a user-level scheduler has no idea about the kernel-level scheduler either. There is no coordination between the kernel-level scheduler and the user-level scheduler in Linux or any major OS. It may be possible on OSes that support scheduler activations (LWPs) and an upcall mechanism.*

There is no need for coordination, because the kernel scheduler has no scheduling decisions to make. With one thread per core, bound to its core, the kernel scheduler can't make the wrong decision because it has just one choice.
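
For illustration, a minimal sketch of that thread-per-core setup -- assuming Linux and the GNU extension pthread_setaffinity_np; this is not Scylla's actual startup code:

// Hypothetical sketch: one kernel thread per core, each pinned to its core,
// so the kernel scheduler has exactly one runnable choice per CPU.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1          // for pthread_setaffinity_np / CPU_SET
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    unsigned ncpus = std::thread::hardware_concurrency();
    std::vector<std::thread> shards;

    for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        shards.emplace_back([cpu] {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            // Bind this thread to exactly one core; it never migrates.
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
            std::printf("shard %u pinned to cpu %u\n", cpu, cpu);
            // ... run this shard's event loop here ...
        });
    }
    for (auto& t : shards) t.join();
}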


*Even then it is hard to say whether it is all worth it (the research suggests the performance gain may not outweigh the complexity). Go's problem is exactly this: if one creates 1000 goroutines/green threads and each of them makes a blocking system call, the runtime ends up creating 1000 kernel threads underneath, because it has no way to know that a kernel thread is blocked (no upcall).*

All of the significant system calls we issue go through the main thread, and they are either asynchronous or non-blocking.
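
For illustration only -- this is not Scylla's actual I/O path, just the general non-blocking pattern on Linux -- a sketch in which a single thread marks a descriptor non-blocking and waits for readiness with epoll, so no system call ever parks the thread on I/O:

// Hypothetical sketch: non-blocking I/O driven from one thread via epoll.
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fds[2];
    if (pipe(fds) != 0) return 1;
    fcntl(fds[0], F_SETFL, O_NONBLOCK);        // reads never block this thread

    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);

    if (write(fds[1], "x", 1) != 1) return 1;  // simulate an I/O completion

    epoll_event events[16];
    int n = epoll_wait(ep, events, 16, 1000);  // the only place we ever wait
    for (int i = 0; i < n; ++i) {
        char buf[64];
        ssize_t r = read(events[i].data.fd, buf, sizeof buf);  // won't block
        std::printf("fd %d readable, read %zd bytes\n", events[i].data.fd, r);
    }
    close(ep); close(fds[0]); close(fds[1]);
}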

*And in the non-blocking case I still don't see a significant performance difference compared to a few kernel threads with a callback mechanism.*

We do.

*If you are saying user-level scheduling is the future (perhaps I would just let the researchers argue about that), as of today that is not the case; otherwise languages would have it natively instead of relying on third-party frameworks or libraries.*

User-level scheduling is great for high performance I/O intensive applications like databases and file systems. It's not a general solution, and it involves a lot of effort to set up the infrastructure. However, for our use case, it was worth it.

    There are also implications for the concurrency primitives in use
    (locks etc.) -- they will be much faster for the user-level
    scheduler, because they cooperate with the scheduler.  For
    example, no atomic read-modify-write instructions need to be executed.
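
A minimal sketch of what such a cooperative primitive could look like -- a hypothetical counting semaphore, not Seastar's actual one. Because its state is only ever touched by the one kernel thread running the scheduler, plain integer arithmetic and a queue of waiter callbacks replace atomic read-modify-write and futex calls:

// Hypothetical sketch: a semaphore for a single-threaded cooperative
// scheduler. No atomics, no kernel involvement -- waiters are callbacks.
#include <cstdio>
#include <deque>
#include <functional>

class coop_semaphore {
    long count_;
    std::deque<std::function<void()>> waiters_;   // continuations, not threads
public:
    explicit coop_semaphore(long count) : count_(count) {}

    // Run `task` now if a unit is available, otherwise queue it.
    void wait_then(std::function<void()> task) {
        if (count_ > 0) { --count_; task(); }     // ordinary arithmetic
        else            { waiters_.push_back(std::move(task)); }
    }

    // Release a unit; hand it straight to the next waiter if there is one.
    void signal() {
        if (!waiters_.empty()) {
            auto next = std::move(waiters_.front());
            waiters_.pop_front();
            next();
        } else {
            ++count_;
        }
    }
};

int main() {
    coop_semaphore sem(1);
    sem.wait_then([] { std::printf("first operation holds the unit\n"); });
    sem.wait_then([] { std::printf("second operation waits for signal\n"); });
    sem.signal();   // first operation finishes; the waiter runs immediately
}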


    Second, how many (kernel) threads should you run?

*This question one will always have. If there are 10K user-level threads that map to only one kernel thread, then they cannot exploit parallelism. So there is no single right answer, but a thread per core is a reasonable choice.*

Only if you can multiplex many operations on top of each of those threads. Otherwise, the CPUs end up underutilized.

    If you run too few threads, then you will not be able to saturate
    the CPU resources.  This is a common problem with Cassandra --
    it's very hard to get it to consume all of the CPU power on even a
    moderately large machine. On the other hand, if you have too many
    threads, you will see latency rise very quickly, because kernel
    scheduling granularity is on the order of milliseconds.
    User-level scheduling, because it leaves control in the hands of
    the application, allows you to both saturate the CPU and maintain
    low latency.


*For my workload, and probably others I had seen, Cassandra has always been CPU bound.*

Yes, but does it consume 100% of all of the cores on your machine? Cassandra generally doesn't (on a larger machine), and when you profile it, you see it spending much of its time in atomic operations, or parking/unparking threads -- fighting with itself. It doesn't scale within the machine. Scylla will happily utilize all of the cores that it is assigned (all of them by default in most configurations), and the bigger the machine you give it, the happier it will be.

    There are other factors, like NUMA-friendliness, but in the end it
    all boils down to efficiency and control.

    None of this is new btw, it's pretty common in the storage world.

    Avi


    On 03/11/2017 11:18 PM, Kant Kodali wrote:
    Here is the Java version http://docs.paralleluniverse.co/quasar/
    <http://docs.paralleluniverse.co/quasar/>, but I still don't see
    how user-level scheduling can be beneficial (this is a well-debated
    problem). How can it add to the performance? Or, put another way,
    why is user-level scheduling necessary given the thread-per-core
    design and the callback mechanism?

    On Sat, Mar 11, 2017 at 12:51 PM, Avi Kivity <a...@scylladb.com
    <mailto:a...@scylladb.com>> wrote:

        Scylla uses the seastar framework, which provides for both
        user-level thread scheduling and simple run-to-completion tasks.

        Huge pages are limited to 2MB (and 1GB, but these aren't
        available as transparent hugepages).
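
As a sketch of how an application can at least request 2MB transparent huge pages on Linux -- illustrative only, not Scylla's allocator -- an anonymous mapping can be hinted with madvise(MADV_HUGEPAGE), while 1GB pages require explicit hugetlbfs mappings rather than THP:

// Hypothetical sketch: hint that an anonymous mapping should use 2MB THP.
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t len = 64UL << 20;             // 64 MB, a multiple of 2 MB
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Ask the kernel to back this range with 2 MB huge pages if it can.
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        std::perror("madvise(MADV_HUGEPAGE)");

    std::memset(p, 0, len);                    // touch it so pages are allocated
    std::printf("mapped %zu bytes with a THP hint at %p\n", len, p);
    munmap(p, len);
}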


        On 03/11/2017 10:26 PM, Kant Kodali wrote:
        @Dor

        1) You guys have a CPU scheduler? You mean a user-level thread
        scheduler that maps user-level threads to kernel-level
        threads? I thought C++ by default creates native kernel
        threads, but sure, nothing stops someone from creating a
        user-level scheduling library, if that's what you are talking about.
        2) How can one create THP of size 1KB? According to this post
        <https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html>
        it looks like the valid values are 2MB and 1GB.

        Thanks,
        kant

        On Sat, Mar 11, 2017 at 11:41 AM, Avi Kivity
        <a...@scylladb.com <mailto:a...@scylladb.com>> wrote:

            Agreed. I'd recommend treating benchmarks as a rough
            guide to see where there is potential, and following
            through with your own tests.

            On 03/11/2017 09:37 PM, Edward Capriolo wrote:

            Benchmarks are great for FUDly blog posts. Real-world
            workloads matter more. Every NoSQL vendor wins their
            benchmarks.
