On Sat, Mar 11, 2017 at 2:19 PM, Kant Kodali <k...@peernova.com> wrote:

> My response is inline.
>
> On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity <a...@scylladb.com> wrote:
>
>> There are several issues at play here.
>>
>> First, a database runs a large number of concurrent operations, each of
>> which only consumes a small amount of CPU. The high concurrency is needed to
>> hide latency: disk latency, or the latency of contacting a remote node.
>>
>
> *Ok, so you are talking about hiding I/O latency.  If all these I/Os are
> non-blocking system calls, then a thread per core plus a callback mechanism
> should suffice, shouldn't it?*
>

In general, yes, but in practice it's more complicated. Each such thread
runs many different tasks, so you need a mechanism to switch between those
tasks; in our case that is the Seastar continuation engine. However, things
get more complicated still: we found that we need a CPU scheduler which
takes into account the priority of different task types, such as repair,
compaction, streaming, read operations and write operations. We always
prioritize foreground operations over background ones, and thus even while
we repair TBs of data, latency stays very low (this feature is coming in
Scylla 1.8).
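
To make that concrete, here is a minimal sketch of the idea (illustrative
only, not Seastar's actual code): a per-core task queue that always drains
foreground work before background work.

    #include <deque>
    #include <functional>

    // Per-core, priority-aware task queue. Seastar's real scheduler is
    // far more elaborate (scheduling groups with shares, preemption,
    // etc.); this only shows foreground work taking precedence.
    class scheduler {
        std::deque<std::function<void()>> foreground_;  // reads, writes
        std::deque<std::function<void()>> background_;  // repair, compaction
    public:
        void add_foreground(std::function<void()> t) {
            foreground_.push_back(std::move(t));
        }
        void add_background(std::function<void()> t) {
            background_.push_back(std::move(t));
        }
        // Run one task; foreground always wins when present.
        bool run_one() {
            auto& q = foreground_.empty() ? background_ : foreground_;
            if (q.empty()) {
                return false;
            }
            auto task = std::move(q.front());
            q.pop_front();
            task();
            return true;
        }
    };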



>
>
>> This means that the scheduler will need to switch contexts very often. A
>> kernel thread scheduler knows very little about the application, so it has
>> to save and restore a lot of context on each switch.  A user-level
>> scheduler is tightly bound to the application, so it can perform the
>> switching faster.
>>
>
> *Sure, but this applies in the other direction as well. A user-level
> scheduler has no idea about the kernel-level scheduler either.  There is
> literally no coordination between the kernel-level scheduler and the
> user-level scheduler in Linux or any major OS. It may be possible with
> OSes *
>

Correct. That's why we let the OS scheduler run just one thread per core,
and we bind that thread to its CPU. Inside it, we do our own scheduling
with the Seastar scheduler, and the OS doesn't know and doesn't care.
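
For illustration, a minimal sketch of that setup (simplified and assumed,
not Seastar's actual startup code): spawn one kernel thread per core and
pin each with pthread_setaffinity_np(), so the OS scheduler has nothing
left to decide.

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    // One kernel thread per core, each pinned to its CPU (Linux/glibc).
    // All further scheduling happens in user space inside the loop.
    int main() {
        unsigned n = std::thread::hardware_concurrency();
        std::vector<std::thread> reactors;
        for (unsigned cpu = 0; cpu < n; ++cpu) {
            reactors.emplace_back([cpu] {
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(cpu, &set);
                pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
                // run_reactor_loop();  // hypothetical per-core event loop
            });
        }
        for (auto& t : reactors) {
            t.join();
        }
    }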

More below


> *that support scheduler activations (LWPs) and an upcall mechanism. Even
> then it is hard to say if it is all worth it (the research suggests the
> performance may not outweigh the complexity). Golang's problem is exactly
> this: if one creates 1000 goroutines/green threads where each of them
> makes a blocking system call, then it creates 1000 kernel threads
> underneath, because it has no way to know that a kernel thread is blocked
> (no upcall). And in the non-blocking case I still don't see a significant
> performance gain compared to a few kernel threads with a callback
> mechanism.  If you are saying user-level scheduling is the future, perhaps
> I would just let the researchers argue about it; as of today that is not
> the case, else languages would have it natively instead of relying on
> third-party frameworks or libraries. *
>

That's why we do not run blocking system calls at all. We had to limit
ourselves to the XFS filesystem, since the others did not have proper AIO
support. Recently we bypassed some of the issues which made EXT4 block, and
it may now be OK with our AIO pattern.
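
For reference, the kernel AIO pattern looks roughly like this (a simplified
sketch with error handling elided; a real reactor would poll with a zero
timeout instead of waiting in io_getevents):

    #include <libaio.h>   // link with -laio
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main() {
        // O_DIRECT is what keeps XFS from blocking; it also requires
        // aligned buffers, offsets and sizes.
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        void* buf;
        posix_memalign(&buf, 4096, 4096);

        io_context_t ctx = 0;
        io_setup(128, &ctx);                   // queue depth of 128

        struct iocb cb;
        struct iocb* cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);  // read 4KB at offset 0
        io_submit(ctx, 1, cbs);                // returns without blocking

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, nullptr); // reap the completion
        io_destroy(ctx);
        close(fd);
        free(buf);
    }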

We even wrote a DNS implementation that doesn't block and doesn't lock (for
us, even a library that uses spin locks under the hood is bad).

Bear in mind that the whole thing is simple to run; the user doesn't need
to know anything about this complexity.




>
>
>> There are also implications on the concurrency primitives in use (locks
>> etc.) -- they will be much faster for the user-level scheduler, because
>> they cooperate with the scheduler.  For example, no atomic
>> read-modify-write instructions need to be executed.
>>
>
>
>      Second, how many (kernel) threads should you run? *This question
> will always come up. If there are 10K user-level threads that map to only
> one kernel thread, they cannot exploit parallelism, so there is no single
> right answer, but a thread per core is a reasonable/good choice. *
>

+1
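
On the concurrency-primitives point above, here is a minimal sketch of why
they get cheaper (illustrative, not Seastar's actual semaphore): with all
tasks on a core running in a single kernel thread, a semaphore needs no
atomic instructions or memory barriers, just plain loads and stores plus a
queue of parked continuations.

    #include <deque>
    #include <functional>

    // Semaphore for a single-threaded, per-core reactor.
    class semaphore {
        size_t count_;
        std::deque<std::function<void()>> waiters_;
    public:
        explicit semaphore(size_t count) : count_(count) {}
        void wait(std::function<void()> cont) {
            if (count_ > 0) {
                --count_;
                cont();                        // unit available, run now
            } else {
                waiters_.push_back(std::move(cont));  // park continuation
            }
        }
        void signal() {
            if (!waiters_.empty()) {
                auto cont = std::move(waiters_.front());
                waiters_.pop_front();
                cont();                        // hand the unit straight over
            } else {
                ++count_;
            }
        }
    };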


>
>
>> If you run too few threads, then you will not be able to saturate the CPU
>> resources.  This is a common problem with Cassandra -- it's very hard to
>> get it to consume all of the CPU power on even a moderately large machine.
>> On the other hand, if you have too many threads, you will see latency rise
>> very quickly, because kernel scheduling granularity is on the order of
>> milliseconds.  User-level scheduling, because it leaves control in the hand
>> of the application, allows you to both saturate the CPU and maintain low
>> latency.
>>
>
>     *For my workload, and probably others I have seen, Cassandra was
> always CPU bound.*
>

Could be. However, try to make it CPU bound on 10 cores, 20 cores and more.
The more cores you use, the fewer nodes you need, and the overall overhead
decreases.


>
>> There are other factors, like NUMA-friendliness, but in the end it all
>> boils down to efficiency and control.
>>
>> None of this is new btw, it's pretty common in the storage world.
>>
>> Avi
>>
>>
>> On 03/11/2017 11:18 PM, Kant Kodali wrote:
>>
>> Here is the Java version, http://docs.paralleluniverse.co/quasar/, but I
>> still don't see how user-level scheduling can be beneficial (this is a
>> well-debated problem). How can it add to the performance? Or rather, why
>> is user-level scheduling necessary, given the thread-per-core design and
>> the callback mechanism?
>>
>> On Sat, Mar 11, 2017 at 12:51 PM, Avi Kivity <a...@scylladb.com> wrote:
>>
>>> Scylla uses the Seastar framework, which provides both user-level
>>> thread scheduling and simple run-to-completion tasks.
>>>
>>> Huge pages are limited to 2MB (and 1GB, but these aren't available as
>>> transparent hugepages).
>>>
>>>
>>> On 03/11/2017 10:26 PM, Kant Kodali wrote:
>>>
>>> @Dor
>>>
>>> 1) You guys have a CPU scheduler? You mean a user-level thread scheduler
>>> that maps user-level threads to kernel-level threads? I thought C++ by
>>> default creates native kernel threads, but sure, nothing stops someone
>>> from creating a user-level scheduling library, if that's what you are
>>> talking about.
>>> 2) How can one create a THP of size 1KB? According to this post
>>> <https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-transhuge.html>
>>> it looks like the valid values are 2MB and 1GB.
>>>
>>> Thanks,
>>> kant
>>>
>>> On Sat, Mar 11, 2017 at 11:41 AM, Avi Kivity <a...@scylladb.com> wrote:
>>>
>>>> Agreed, I'd recommend treating benchmarks as a rough guide to see where
>>>> there is potential, and following through with your own tests.
>>>>
>>>> On 03/11/2017 09:37 PM, Edward Capriolo wrote:
>>>>
>>>>
>>>> Benchmarks are great for FUDly blog posts. Real-world workloads matter
>>>> more. Every NoSQL vendor wins its own benchmarks.
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
