[ https://issues.apache.org/jira/browse/CASSANDRA-10989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Murukesh Mohanan updated CASSANDRA-10989:
-----------------------------------------
    Description: 
Since its inception, Cassandra has been utilising 
[SEDA|http://www.mdw.la/papers/seda-sosp01.pdf] at its core.

As originally conceived, SEDA splits every request into several stages, each backed by its own thread pool (see the sketch after the list below). That design imposes certain challenges:
- thread parking/unparking overheads (partially improved by SEPExecutor in CASSANDRA-4718)
- extensive context switching (i-/d-cache thrashing)
- less-than-optimal multi-writer/multi-reader data structures for memtables, partitions, metrics, and more
- concurrent code that is hard to grok
- a large number of GC roots and longer time-to-safepoint (TTSP)
- increased complexity for moving data structures off the Java heap
- inability to easily balance writes/reads/compaction/flushing
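
To make the pattern concrete, here is a minimal, purely illustrative sketch of the staged hand-off (hypothetical stage names and pool sizes, not the actual {{Stage}}/{{SEPExecutor}} code): a single request crosses three thread pools, paying the parking/unparking and context-switch costs listed above at every hop.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative SEDA-style pipeline: each stage owns a thread pool and a
// request is handed off between pools at every stage boundary.
public class SedaSketch
{
    static final ExecutorService REQUEST_STAGE  = Executors.newFixedThreadPool(128);
    static final ExecutorService READ_STAGE     = Executors.newFixedThreadPool(32);
    static final ExecutorService RESPONSE_STAGE = Executors.newFixedThreadPool(128);

    static CompletableFuture<String> handle(String query)
    {
        return CompletableFuture.supplyAsync(() -> parse(query), REQUEST_STAGE)      // hop 1
                                .thenApplyAsync(SedaSketch::read, READ_STAGE)        // hop 2
                                .thenApplyAsync(SedaSketch::encode, RESPONSE_STAGE); // hop 3
    }

    static String parse(String q)  { return q.trim(); }
    static String read(String q)   { return "row-for-" + q; }
    static String encode(String r) { return "{\"result\":\"" + r + "\"}"; }

    public static void main(String[] args) throws Exception
    {
        System.out.println(handle(" SELECT ... ").get());
        REQUEST_STAGE.shutdown(); READ_STAGE.shutdown(); RESPONSE_STAGE.shutdown();
    }
}
{code}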

The latency implications of SEDA have been acknowledged by the authors themselves - see the 2010 [retrospective on SEDA|http://matt-welsh.blogspot.co.uk/2010/07/retrospective-on-seda.html].

To fix these issues (and more), two years ago at NGCC [~benedict] suggested moving Cassandra away from SEDA to the more mechanically sympathetic thread-per-core (TPC) architecture. See the slides from the original presentation [here|https://docs.google.com/presentation/d/19_U8I7mq9JKBjgPmmi6Hri3y308QEx1FmXLt-53QqEw/edit?ts=56265eb4#slide=id.g98ad32b25_1_19].

In a nutshell, each core would become a logical shared-nothing micro-instance of Cassandra, taking over a portion of the node’s range {{*}}.
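
A hedged sketch of what per-core ownership could look like (the mapping function and names are invented for illustration; the real partitioning scheme is up to the implementation):

{code:java}
// Hypothetical per-core ownership: a partition token is mapped onto one of
// NUM_CORES shards, so every partition has exactly one "home" core.
final class CoreAffinity
{
    static final int NUM_CORES = Runtime.getRuntime().availableProcessors();

    /** Map a (e.g. Murmur3) partition token to the core that owns it. */
    static int coreFor(long token)
    {
        // spread the signed 64-bit token evenly across the cores
        return (int) Long.remainderUnsigned(token, NUM_CORES);
    }

    public static void main(String[] args)
    {
        System.out.println("token 42 is owned by core " + coreFor(42L));
    }
}
{code}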

Client connections will be assigned randomly to one of the cores (sharing a 
single listen socket). A request that cannot be served by the client’s core 
will be proxied to the one owning the data, similar to the way we perform 
remote coordination today.
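
The local-versus-proxied decision might then look roughly like this (again a hedged sketch with hypothetical names; the real dispatch would live in the request path):

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical routing: serve on the connection's own core when it owns the
// token, otherwise hand the task to the owning core's event loop - much like
// a coordinator forwarding to a remote replica today.
final class CoreRouter
{
    private final Executor[] coreLoops; // one single-threaded event loop per core

    CoreRouter(Executor[] coreLoops) { this.coreLoops = coreLoops; }

    CompletableFuture<String> execute(int connectionCore, long token)
    {
        int owner = (int) Long.remainderUnsigned(token, coreLoops.length);
        if (owner == connectionCore)
            return CompletableFuture.completedFuture(readLocally(token));            // fast path, no hand-off
        return CompletableFuture.supplyAsync(() -> readLocally(token), coreLoops[owner]); // proxied to the owner
    }

    private String readLocally(long token) { return "row-for-" + token; }
}
{code}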

Each thread (pinned to an exclusive core) would have a single event loop, and 
be responsible for both serving requests and performing maintenance tasks 
(flushing, compaction, repair), scheduling them intelligently.
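
One way to picture such a loop (purely a sketch; the priority scheme and names are invented, and a real scheduler would be far more nuanced):

{code:java}
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical per-core event loop: a single pinned thread drains one queue,
// so client requests and maintenance work (flush, compaction, repair slices)
// are interleaved by the same scheduler instead of by competing thread pools.
final class CoreEventLoop implements Runnable
{
    static final class Task
    {
        final int priority;   // 0 = client request, 1 = background maintenance
        final Runnable work;
        Task(int priority, Runnable work) { this.priority = priority; this.work = work; }
    }

    private final PriorityBlockingQueue<Task> queue =
            new PriorityBlockingQueue<>(64, Comparator.comparingInt((Task t) -> t.priority));

    void submitRequest(Runnable r)     { queue.add(new Task(0, r)); }
    void submitMaintenance(Runnable r) { queue.add(new Task(1, r)); }

    @Override
    public void run()
    {
        while (!Thread.currentThread().isInterrupted())
        {
            try { queue.take().work.run(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }
}
{code}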

One notable exception from the original proposal is that we cannot, unfortunately, use Linux AIO for file I/O, as it's only properly implemented for xfs. We might, however, have a specialised implementation for xfs and Windows (based on IOCP) later. In the meantime, we have no choice but to hand off I/O that cannot be served from cache to a separate threadpool.
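
The cache-or-hand-off pattern could look roughly like this (hedged sketch; the cache, pool size and names are placeholders):

{code:java}
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical cached reader: hits stay on the core's event loop, misses are
// shipped to a shared blocking-I/O pool and the result is posted back to the
// originating core.
final class CachedReader
{
    private final Map<Long, byte[]> chunkCache = new ConcurrentHashMap<>();
    private final ExecutorService ioPool = Executors.newFixedThreadPool(8);

    CompletableFuture<byte[]> read(long chunk, Executor coreLoop)
    {
        byte[] cached = chunkCache.get(chunk);
        if (cached != null)
            return CompletableFuture.completedFuture(cached);                    // served without leaving the core

        return CompletableFuture.supplyAsync(() -> readFromDisk(chunk), ioPool)  // blocking I/O off the loop
                                .thenApplyAsync(bytes -> {                       // resume on the owning core
                                    chunkCache.put(chunk, bytes);
                                    return bytes;
                                }, coreLoop);
    }

    private byte[] readFromDisk(long chunk) { return new byte[4096]; }            // stand-in for a pread
}
{code}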

Transitioning from SEDA to TPC will be done in stages, incrementally and in 
parallel.

This is a high-level overview meta-ticket that will track JIRA issues for each 
individual stage.

{{*}} they’ll still share certain things, like schema, gossip, file I/O threadpool(s), and maybe MessagingService.



> Move away from SEDA to TPC
> --------------------------
>
>                 Key: CASSANDRA-10989
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10989
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Aleksey Yeschenko
>            Priority: Major
>              Labels: performance
>


