[ 
https://issues.apache.org/jira/browse/CASSANDRA-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968636#comment-13968636
 ] 

Jason Brown commented on CASSANDRA-4718:
----------------------------------------

As to multi-CPU machines, I spent a lot of time thinking about the effects of 
NUMA systems on CAS operations/algs (esp. wrt FJP, obviously). As I mentioned, 
I'm using systems with two sockets (two NUMA cores). As you get more sockets 
(and thus more NUMA cores), a thread on one core will be reaching across to 
more cores to do work stealing, thus adding contention on that memory address. 
Imagine four threads on four sockets all contending for work on a fifth thread. 
The memory values for that portion of the fifth thread's queue are now pulled 
into all four sockets, thus becoming more of a contention point, as well as 
impacting latency (due to the CAS operation). However, this could be (and 
hopefully is) less of a cost than bothering with queues, blocking, POSIX 
threads, OS interrupts, and everything else that makes standard thread pool 
executors work.
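
To make the contention scenario concrete, here is a rough sketch of several 
"thief" threads claiming tasks from one victim's queue via CAS on a shared 
index. This is not FJP's actual deque code; the class name CasStealSketch and 
its run method are made up for illustration, and the count of failed CAS 
attempts is only a crude stand-in for the cross-socket cache-line traffic 
described above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; not how ForkJoinPool implements stealing.
class CasStealSketch {
    // Returns {tasksClaimed, failedCasAttempts}.
    static long[] run(int thieves, int tasks) {
        // Shared "head" index of the victim's queue: the contended address.
        AtomicInteger head = new AtomicInteger(0);
        AtomicLong claimed = new AtomicLong();
        AtomicLong failed = new AtomicLong();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[thieves];

        for (int i = 0; i < thieves; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all thieves at once
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long myClaimed = 0, myFailed = 0;
                while (true) {
                    int h = head.get();
                    if (h >= tasks) break;            // nothing left to steal
                    if (head.compareAndSet(h, h + 1)) {
                        myClaimed++;                  // this thief owns task h
                    } else {
                        myFailed++;                   // lost the race; retry
                    }
                }
                claimed.addAndGet(myClaimed);
                failed.addAndGet(myFailed);
            });
            workers[i].start();
        }

        start.countDown();
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { claimed.get(), failed.get() };
    }

    public static void main(String[] args) {
        long[] r = run(4, 1_000_000);
        System.out.println("claimed=" + r[0] + " failedCas=" + r[1]);
    }
}
```

Every task is claimed exactly once regardless of thread count; the failed-CAS 
count varies from run to run and by machine, so treat it as indicative at 
best.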

Thinking even crazier about optimizing FJP sharing across NUMA cores, this is 
where I start thinking about digging up the thread affinity work again and 
binding threads of similar types (probably by Stage) to sockets, not just to 
an individual CPU (I think that was my problem before). But then I wonder how 
much is to be gained on non-NUMA systems, or on systems where you can't 
determine whether they are NUMA or not (hello, cloud!) - and at that point I'm 
happy to realize the gains we have and move forward.

bq. what problem are you seeing?

Will ping you offline - too unexciting for this space :)

> More-efficient ExecutorService for improved throughput
> ------------------------------------------------------
>
>                 Key: CASSANDRA-4718
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4718
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Jason Brown
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.1
>
>         Attachments: 4718-v1.patch, PerThreadQueue.java, baq vs trunk.png, op 
> costs of various queues.ods, stress op rate with various queues.ods, 
> v1-stress.out
>
>
> Currently all our execution stages dequeue tasks one at a time.  This can 
> result in contention between producers and consumers (although we do our best 
> to minimize this by using LinkedBlockingQueue).
> One approach to mitigating this would be to make consumer threads do more 
> work in "bulk" instead of just one task per dequeue.  (Producer threads tend 
> to be single-task oriented by nature, so I don't see an equivalent 
> opportunity there.)
> BlockingQueue has a drainTo(collection, int) method that would be perfect for 
> this.  However, no ExecutorService in the jdk supports using drainTo, nor 
> could I google one.
> What I would like to do here is create just such a beast and wire it into (at 
> least) the write and read stages.  (Other possible candidates for such an 
> optimization, such as the CommitLog and OutboundTCPConnection, are not 
> ExecutorService-based and will need to be one-offs.)
> AbstractExecutorService may be useful.  The implementations of 
> ICommitLogExecutorService may also be useful. (Despite the name these are not 
> actual ExecutorServices, although they share the most important properties of 
> one.)
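
For reference, the bulk-dequeue idea in the description above could look 
roughly like the sketch below: block for one task with take(), then grab 
whatever else is already queued with drainTo(collection, int). The names 
BatchDrainSketch, consume, demo and the POISON shutdown marker are all 
hypothetical; this is neither the attached patch nor a full ExecutorService.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a consumer loop that dequeues in bulk.
class BatchDrainSketch {
    // Sentinel used only by this sketch to tell the consumer to stop.
    static final Runnable POISON = () -> {};

    // Runs tasks from the queue in batches of up to maxBatch (must be >= 1).
    // Returns the number of tasks executed before POISON was seen.
    static int consume(BlockingQueue<Runnable> queue, int maxBatch) {
        List<Runnable> batch = new ArrayList<>(maxBatch);
        int executed = 0;
        try {
            while (true) {
                batch.add(queue.take());            // block until work exists
                queue.drainTo(batch, maxBatch - 1); // then take the rest in bulk
                for (Runnable task : batch) {
                    if (task == POISON) return executed;
                    task.run();
                    executed++;
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return executed;
        }
    }

    // Fills a queue with counter increments, consumes it, returns the counter.
    static int demo(int tasks, int maxBatch) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        AtomicInteger counter = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            queue.add(counter::incrementAndGet);
        }
        queue.add(POISON);
        consume(queue, maxBatch);
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("executed=" + demo(1000, 16));
    }
}
```

The consumer touches the queue once per batch instead of once per task, which 
is the reduction in producer/consumer contention the description is after. 
Whether it helps in practice would need to be measured against the stages in 
question.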



--
This message was sent by Atlassian JIRA
(v6.2#6252)
