@michael
> If you're talking about the solution that you reverted just before we released, then we did test it in our real testing env. I didn't notice any negative impact for our use cases.
That's nice to hear, Michael. Probably I could have avoided reverting it, but consider that the academic use case, i.e. 1 P/1 C/1 Q, is just a way to emulate a "low-utilization" case. As Christopher has said, there are users that don't want to trade away low-utilization performance...and I understand them. The thing is: have you seen whether using LinkedTransferQueue improves over the original thread pool in a many-producers/consumers/queues scenario?

> Regarding the code, I actually thought what you had made the code cleaner.

That's another benefit, but the change was meant to provide only benefits; that's why I've reverted it...

> I think we engineer for multiple consumers/multiple producers and should test with such a setup.

I agree with you, but it hurts my heart to deliver a patch that causes perf regressions without providing a huge benefit, i.e. one proven in other cases...

Thanks guys,
Franz

On Wed, 20 Mar 2019 at 13:34, <[email protected]> wrote:

So Franz,

If you're talking about the solution that you reverted just before we released, then we did test it in our real testing env. I didn't notice any negative impact for our use cases.

Regarding the code, I actually thought what you had made the code cleaner.

I'm not sure what use case you're concerned about with your change; generally I expect brokers to have more than just one queue and just one producer/consumer in real-world use cases. I think having just one producer, one consumer and one queue on a whole broker is very academic and not a typical real-world case. I think we engineer for multiple consumers/multiple producers and should test with such a setup.

On Wed, Mar 20, 2019 at 1:15 PM +0100, "Francesco Nigro" <[email protected]> wrote:

> That being said, is there an actual real world throughput issue here?

Yes and no: it's a chance to improve things, especially for cloud uses. It is a fact that, now that Spectre and Meltdown are around, we don't want to waste CPU time sitting idle or on contention if it isn't needed and, as I've said, there "is a giant lock on any task submitted". IMO having a talk on how to improve it is not over-engineering, but just engineering, given that scaling non-persistent messages (or persistent ones with very fast disks) is something we expect from a broker: from a commercial point of view it's nice that we can scale by adding brokers, but if you can save 2 machines to get the same throughput, I think that's a nice improvement for (m)any users.

> In general, I don't know that I see much value in over engineering and micro managing this stuff unless there's a real world measurable benefit to be gained vs just theoretical benchmarks as it's just going to make things harder to maintain and mistakes easier to make in the future.

Cassandra from DataStax has gained about 2X throughput by solving this; it could be said that's a "different scenario" too, but as an engineer I can say no, it is not. I've "recently" addressed a similar "issue" on qpid-jms with the client team, getting near 2X throughput (nudge nudge Robbie Gemmell/Tim Bish). And this "issue" (actually, a "chance to improve things") had been hiding in plain sight for a long time: https://issues.apache.org/jira/browse/QPIDJMS-396.

The reason why I've written on the dev list is to understand whether anyone has had the chance to measure something like this under a real load scenario.
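For the curious, the shape of the swap being discussed is roughly the following (a minimal sketch against a plain java.util.concurrent ThreadPoolExecutor; the real pool is ActiveMQThreadPoolExecutor and differs in detail):

    import java.util.concurrent.*;

    // Sketch only (not the Artemis code): the same fixed-size pool built on
    // the two work queues being compared. LinkedBlockingQueue serializes
    // every offer/poll behind a single ReentrantLock; LinkedTransferQueue
    // is lock-free (CAS-based).
    public class PoolQueues {
       public static void main(String[] args) {
          int threads = Runtime.getRuntime().availableProcessors();

          ExecutorService locked = new ThreadPoolExecutor(
             threads, threads, 60, TimeUnit.SECONDS,
             new LinkedBlockingQueue<>());

          // core == max on purpose: LinkedTransferQueue.offer always returns
          // true, which would defeat the "queue full -> spawn thread" growth
          // of a non-fixed ThreadPoolExecutor.
          ExecutorService lockFree = new ThreadPoolExecutor(
             threads, threads, 60, TimeUnit.SECONDS,
             new LinkedTransferQueue<>());

          locked.shutdown();
          lockFree.shutdown();
       }
    }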
On Wed, 20 Mar 2019 at 12:07, Christopher Shannon <[email protected]> wrote:

I don't think sacrificing low utilization is a good idea. That being said, is there an actual real world throughput issue here? In general, I don't know that I see much value in over engineering and micro managing this stuff unless there's a real world measurable benefit to be gained vs just theoretical benchmarks, as it's just going to make things harder to maintain and mistakes easier to make in the future.

On Wed, Mar 20, 2019 at 6:51 AM Francesco Nigro wrote:

Hi folks,

I'm writing here to share some thoughts related to the Artemis threading model and how it affects broker scalability.

Currently (on 2.7.0) we rely on a shared thread pool, i.e. ActiveMQThreadPoolExecutor backed by a LinkedBlockingQueue-ish queue, to process tasks. Thanks to the Actor abstraction we use a lock-free queue to serialize tasks (or items), processing them in batches on the shared thread pool and waking a consumer thread only if needed (the logic is contained in ProcessorBase). The waking operation (i.e. ProcessorBase::onAddedTaskIfNotRunning) submits to the shared thread pool a specific task that drains and executes a batch of tasks only when necessary, not on every added task/item.

Looking at the contention graphs of the broker (where the bar width is the nanoseconds spent before entering a lock), the limitation of the current implementation is quite clear:

[image: contention graph of the 2.7.0 broker — offer/poll on the shared thread pool's LinkedBlockingQueue dominate]

In violet are shown the offer and poll operations on the LinkedBlockingQueue of the shared thread pool, happening from any thread of the pool (the thread is the base of each bar, in red). The LinkedBlockingQueue has a ReentrantLock protecting every operation on the linked queue, and it is clear that a giant lock in front of a highly contended point won't scale.

The above graph was obtained with a single producer/single consumer/single queue/non-persistent run, but I don't have enough resources to check what could happen with more and more producers/consumers/queues. The critical part is the offering/polling of tasks on the shared thread pool: in theory a maxed-out broker shouldn't have many idle threads to be woken, but given that more producers/consumers/queues means many different Actors, in order to guarantee that each actor's tasks get executed the shared thread pool has to process many unnecessary "awake" tasks, creating a lot of contention on the blocking linked queue and slowing down the entire broker.

In the past I've tried to replace the current shared thread pool implementation with a ForkJoinPool or (the most recent attempt) with a lock-free queue instead of the LinkedBlockingQueue, with no success (https://github.com/apache/activemq-artemis/pull/2582).
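To make the ProcessorBase pattern above concrete, the "wake only if needed" idea looks roughly like this (an illustrative sketch, not the actual Artemis code — class and method names here are made up):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executor;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative sketch: each actor queues its tasks lock-free and hands
    // a drain task to the shared pool only when it isn't already scheduled.
    final class MiniActor implements Runnable {
       private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
       private final AtomicBoolean scheduled = new AtomicBoolean();
       private final Executor sharedPool;

       MiniActor(Executor sharedPool) {
          this.sharedPool = sharedPool;
       }

       void submit(Runnable task) {
          tasks.offer(task);
          // the "awake" step: on 2.7.0 this execute() call lands on the
          // shared pool's LinkedBlockingQueue, i.e. behind its ReentrantLock
          if (scheduled.compareAndSet(false, true)) {
             sharedPool.execute(this);
          }
       }

       @Override
       public void run() {
          Runnable task;
          while ((task = tasks.poll()) != null) {   // drain a batch, serially
             task.run();
          }
          scheduled.set(false);
          // re-check: a task may have arrived after the drain but before unset
          if (!tasks.isEmpty() && scheduled.compareAndSet(false, true)) {
             sharedPool.execute(this);
          }
       }
    }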
Below is the contention graph using a lock-free queue in the shared thread pool:

[image: contention graph with the lock-free queue — QueueImpl::deliver and RefsOperation::afterCommit contend on the QueueImpl lock]

In violet we now have QueueImpl::deliver and RefsOperation::afterCommit contending on the QueueImpl lock, but the numbers per bar are very different: in the previous graph the contention on the shared thread pool lock is around 600 ns, while here it is 20-80 ns, and it can scale with the number of queues, while the previous version cannot.

All green, right? So why have I reverted the lock-free thread pool?

Because with low utilization of the broker (i.e. 1 producer/1 consumer/1 queue) the latencies and throughput were actually worse: CPU utilization graphs showed that ProcessorBase::onAddedTaskIfNotRunning was spending most of its time waking the shared thread pool. The same was happening with a ForkJoinPool, sadly. It seems (and it is just a guess) that, because tasks get consumed faster (there is no lock preventing them from being polled and executed), the thread pool goes idle sooner (the default thread pool size is 30 and I have a machine with just 8 real cores), forcing every new task submission to wake one of the pool threads to process the incoming tasks.

What are your thoughts on this? TBH I don't want to trade the "low utilization" performance so heavily for the scaling; that's why I've preferred to revert the change. Note that other applications with scalability needs (e.g. Cassandra) have moved their SEDA-based shared pool approach to a thread-per-core architecture for this same reason.

Cheers,
Franz
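P.S.: for a rough picture of what a thread-per-core layout means in practice, something like the sketch below (illustrative only — neither Cassandra's nor Artemis' actual code, and the lane-picking helper is made up):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative sketch: one single-threaded executor per core. An actor
    // is pinned to one lane, so its tasks never contend on a single shared
    // pool queue and a busy lane stays hot instead of parking/unparking
    // between submissions.
    final class PerCoreLanes {
       private final ExecutorService[] lanes;

       PerCoreLanes() {
          int cores = Runtime.getRuntime().availableProcessors();
          lanes = new ExecutorService[cores];
          for (int i = 0; i < cores; i++) {
             lanes[i] = Executors.newSingleThreadExecutor();
          }
       }

       // hypothetical: hash an actor/queue id to a lane once, reuse it forever
       ExecutorService laneFor(int actorId) {
          return lanes[Math.floorMod(actorId, lanes.length)];
       }
    }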
