@Michael

> If you're talking about the solution that you reverted just before we
> released, then we did test it in our real testing env. I didn't notice
> any negative impact for our use cases.

That's good to hear, Michael. Perhaps I could have avoided reverting it,
but consider that the academic use case, i.e. 1 P/1 C/1 Q, is just a way
to emulate a "low-utilization" case.
As Christopher said, there are users who don't want to trade away
low-utilization performance...and I understand that.
The thing is, have you checked whether using LinkedTransferQueue improves
over the original thread pool in a many-producers/consumers/queues
scenario?
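
Just so we're comparing the same thing, the kind of swap I have in mind is
roughly the following (a sketch only: the pool sizes and the missing
ActiveMQThreadPoolExecutor-specific wiring are illustrative, not what the
broker actually does):

    import java.util.concurrent.LinkedTransferQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public final class SharedPoolSketch {
       public static void main(String[] args) {
          // Same shape as a plain JDK pool, but the work queue is the
          // lock-free LinkedTransferQueue instead of a lock-guarded
          // LinkedBlockingQueue.
          ThreadPoolExecutor pool = new ThreadPoolExecutor(
                30, 30,                        // sized like the default shared pool
                60, TimeUnit.SECONDS,          // let idle workers time out
                new LinkedTransferQueue<>());  // no ReentrantLock on offer/poll
          pool.allowCoreThreadTimeOut(true);
          pool.execute(() -> System.out.println("task executed"));
          pool.shutdown();
       }
    }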

> Regarding the code, I actually thought what you had made the code cleaner.

That's another benefit, but the change was meant to provide only benefits;
since it didn't, I've reverted it...

> I think we engineer for multiple consumers/multiple producers and should
> test with such a setup.

I agree with you, but it hurts my heart to deliver a patch that causes perf
regressions without providing a huge benefit proven in other cases...

Thanks guys,
Franz




On Wed, 20 Mar 2019 at 13:34
<[email protected]> wrote:

> So Franz.
>
> If you're talking about the solution that you reverted just before we
> released, then we did test it in our real testing env. I didn't notice
> any negative impact for our use cases.
>
> Regarding the code, I actually thought what you had made the code cleaner.
>
> I'm not sure what use case you're concerned about with your change; generally
> I expect brokers to have more than just one queue and just one
> producer/consumer in real world use cases. I think having just one
> producer, one consumer and one queue on a whole broker is very academic and
> not a typical real world case. I think we engineer for multiple
> consumers/multiple producers and should test with such a setup.
>
> On Wed, Mar 20, 2019 at 1:15 PM +0100, "Francesco Nigro" <
> [email protected]> wrote:
>
> > That being said, is there an actual real world throughput issue here?
>
> Yes and no: it's a chance to improve things, especially for cloud uses. It
> is a fact that, now that Spectre and Meltdown are there, we don't want to
> waste CPU time sitting idle/on contention if it isn't needed and, as I've
> said, "it is a giant lock on any task submitted".
> IMO having a talk on how to improve it is not over-engineering, but just
> engineering, given that scaling non-persistent messages (or persistent ones
> with very fast disks) is something that we expect from a broker: from a
> commercial point of view it is nice that we can scale by adding brokers,
> but if you can save 2 machines to get the same throughput I think that is
> a nice improvement for (m)any users.
>
> > I don't know that I see much value in over engineering and micro managing
> > this stuff unless there's a real world measurable benefit to be gained vs
> > just theoretical benchmarks as it's just going to make things harder to
> > maintain and mistakes easier to make in the future.
>
> Cassandra from DataStax has gained about 2X throughput by solving this; it
> can be said that's a "different scenario" too, but as an engineer I can say
> no, it is not.
> I've "recently" addressed a similar "issue" on qpid-jms with the client
> team, getting near 2X throughput (nudge nudge Robbie Gemmell/Tim Bish).
> And this "issue" (actually, a "chance to improve things") has been well
> hidden, although in plain sight, for a long time:
> https://issues.apache.org/jira/browse/QPIDJMS-396.
>
> The reason why I've written to the dev list is to understand if anyone has
> had the chance to measure something like this under a real load scenario.
>
>
> On Wed, 20 Mar 2019 at 12:07, Christopher Shannon <
> [email protected]> wrote:
>
> > I don't think sacrificing low utilization is a good idea. That being said,
> > is there an actual real world throughput issue here? In general, I don't
> > know that I see much value in over engineering and micro managing this
> > stuff unless there's a real world measurable benefit to be gained vs just
> > theoretical benchmarks, as it's just going to make things harder to
> > maintain and mistakes easier to make in the future.
> >
> > On Wed, Mar 20, 2019 at 6:51 AM Francesco Nigro
> > wrote:
> >
> > > Hi folks,
> > >
> > > I'm writing here to share some thoughts related to the Artemis
> > > threading model and how it affects broker scalability.
> > >
> > > Currently (on 2.7.0) we rely on a shared thread pool, i.e.
> > > ActiveMQThreadPoolExecutor, backed by a LinkedBlockingQueue-ish queue
> > > to process tasks.
> > > Thanks to the Actor abstraction we use a lock-free queue to serialize
> > > tasks (or items), processing them in batches on the shared thread pool,
> > > waking a consumer thread only if needed (the logic is contained in
> > > ProcessorBase).
> > > The waking operation (i.e. ProcessorBase::onAddedTaskIfNotRunning)
> > > will execute on the shared thread pool a specific task that drains and
> > > executes a batch of tasks only if necessary, not on every added
> > > task/item.
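> > >
> > > A minimal sketch of that pattern (simplified: the real ProcessorBase
> > > keeps an int state, handles shutdown/yield and so on, so this is just
> > > to illustrate the idea):
> > >
> > >     import java.util.Queue;
> > >     import java.util.concurrent.ConcurrentLinkedQueue;
> > >     import java.util.concurrent.Executor;
> > >     import java.util.concurrent.atomic.AtomicBoolean;
> > >
> > >     // Sketch of the "wake the shared pool only if not already running" idea.
> > >     final class MiniActor {
> > >        private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
> > >        private final AtomicBoolean running = new AtomicBoolean(false);
> > >        private final Executor sharedPool;
> > >
> > >        MiniActor(Executor sharedPool) { this.sharedPool = sharedPool; }
> > >
> > >        void submit(Runnable task) {
> > >           tasks.offer(task);                        // lock-free enqueue
> > >           if (running.compareAndSet(false, true)) { // only the submitter that
> > >              sharedPool.execute(this::drain);       // finds it idle pays the "awake"
> > >           }
> > >        }
> > >
> > >        private void drain() {
> > >           Runnable t;
> > >           while ((t = tasks.poll()) != null) {      // process the whole batch
> > >              t.run();
> > >           }
> > >           running.set(false);                       // go idle again
> > >           if (!tasks.isEmpty() && running.compareAndSet(false, true)) {
> > >              sharedPool.execute(this::drain);       // a task raced in: reschedule
> > >           }
> > >        }
> > >     }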
> > >
> > > Looking at the contention graphs of the broker (i.e. the bar width is
> > > the number of nanoseconds spent before entering a lock), the limitation
> > > of the current implementation is quite clear:
> > > [image: image.png]
> > >
> > > In violet are shown the offer and poll operations on the
> > > LinkedBlockingQueue of the shared thread pool, happening from any
> > > thread of the pool (the thread is the base of each bar, in red).
> > > The LinkedBlockingQueue indeed has a ReentrantLock protecting any
> > > operation on the linked q, and it is clear that having a giant lock in
> > > front of a high-contention point won't scale.
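> > >
> > > To make the point concrete, here is a very rough probe (not a rigorous
> > > benchmark: no warmup, no JMH) of several threads hammering offer/poll
> > > on a lock-guarded queue versus a lock-free one:
> > >
> > >     import java.util.Queue;
> > >     import java.util.concurrent.ConcurrentLinkedQueue;
> > >     import java.util.concurrent.CountDownLatch;
> > >     import java.util.concurrent.ExecutorService;
> > >     import java.util.concurrent.Executors;
> > >     import java.util.concurrent.LinkedBlockingQueue;
> > >
> > >     public final class QueueContentionProbe {
> > >        public static void main(String[] args) throws Exception {
> > >           probe("LinkedBlockingQueue", new LinkedBlockingQueue<>());
> > >           probe("ConcurrentLinkedQueue", new ConcurrentLinkedQueue<>());
> > >        }
> > >
> > >        private static void probe(String name, Queue<Integer> q) throws Exception {
> > >           final int threads = 8, ops = 1_000_000;
> > >           ExecutorService pool = Executors.newFixedThreadPool(threads);
> > >           CountDownLatch done = new CountDownLatch(threads);
> > >           long start = System.nanoTime();
> > >           for (int t = 0; t < threads; t++) {
> > >              pool.execute(() -> {
> > >                 for (int i = 0; i < ops; i++) {
> > >                    q.offer(i);   // every offer/poll hits the same queue,
> > >                    q.poll();     // so contention on it dominates
> > >                 }
> > >                 done.countDown();
> > >              });
> > >           }
> > >           done.await();
> > >           pool.shutdown();
> > >           System.out.printf("%s: %d ms%n", name, (System.nanoTime() - start) / 1_000_000);
> > >        }
> > >     }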
> > >
> > > The above graph has been obtained with a single producer/single
> > > consumer/single queue/not-persistent run, but I don't have enough
> > > resources to check what could happen with more and more
> > > producers/consumers/queues.
> > > The critical part is the offering/polling of tasks on the shared thread
> > > pool: in theory a maxed-out broker shouldn't have many idle threads to
> > > be awakened, but given that more producers/consumers/queues means many
> > > different Actors, in order to guarantee that each actor's tasks get
> > > executed, the shared thread pool will need to process many unnecessary
> > > "awake" tasks, creating a lot of contention on the blocking linked q
> > > and slowing down the entire broker.
> > >
> > > In the past I've tried to replace the current shared thread pool
> > > implementation with a ForkJoinPool or (the most recent attempt) by
> > > using a lock-free q instead of the LinkedBlockingQueue, with no success
> > > (https://github.com/apache/activemq-artemis/pull/2582).
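> > >
> > > (For reference, the ForkJoinPool attempt was roughly along these lines;
> > > the parallelism and flags here are just illustrative, not what the PR
> > > actually wires up:)
> > >
> > >     import java.util.concurrent.ExecutorService;
> > >     import java.util.concurrent.ForkJoinPool;
> > >
> > >     final class ForkJoinSharedPool {
> > >        // A work-stealing pool as the shared executor, instead of a
> > >        // ThreadPoolExecutor fed by a blocking linked queue.
> > >        static final ExecutorService SHARED = new ForkJoinPool(
> > >              Runtime.getRuntime().availableProcessors(),
> > >              ForkJoinPool.defaultForkJoinWorkerThreadFactory,
> > >              null,   // default uncaught exception handling
> > >              true);  // asyncMode: FIFO order for plain submitted tasks
> > >     }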
> > >
> > > Below is the contention graph using a lock-free q in the shared thread
> > > pool:
> > >
> > > [image: image.png]
> > >
> > > In violet now we have QueueImpl::deliver and RefsOperation::afterCommit
> > > contending on the QueueImpl lock, but the numbers for each bar are very
> > > different: in the previous graph the contention on the shared thread
> > > pool lock is around 600 ns, while here it is 20-80 ns, and it can scale
> > > with the number of queues, while the previous version cannot.
> > >
> > > All green, right? So why have I reverted the lock-free thread pool?
> > >
> > > Because with a low utilization of the broker (i.e. 1 producer/1
> > > consumer/1 queue) the latencies and throughput were actually worse: CPU
> > > utilization graphs showed that ProcessorBase::onAddedTaskIfNotRunning
> > > was spending most of its time waking up the shared thread pool. The
> > > same was happening with a ForkJoin pool, sadly.
> > > It seems (and it is just a guess) that, given that tasks get consumed
> > > faster (there is no lock preventing them from being polled and
> > > executed), the thread pool gets idle sooner (the default thread pool
> > > size is 30 and I have a machine with just 8 real cores), forcing any
> > > new task submission to wake one of the pool threads to process incoming
> > > tasks.
> > >
> > > What are your thoughts on this?
> > > I don't want to trade too much of the "low utilization" performance
> > > for the scaling, TBH; that's why I've preferred to revert the change.
> > > Note that other applications with scalability needs (e.g. Cassandra)
> > > have changed their SEDA-based shared pool approach to a thread-per-core
> > > architecture for this same reason.
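> > >
> > > Just to illustrate that direction (not a proposal for the broker
> > > internals as-is), the idea is to pin each actor/queue to one of N
> > > single-threaded lanes:
> > >
> > >     import java.util.concurrent.ExecutorService;
> > >     import java.util.concurrent.Executors;
> > >
> > >     // Each actor/queue is hashed onto a fixed single-threaded executor,
> > >     // so its tasks never fight every other actor on one shared work queue.
> > >     final class PinnedExecutors {
> > >        private final ExecutorService[] lanes;
> > >
> > >        PinnedExecutors(int cores) {
> > >           lanes = new ExecutorService[cores];
> > >           for (int i = 0; i < cores; i++) {
> > >              lanes[i] = Executors.newSingleThreadExecutor();
> > >           }
> > >        }
> > >
> > >        ExecutorService laneFor(Object actorKey) {
> > >           return lanes[Math.floorMod(actorKey.hashCode(), lanes.length)];
> > >        }
> > >     }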
> > >
> > > Cheers,
> > > Franz
