Could a squashed PR be sent?
On Fri, Jul 12, 2019 at 2:23 PM +0100, "yw yw" <[email protected]> wrote:

Hi,

I have finished work on the new implementation (tests and configuration are not done yet), as suggested by franz. I put a fileOffset in the PagePosition and added a new class, PageReader, a wrapper around the page that implements the PageCache interface. The PageReader class is used to read the page file if the cache has been evicted. For details, see https://github.com/wy96f/activemq-artemis/commit/3f388c2324738f01f53ce806b813220d28d40987

I ran some tests; results below:

1. Running with a 51MB page size and 1 page cache, with 100 multicast queues.
https://filebin.net/wnyan7d2n1qgfsvg
2. Running with a 5MB page size and 100 page caches, with 100 multicast queues.
https://filebin.net/re0989vz7ib1c5mc
3. Running with a 51MB page size and 1 page cache, with 1 queue.
https://filebin.net/3qndct7f11qckrus

The results look good, similar to the implementation in the PR. Most importantly, the index cache data is gone, so there is no extra overhead to worry about :)

On Thu, Jul 4, 2019 at 5:38 PM, yw yw wrote:

Hi, Michael

Thanks for the advice. For the current PR, we can optimize memory usage by using two arrays, one recording the message numbers and the other the corresponding offsets. For franz's approach, we will also work on an early prototype implementation. After that, we will run some basic tests in different scenarios.

On Tue, Jul 2, 2019 at 7:08 AM:

The point, though, is that an extra index cache layer is needed. Its overhead means the total paged capacity will be more limited, as that overhead isn't just an extra int per reference. E.g. the current impl in the PR isn't very memory-optimised; could an int array be used, or at worst an open primitive int-int hashmap?

This is why I really prefer franz's approach.

Also, whatever we do, we need the new behaviour to be configurable, so that a use case we haven't thought about won't be impacted. E.g. the change should not be a surprise; it should be something you toggle on.

On Mon, Jul 1, 2019 at 1:01 PM +0100, "yw yw" wrote:

Hi,

We ran a test against your configuration: 5MB / 100 / 10MB.
The current code: 7000 msg/s sent and 18000 msg/s received.
The PR code: 8200 msg/s sent and 16000 msg/s received.
As you said, performance improves for the current code when using much smaller page files and holding many more of them.

I'm not sure what implications smaller page files would have: producer performance may drop since file switching is more frequent, and the number of file handles would increase?

Our consumer in this test just echoes; it has nothing to do after receiving a message, while a consumer in the real world may be busy with business logic. That means references and page caches stay in memory longer and may be evicted more easily while producers are sending all the time.

Since we don't know how many subscribers there are, this is not a scalable approach. We can't reduce the page file size without limit to fit the number of subscribers. The code should accommodate all kinds of configurations; we adjust configuration for trade-offs as needed, not as a workaround, IMO.

In our company, ~200 queues (60% of them owned by a few addresses) are deployed on the broker. We can't set all of them to, e.g., 100 page caches (too much memory), and neither can we set different sizes per address pattern (hard to operate). In our multi-tenant cluster we prefer availability, so to avoid exhausting memory we set the page size to 30MB, the max cache size to 1, and the max size to 31MB. It's running well in one of our clusters now :)
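For concreteness, a paging trade-off like the one described above could be expressed with the embedded-broker API roughly as follows. This is only a sketch: the values are taken from the thread, and the AddressSettings method names are the Artemis API as I understand it.

import org.apache.activemq.artemis.core.server.ActiveMQServer;
import org.apache.activemq.artemis.core.settings.impl.AddressFullMessagePolicy;
import org.apache.activemq.artemis.core.settings.impl.AddressSettings;

final class PagingConfigSketch {
   // Apply the paging trade-off discussed above to every address ("#").
   static void apply(ActiveMQServer server) {
      AddressSettings paging = new AddressSettings()
            .setAddressFullMessagePolicy(AddressFullMessagePolicy.PAGE)
            .setPageSizeBytes(30 * 1024 * 1024)  // 30MB page files
            .setPageCacheMaxSize(1)              // hold at most one page cache in memory
            .setMaxSizeBytes(31 * 1024 * 1024);  // start paging just above one page
      server.getAddressSettingsRepository().addMatch("#", paging);
   }
}

The same knobs appear in broker.xml as page-size-bytes, page-max-cache-size and max-size-bytes under address-settings.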
On Sat, Jun 29, 2019 at 2:35 AM:

I think some of that is down to configuration. You could configure paging to have much smaller page files but keep many more of them held. That way the reference sizes will be far smaller, and pages will drop in and out less. E.g. if you expect 100 pages to be read, make it 100, but make the page sizes smaller so the overhead is far less.

On Thu, Jun 27, 2019 at 11:10 AM +0100, "yw yw" wrote:

"In the end, one message may be read twice: first we read the page and create the page reference; second we re-query the message after its reference is removed."

I just realized this was wrong. One message may be read many times. Think of this: when messages #1~#2000 are delivered, we need to depage #2001~#4000, which reads the whole page; when messages #2001~#4000 are delivered, we need to depage #4001~#6000, which reads the page again, and so on.

A message may even be read three times if we don't depage until all messages are delivered. For example, say we have three pages p1, p2, p3 and a message m1 in the top part of p2. In our case (max-size-bytes=51MB, a little bigger than the page size), the first depage round reads the bottom half of p1 and the top part of p2; the second depage round reads the bottom half of p2 and the top part of p3. Therefore p2 is read twice, and m1 may be read three times if it is re-queried.

To be honest, I don't know how to fix the problem above with the decentralized approach. The point is not how much we rely on the OS cache; it's that we do it the wrong way: we shouldn't read a whole page (50MB) just for ~2000 messages. There is also no need to keep 51MB of PagedReferenceImpl in memory. When 100 queues occupy 5100MB of memory, the message references are very likely to be removed.

On Thu, Jun 27, 2019 at 5:05 PM, Francesco Nigro wrote:

"which means the offset info is 100 times larger compared to the shared page index cache."

I would check with the JOL plugin for exact numbers. With it, I see that we would have an increase of 4 bytes for each PagedReferenceImpl, totally decentralized, versus a centralized approach (the cache). In the economy of a fully loaded broker, if we care about scaling, we need to understand whether the memory trade-off is important enough to choose one of the two approaches.

My point is that paging could be made totally based on the OS page cache, taking GC out of the middle by deleting any previous mechanism of page caching, simplifying the process as it is. Using a two-level cache with such a centralized approach can work, but it will add a level of complexity that IMO could be saved. What do you think the benefit of the decentralized solution would be, compared with the one proposed in the PR?
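To make the decentralized idea concrete: each paged reference remembers where its message starts in the page file, so a single message can be re-read with one positioned read while the OS page cache keeps hot blocks in memory. Below is a minimal sketch with hypothetical names; it is not the Artemis code.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// A page position that also carries the message's byte offset in the page
// file: the extra int per reference discussed above.
final class OffsetPosition {
   final long pageNr;
   final int messageNr;
   final int fileOffset;

   OffsetPosition(long pageNr, int messageNr, int fileOffset) {
      this.pageNr = pageNr;
      this.messageNr = messageNr;
      this.fileOffset = fileOffset;
   }

   // Re-read just this message instead of the whole 50MB page; repeated
   // reads of hot regions are served from the OS page cache.
   ByteBuffer readMessage(Path pageFile, int encodedSize) throws IOException {
      try (FileChannel ch = FileChannel.open(pageFile, StandardOpenOption.READ)) {
         ByteBuffer buf = ByteBuffer.allocate(encodedSize);
         ch.read(buf, fileOffset);
         buf.flip();
         return buf;
      }
   }
}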
On Thu, Jun 27, 2019 at 10:41 AM, yw yw wrote:

Sorry, I missed the PageReference part.

The lifecycle of a PageReference is: depage (in intermediateMessageReferences) -> deliver (in messageReferences) -> waiting for ack (in deliveringRefs) -> removed. Every queue creates its own PageReference, which means the offset info is 100 times larger compared to the shared page index cache. If we keep 51MB of PageReferences in memory then, as I said in the PR, "For multiple subscribers to the same address, just one executor is responsible for delivering, which means at any given moment only one queue is delivering. Thus a queue may be stalled for a long time. We get queueMemorySize messages into memory, and when we deliver them after a long time, we probably need to query the message and read the page file again." In the end, one message may be read twice: first we read the page and create the page reference; second we re-query the message after its reference is removed.

With the shared page index cache design, each message needs to be read from the file only once.

On Thu, Jun 27, 2019 at 3:03 PM, Michael Pearce wrote:

Hi

First of all, I think this is an excellent effort and could be a potentially massive positive change.

Before making any change on such a scale, I do think we need to ensure we have sufficient benchmarks for a number of scenarios, not just one use case, and the benchmark tool used needs to be openly available so that others can verify the measurements and check them on their own setups.

Some additional scenarios I would want/need covered are:

1. PageCache set to 5, and all consumers keeping up but lagging enough to be reading from the same first page cache; latency and throughput need to be measured for all of them.
2. PageCache set to 5, and all consumers but one keeping up (lagging enough to be reading from the same first page cache) while the one falls off the end, causing page cache swapping; measure the latency and throughput of those keeping up in the first page cache, not caring about the one.

Regarding the solution, some alternative approaches to discuss:

In your scenario, if I understand correctly, each subscriber effectively has its own queue (a 1-to-1 mapping), not a shared one. You mention Kafka and say multiple consumers don't read serially on the address, and this is true, but per-queue processing of messages (dispatch) is still serial, even with multiple shared consumers on a queue.

What about keeping the existing mechanism but having a queue hold a reference to the page cache it is currently on, keeping it from GC (i.e. not soft)? That way the page cache isn't swapped around when you have queues (in your case, subscribers) swapping page caches back and forth, avoiding the constant re-read issue.
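A minimal sketch of this pinning idea, with hypothetical names (not Artemis code): pages stay softly referenced as today, but the page a queue is currently reading is held strongly until the queue moves on.

import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PinnedPageCaches<P> {
   // pageNr -> softly referenced page cache, reclaimable under memory pressure
   private final Map<Long, SoftReference<P>> soft = new ConcurrentHashMap<>();
   // queueId -> strongly held page the queue is currently reading
   private final Map<String, P> pinnedByQueue = new ConcurrentHashMap<>();

   void put(long pageNr, P page) {
      soft.put(pageNr, new SoftReference<>(page));
   }

   // Pin the page for this queue so GC cannot reclaim it mid-read.
   P pin(String queueId, long pageNr) {
      SoftReference<P> ref = soft.get(pageNr);
      P page = ref == null ? null : ref.get();
      if (page != null) {
         pinnedByQueue.put(queueId, page);
      }
      return page; // null means the caller must reload the page from disk
   }

   // Called when the queue moves to the next page; the old page is soft-only again.
   void unpin(String queueId) {
      pinnedByQueue.remove(queueId);
   }
}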
Also, I think Franz had an excellent idea: do away with the page cache in its current form entirely, ensure the offset is kept with the reference, and rely on OS caching to keep hot blocks/data.

Best
Michael

On Thu, 27 Jun 2019 at 05:13, yw yw wrote:

Hi, folks

This is the discussion about "ARTEMIS-2399 Fix performance degradation when there are a lot of subscribers".

First, apologies that I didn't clarify our thoughts.

As noted in the Environment part, page-max-cache-size is set to 1, meaning at most one page is allowed in the softValueCache. We have tested with the default page-max-cache-size of 5; it just takes some time to see the performance degradation, since at the start the cursor positions of the 100 subscribers are similar and all message reads hit the softValueCache. But after some time the cursor positions diverge. Once these positions span more than 5 pages, some pages are read back and forth. This can be seen in the trace log "adding pageCache pageNr=xxx into cursor = test-topic" in PageCursorProviderImpl, where some pages are read many times for the same subscriber. From that point on, performance starts to degrade. So we set page-max-cache-size to 1 here just to make the test run faster; it doesn't change the final result.

The softValueCache is cleared if memory gets really low, and in addition when the map size reaches its capacity (default 5). In most cases the subscribers are tailing reads, which are served by the softValueCache (no need to touch the disk), so we need to keep it. But when some subscribers fall behind, they need to read pages that are not in the softValueCache. After looking at the code, we found that in most situations one depage round follows at most MAX_SCHEDULED_RUNNERS delivery rounds, which is to say that at most MAX_DELIVERIES_IN_LOOP * MAX_SCHEDULED_RUNNERS messages are depaged next. If you set the QueueImpl logger to debug level, you will see logs like "Queue Memory Size after depage on queue=sub4 is 53478769 with maxSize = 52428800. Depaged 68 messages, pendingDelivery=1002, intermediateMessageReferences=23162, queueDelivering=0". So in order to depage fewer than 2000 messages, each subscriber has to read a whole page, which is unnecessary and wasteful. In our test, where one page (50MB) contains ~40000 messages, one subscriber may read the page 40000/2000 = 20 times, if the softValueCache is evicted, to finish delivering it. This drastically slows down the process and burdens the disk. So we added PageIndexCacheImpl and read one message at a time rather than reading all the messages in a page. This way, for each subscriber, each page is read only once after delivery finishes.
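A minimal sketch of that index idea, with hypothetical names (not the PR's actual PageIndexCacheImpl): since the page is scanned once anyway to build the PageCursorInfo, record each message's offset during that scan, then deliver by seeking straight to the recorded offset instead of re-reading the whole page. It assumes message numbers within a page run densely from 0 to N-1.

// Built once during the single sequential scan of a page; afterwards any
// message can be located without reading the other ~40000 messages.
final class PageIndexSketch {
   private final int[] offsets; // messageNr -> byte offset within the page file

   PageIndexSketch(int numberOfMessages) {
      this.offsets = new int[numberOfMessages];
   }

   // Called for each message while the page is scanned to build PageCursorInfo.
   void record(int messageNr, int fileOffset) {
      offsets[messageNr] = fileOffset;
   }

   // Used at delivery time: seek here and read a single message.
   int offsetOf(int messageNr) {
      return offsets[messageNr];
   }
}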
Having said that, the softValueCache is used for tailing reads. If it's evicted, it won't be reloaded, to prevent the issue illustrated above; the pageIndexCache is used instead.

Regarding implementation details, we noted that before delivering a page, a PageCursorInfo is constructed, which needs to read the whole page. We can take this opportunity to construct the pageIndexCache, which is very simple to code. We also thought about building an offset index file, but have some concerns stemming from the following:

1. When do we write and sync the index file? Would that have performance implications?
2. If we have an index file, we can construct the PageCursorInfo from it (no need to read the page as before), but we need to write the total message number into it first. It seems a little weird to put this into the index file.
3. After a hard crash, a recovery mechanism would be needed to recover the page and page index files, e.g. truncating them to the valid size. So how do we know which files need to be sanity-checked?
4. A variant binary search algorithm may be needed (a floor-style lookup along these lines is sketched below); see
https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala
5. Unlike Kafka, where the user fetches lots of messages at once and the broker only needs to look up the start offset in the index file once, Artemis delivers messages one by one, which means we have to look up the index every time we deliver a message. Although the index file is probably in the page cache, there is still a chance we miss the cache.
6. Compatibility with old files.

To sum up, Kafka uses an mmapped index file and we use an index cache. Both are designed to find the physical file position from an offset (Kafka) or a message number (Artemis). We prefer the index cache because it's easier to understand and maintain.
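A minimal sketch of such a floor lookup over a sparse two-array index, in the spirit of Kafka's AbstractIndex but with hypothetical names (not code from either project): find the greatest indexed message number <= the target, then scan forward from its offset.

// Two parallel arrays: messageNumbers[i] is indexed at fileOffsets[i].
final class SparseIndexSketch {
   private final int[] messageNumbers; // sorted ascending
   private final int[] fileOffsets;

   SparseIndexSketch(int[] messageNumbers, int[] fileOffsets) {
      this.messageNumbers = messageNumbers;
      this.fileOffsets = fileOffsets;
   }

   // Returns the offset of the greatest indexed entry <= target, or -1 if none;
   // the caller scans forward from that offset to reach the exact message.
   int floorOffset(int target) {
      int lo = 0, hi = messageNumbers.length - 1, found = -1;
      while (lo <= hi) {
         int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow
         if (messageNumbers[mid] <= target) {
            found = fileOffsets[mid];
            lo = mid + 1;
         } else {
            hi = mid - 1;
         }
      }
      return found;
   }
}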
We also tested the single-subscriber case with the same setup.
The original:
consumer tps (11000 msg/s) and latency:
[image: orig_single_subscriber.png]
producer tps (30000 msg/s) and latency:
[image: orig_single_producer.png]
The PR:
consumer tps (14000 msg/s) and latency:
[image: pr_single_consumer.png]
producer tps (30000 msg/s) and latency:
[image: pr_single_producer.png]
The results are similar, with the PR even a little better in the single-subscriber case.

We used our internal test platform, and I think JMeter could also be used to test against it; a minimal client along those lines is sketched below.
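For reference, the shape of such a test client could be as simple as the sketch below; the broker URL, topic name, subscriber count and measurement window are assumptions for illustration, and a separate producer is expected to drive load into test-topic.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.Topic;
import java.util.concurrent.atomic.LongAdder;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public final class SubscriberLoadSketch {
   public static void main(String[] args) throws Exception {
      ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
      LongAdder received = new LongAdder();
      Connection conn = cf.createConnection();
      conn.start();
      // 100 multicast subscribers on one address, as in the tests above;
      // each consumer just counts ("echoes") what it receives.
      for (int i = 0; i < 100; i++) {
         Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
         Topic topic = session.createTopic("test-topic");
         MessageConsumer consumer = session.createConsumer(topic);
         consumer.setMessageListener(msg -> received.increment());
      }
      received.reset();
      long start = System.nanoTime();
      Thread.sleep(60_000); // measure for one minute
      double secs = (System.nanoTime() - start) / 1e9;
      System.out.printf("~%.0f msg/s across all subscribers%n", received.sum() / secs);
      conn.close();
   }
}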
