Could a squashed PR be sent?
On Fri, Jul 12, 2019 at 2:23 PM +0100, "yw yw" <[email protected]> wrote:

Hi,

I have finished work on the new implementation (tests and configuration are not done yet), as suggested by franz. I put a fileOffset in the PagePosition and added a new class, PageReader, a wrapper around the page that implements the PageCache interface. The PageReader class is used to read the page file if the cache has been evicted. For details, see https://github.com/wy96f/activemq-artemis/commit/3f388c2324738f01f53ce806b813220d28d40987

I ran some tests; results below:

1. Running with a 51MB page size and 1 page cache, with 100 multicast queues.
https://filebin.net/wnyan7d2n1qgfsvg
2. Running with a 5MB page size and 100 page caches, with 100 multicast queues.
https://filebin.net/re0989vz7ib1c5mc
3. Running with a 51MB page size and 1 page cache, with 1 queue.
https://filebin.net/3qndct7f11qckrus

The results look good, similar to the implementation in the PR. Most importantly, the index cache data is gone, so there is no extra overhead to worry about :)

On Thu, Jul 4, 2019 at 5:38 PM, yw yw wrote:

Hi, Michael

Thanks for the advice. For the current PR, we can optimize memory usage by using two arrays, one recording the message numbers and the other the corresponding offsets. For franz's approach, we will also work on an early prototype implementation. After that, we will run some basic tests in different scenarios.

On Tue, Jul 2, 2019 at 7:08 AM:

The point, though, is that an extra index cache layer is needed. Its overhead means the total paged capacity will be more limited, as that overhead isn't just an extra int per reference. E.g. the current impl in the PR isn't very memory-optimised; could an int array be used, or at worst an open primitive int-int hashmap?

This is why I really prefer franz's approach.

Also, whatever we do, we need the new behaviour to be configurable, so that a use case we haven't thought about won't be impacted. E.g. the change should not be a surprise; it should be something you toggle on.

On Mon, Jul 1, 2019 at 1:01 PM +0100, "yw yw" wrote:

Hi,

We ran a test against your configuration: 5MB / 100 / 10MB.
The current code: 7000 msg/s sent and 18000 msg/s received.
The PR code: 8200 msg/s sent and 16000 msg/s received.
As you said, performance improves for the current code when using much smaller page files and holding many more of them.

I'm not sure what implications smaller page files would have: producer performance may drop since file switching is more frequent, and the number of file handles would increase?

Our consumer in this test just echoes; it has nothing to do after receiving a message, while a consumer in the real world may be busy with business logic. That means references and page caches stay in memory longer and may be evicted more easily while producers are sending all the time.

Since we don't know how many subscribers there are, this is not a scalable approach. We can't reduce the page file size without limit to fit the number of subscribers. The code should accommodate all kinds of configurations; we adjust configuration for trade-offs as needed, not as a workaround, IMO.

In our company, ~200 queues (60% of them owned by a few addresses) are deployed on the broker. We can't set all of them to, e.g., 100 page caches (too much memory), and neither can we set different sizes per address pattern (hard to operate). In our multi-tenant cluster we prefer availability, so to avoid exhausting memory we set the page size to 30MB, the max cache size to 1, and the max size to 31MB. It's running well in one of our clusters now :)
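For concreteness, a paging trade-off like the one described above could be expressed with the embedded-broker API roughly as follows. This is only a sketch: the values are taken from the thread, and the AddressSettings method names are the Artemis API as I understand it.

import org.apache.activemq.artemis.core.server.ActiveMQServer;
import org.apache.activemq.artemis.core.settings.impl.AddressFullMessagePolicy;
import org.apache.activemq.artemis.core.settings.impl.AddressSettings;

final class PagingConfigSketch {
   // Apply the paging trade-off discussed above to every address ("#").
   static void apply(ActiveMQServer server) {
      AddressSettings paging = new AddressSettings()
            .setAddressFullMessagePolicy(AddressFullMessagePolicy.PAGE)
            .setPageSizeBytes(30 * 1024 * 1024)  // 30MB page files
            .setPageCacheMaxSize(1)              // hold at most one page cache in memory
            .setMaxSizeBytes(31 * 1024 * 1024);  // start paging just above one page
      server.getAddressSettingsRepository().addMatch("#", paging);
   }
}

The same knobs appear in broker.xml as page-size-bytes, page-max-cache-size and max-size-bytes under address-settings.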
On Sat, Jun 29, 2019 at 2:35 AM:

I think some of that is down to configuration. You could configure paging to have much smaller page files but keep many more of them held. That way the reference sizes will be far smaller, and pages will drop in and out less. E.g. if you expect 100 pages to be read, make it 100, but make the page sizes smaller so the overhead is far less.

On Thu, Jun 27, 2019 at 11:10 AM +0100, "yw yw" wrote:

"In the end, one message may be read twice: first we read the page and create the page reference; second we re-query the message after its reference is removed."

I just realized this was wrong. One message may be read many times. Think of this: when messages #1~#2000 are delivered, we need to depage #2001~#4000, which reads the whole page; when messages #2001~#4000 are delivered, we need to depage #4001~#6000, which reads the page again, and so on.

A message may even be read three times if we don't depage until all messages are delivered. For example, say we have three pages p1, p2, p3 and a message m1 in the top part of p2. In our case (max-size-bytes=51MB, a little bigger than the page size), the first depage round reads the bottom half of p1 and the top part of p2; the second depage round reads the bottom half of p2 and the top part of p3. Therefore p2 is read twice, and m1 may be read three times if it is re-queried.

To be honest, I don't know how to fix the problem above with the decentralized approach. The point is not how much we rely on the OS cache; it's that we do it the wrong way: we shouldn't read a whole page (50MB) just for ~2000 messages. There is also no need to keep 51MB of PagedReferenceImpl in memory. When 100 queues occupy 5100MB of memory, the message references are very likely to be removed.

On Thu, Jun 27, 2019 at 5:05 PM, Francesco Nigro wrote:

"which means the offset info is 100 times larger compared to the shared page index cache."

I would check with the JOL plugin for exact numbers. With it, I see that we would have an increase of 4 bytes for each PagedReferenceImpl, totally decentralized, versus a centralized approach (the cache). In the economy of a fully loaded broker, if we care about scaling, we need to understand whether the memory trade-off is important enough to choose one of the two approaches.

My point is that paging could be made totally based on the OS page cache, taking GC out of the middle by deleting any previous mechanism of page caching, simplifying the process as it is. Using a two-level cache with such a centralized approach can work, but it will add a level of complexity that IMO could be saved. What do you think the benefit of the decentralized solution would be, compared with the one proposed in the PR?
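To make the decentralized idea concrete: each paged reference remembers where its message starts in the page file, so a single message can be re-read with one positioned read while the OS page cache keeps hot blocks in memory. Below is a minimal sketch with hypothetical names; it is not the Artemis code.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// A page position that also carries the message's byte offset in the page
// file: the extra int per reference discussed above.
final class OffsetPosition {
   final long pageNr;
   final int messageNr;
   final int fileOffset;

   OffsetPosition(long pageNr, int messageNr, int fileOffset) {
      this.pageNr = pageNr;
      this.messageNr = messageNr;
      this.fileOffset = fileOffset;
   }

   // Re-read just this message instead of the whole 50MB page; repeated
   // reads of hot regions are served from the OS page cache.
   ByteBuffer readMessage(Path pageFile, int encodedSize) throws IOException {
      try (FileChannel ch = FileChannel.open(pageFile, StandardOpenOption.READ)) {
         ByteBuffer buf = ByteBuffer.allocate(encodedSize);
         ch.read(buf, fileOffset);
         buf.flip();
         return buf;
      }
   }
}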
On Thu, Jun 27, 2019 at 10:41 AM, yw yw wrote:

Sorry, I missed the PageReference part.

The lifecycle of a PageReference is: depage (in intermediateMessageReferences) -> deliver (in messageReferences) -> waiting for ack (in deliveringRefs) -> removed. Every queue creates its own PageReference, which means the offset info is 100 times larger compared to the shared page index cache. If we keep 51MB of PageReferences in memory then, as I said in the PR, "For multiple subscribers to the same address, just one executor is responsible for delivering, which means at any given moment only one queue is delivering. Thus a queue may be stalled for a long time. We get queueMemorySize messages into memory, and when we deliver them after a long time, we probably need to query the message and read the page file again." In the end, one message may be read twice: first we read the page and create the page reference; second we re-query the message after its reference is removed.

With the shared page index cache design, each message needs to be read from the file only once.

On Thu, Jun 27, 2019 at 3:03 PM, Michael Pearce wrote:

Hi

First of all, I think this is an excellent effort and could be a potentially massive positive change.

Before making any change on such a scale, I do think we need to ensure we have sufficient benchmarks for a number of scenarios, not just one use case, and the benchmark tool used needs to be openly available so that others can verify the measurements and check them on their own setups.

Some additional scenarios I would want/need covered are:

1. PageCache set to 5, and all consumers keeping up but lagging enough to be reading from the same first page cache; latency and throughput need to be measured for all of them.
2. PageCache set to 5, and all consumers but one keeping up (lagging enough to be reading from the same first page cache) while the one falls off the end, causing page cache swapping; measure the latency and throughput of those keeping up in the first page cache, not caring about the one.

Regarding the solution, some alternative approaches to discuss:

In your scenario, if I understand correctly, each subscriber effectively has its own queue (a 1-to-1 mapping), not a shared one. You mention Kafka and say multiple consumers don't read serially on the address, and this is true, but per-queue processing of messages (dispatch) is still serial, even with multiple shared consumers on a queue.

What about keeping the existing mechanism but having a queue hold a reference to the page cache it is currently on, keeping it from GC (i.e. not soft)? That way the page cache isn't swapped around when you have queues (in your case, subscribers) swapping page caches back and forth, avoiding the constant re-read issue.
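A minimal sketch of this pinning idea, with hypothetical names (not Artemis code): pages stay softly referenced as today, but the page a queue is currently reading is held strongly until the queue moves on.

import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PinnedPageCaches<P> {
   // pageNr -> softly referenced page cache, reclaimable under memory pressure
   private final Map<Long, SoftReference<P>> soft = new ConcurrentHashMap<>();
   // queueId -> strongly held page the queue is currently reading
   private final Map<String, P> pinnedByQueue = new ConcurrentHashMap<>();

   void put(long pageNr, P page) {
      soft.put(pageNr, new SoftReference<>(page));
   }

   // Pin the page for this queue so GC cannot reclaim it mid-read.
   P pin(String queueId, long pageNr) {
      SoftReference<P> ref = soft.get(pageNr);
      P page = ref == null ? null : ref.get();
      if (page != null) {
         pinnedByQueue.put(queueId, page);
      }
      return page; // null means the caller must reload the page from disk
   }

   // Called when the queue moves to the next page; the old page is soft-only again.
   void unpin(String queueId) {
      pinnedByQueue.remove(queueId);
   }
}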
Also, I think Franz had an excellent idea: do away with the page cache in its current form entirely, ensure the offset is kept with the reference, and rely on OS caching to keep hot blocks/data.

Best
Michael

On Thu, 27 Jun 2019 at 05:13, yw yw wrote:

Hi, folks

This is the discussion about "ARTEMIS-2399 Fix performance degradation when there are a lot of subscribers".

First, apologies that I didn't clarify our thoughts.

As noted in the Environment part, page-max-cache-size is set to 1, meaning at most one page is allowed in the softValueCache. We have tested with the default page-max-cache-size of 5; it just takes some time to see the performance degradation, since at the start the cursor positions of the 100 subscribers are similar and all message reads hit the softValueCache. But after some time the cursor positions diverge. Once these positions span more than 5 pages, some pages are read back and forth. This can be seen in the trace log "adding pageCache pageNr=xxx into cursor = test-topic" in PageCursorProviderImpl, where some pages are read many times for the same subscriber. From that point on, performance starts to degrade. So we set page-max-cache-size to 1 here just to make the test run faster; it doesn't change the final result.

The softValueCache is cleared if memory gets really low, and in addition when the map size reaches its capacity (default 5). In most cases the subscribers are tailing reads, which are served by the softValueCache (no need to touch the disk), so we need to keep it. But when some subscribers fall behind, they need to read pages that are not in the softValueCache. After looking at the code, we found that in most situations one depage round follows at most MAX_SCHEDULED_RUNNERS delivery rounds, which is to say that at most MAX_DELIVERIES_IN_LOOP * MAX_SCHEDULED_RUNNERS messages are depaged next. If you set the QueueImpl logger to debug level, you will see logs like "Queue Memory Size after depage on queue=sub4 is 53478769 with maxSize = 52428800. Depaged 68 messages, pendingDelivery=1002, intermediateMessageReferences=23162, queueDelivering=0". So in order to depage fewer than 2000 messages, each subscriber has to read a whole page, which is unnecessary and wasteful. In our test, where one page (50MB) contains ~40000 messages, one subscriber may read the page 40000/2000 = 20 times, if the softValueCache is evicted, to finish delivering it. This drastically slows down the process and burdens the disk. So we added PageIndexCacheImpl and read one message at a time rather than reading all the messages in a page. This way, for each subscriber, each page is read only once after delivery finishes.
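A minimal sketch of that index idea, with hypothetical names (not the PR's actual PageIndexCacheImpl): since the page is scanned once anyway to build the PageCursorInfo, record each message's offset during that scan, then deliver by seeking straight to the recorded offset instead of re-reading the whole page. It assumes message numbers within a page run densely from 0 to N-1.

// Built once during the single sequential scan of a page; afterwards any
// message can be located without reading the other ~40000 messages.
final class PageIndexSketch {
   private final int[] offsets; // messageNr -> byte offset within the page file

   PageIndexSketch(int numberOfMessages) {
      this.offsets = new int[numberOfMessages];
   }

   // Called for each message while the page is scanned to build PageCursorInfo.
   void record(int messageNr, int fileOffset) {
      offsets[messageNr] = fileOffset;
   }

   // Used at delivery time: seek here and read a single message.
   int offsetOf(int messageNr) {
      return offsets[messageNr];
   }
}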
Having said that, the softValueCache is used for tailing reads. If it's evicted, it won't be reloaded, to prevent the issue illustrated above; the pageIndexCache is used instead.

Regarding implementation details, we noted that before delivering a page, a PageCursorInfo is constructed, which needs to read the whole page. We can take this opportunity to construct the pageIndexCache, which is very simple to code. We also thought about building an offset index file, but have some concerns stemming from the following:

1. When do we write and sync the index file? Would that have performance implications?
2. If we have an index file, we can construct the PageCursorInfo from it (no need to read the page as before), but we need to write the total message number into it first. It seems a little weird to put this into the index file.
3. After a hard crash, a recovery mechanism would be needed to recover the page and page index files, e.g. truncating them to the valid size. So how do we know which files need to be sanity-checked?
4. A variant binary search algorithm may be needed (a floor-style lookup along these lines is sketched below); see
https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala
5. Unlike Kafka, where the user fetches lots of messages at once and the broker only needs to look up the start offset in the index file once, Artemis delivers messages one by one, which means we have to look up the index every time we deliver a message. Although the index file is probably in the page cache, there is still a chance we miss the cache.
6. Compatibility with old files.

To sum up, Kafka uses an mmapped index file and we use an index cache. Both are designed to find the physical file position from an offset (Kafka) or a message number (Artemis). We prefer the index cache because it's easier to understand and maintain.
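A minimal sketch of such a floor lookup over a sparse two-array index, in the spirit of Kafka's AbstractIndex but with hypothetical names (not code from either project): find the greatest indexed message number <= the target, then scan forward from its offset.

// Two parallel arrays: messageNumbers[i] is indexed at fileOffsets[i].
final class SparseIndexSketch {
   private final int[] messageNumbers; // sorted ascending
   private final int[] fileOffsets;

   SparseIndexSketch(int[] messageNumbers, int[] fileOffsets) {
      this.messageNumbers = messageNumbers;
      this.fileOffsets = fileOffsets;
   }

   // Returns the offset of the greatest indexed entry <= target, or -1 if none;
   // the caller scans forward from that offset to reach the exact message.
   int floorOffset(int target) {
      int lo = 0, hi = messageNumbers.length - 1, found = -1;
      while (lo <= hi) {
         int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow
         if (messageNumbers[mid] <= target) {
            found = fileOffsets[mid];
            lo = mid + 1;
         } else {
            hi = mid - 1;
         }
      }
      return found;
   }
}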
We also tested the single-subscriber case with the same setup.
The original:
consumer tps (11000 msg/s) and latency:
[image: orig_single_subscriber.png]
producer tps (30000 msg/s) and latency:
[image: orig_single_producer.png]
The PR:
consumer tps (14000 msg/s) and latency:
[image: pr_single_consumer.png]
producer tps (30000 msg/s) and latency:
[image: pr_single_producer.png]
The results are similar, with the PR even a little better in the single-subscriber case.

We used our internal test platform, and I think JMeter could also be used to test against it; a minimal client along those lines is sketched below.
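For reference, the shape of such a test client could be as simple as the sketch below; the broker URL, topic name, subscriber count and measurement window are assumptions for illustration, and a separate producer is expected to drive load into test-topic.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.Topic;
import java.util.concurrent.atomic.LongAdder;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public final class SubscriberLoadSketch {
   public static void main(String[] args) throws Exception {
      ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
      LongAdder received = new LongAdder();
      Connection conn = cf.createConnection();
      conn.start();
      // 100 multicast subscribers on one address, as in the tests above;
      // each consumer just counts ("echoes") what it receives.
      for (int i = 0; i < 100; i++) {
         Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
         Topic topic = session.createTopic("test-topic");
         MessageConsumer consumer = session.createConsumer(topic);
         consumer.setMessageListener(msg -> received.increment());
      }
      received.reset();
      long start = System.nanoTime();
      Thread.sleep(60_000); // measure for one minute
      double secs = (System.nanoTime() - start) / 1e9;
      System.out.printf("~%.0f msg/s across all subscribers%n", received.sum() / secs);
      conn.close();
   }
}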
