Hi Michael,

Thanks for the advice. For the current PR, we can use two arrays, one recording the message number and the other the corresponding offset, to optimize memory usage. For Franz's approach, we will also work on an early prototype implementation. After that, we will run some basic tests in different scenarios.
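For illustration, the two-array layout we have in mind is roughly the following (a minimal sketch with hypothetical names, not the actual PR code):

    import java.util.Arrays;

    // Two parallel primitive arrays instead of a map of boxed objects:
    // message numbers are kept sorted, so a binary search yields the offset.
    final class PageOffsetIndex {
        private final int[] messageNumbers; // sorted ascending
        private final int[] fileOffsets;    // an int is enough while pages stay < 2GB

        PageOffsetIndex(int[] messageNumbers, int[] fileOffsets) {
            this.messageNumbers = messageNumbers;
            this.fileOffsets = fileOffsets;
        }

        // File position of the given message, or -1 if it is not indexed.
        int offsetOf(int messageNumber) {
            int i = Arrays.binarySearch(messageNumbers, messageNumber);
            return i >= 0 ? fileOffsets[i] : -1;
        }
    }

Compared to a map of boxed Integers, this avoids per-entry node and boxing overhead, which should address the memory concern.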
<[email protected]> wrote on Tue, Jul 2, 2019 at 7:08 AM:

> The point, though, is that an extra index cache layer is needed. Its overhead means the total paged capacity will be more limited, since that overhead isn't just an extra int per reference. E.g. the current impl in the PR isn't very memory optimised; could an int array be used, or at worst an open primitive int-int hashmap?
>
> This is why I really prefer Franz's approach.
>
> Also, whatever we do, we need the new behaviour to be configurable, so that a use case we haven't thought about won't be impacted. The change should not be a surprise; it should be something you toggle on.
>
> On Mon, Jul 1, 2019 at 1:01 PM +0100, "yw yw" <[email protected]> wrote:
>
> Hi,
> We ran a test against your configuration: 5MB / 100 / 10MB.
> The current code: 7000 msg/s sent and 18000 msg/s received.
> The PR code: 8200 msg/s sent and 16000 msg/s received.
> As you said, performance improves for the current code when using much smaller page files and holding many more of them.
>
> We're not sure what implications smaller page files would have: producer performance may drop since file switching is more frequent, and the number of file handles would increase.
>
> While our consumers in this test just echo, doing nothing after receiving a message, consumers in the real world may be busy with business logic. That means references and page caches stay in memory longer and may be evicted more easily while producers keep sending.
>
> Since we don't know how many subscribers there will be, this is not a scalable approach: we can't shrink the page file without limit to fit the number of subscribers. The code should accommodate all kinds of configurations; we adjust configuration for trade-offs as needed, not as a workaround, IMO.
> In our company, ~200 queues (60% owned by a few addresses) are deployed on the broker. We can't set them all to e.g. 100 page caches (too much memory), nor set different sizes per address pattern (hard to operate). In our multi-tenant cluster we prefer availability, so to avoid exhausting memory we set pageSize to 30MB, max cache size to 1 and max size to 31MB. It's running well in one of our clusters now :)
>
> On Sat, Jun 29, 2019 at 2:35 AM:
>
> > I think some of that is down to configuration. You could configure paging to have much smaller page files but hold many more of them. That way the reference sizes will be far smaller and pages will drop in and out less. E.g. if you expect 100 pages being read, allow 100 to be held, but make the page sizes smaller so the overhead is far less.
> >
> > On Thu, Jun 27, 2019 at 11:10 AM +0100, "yw yw" wrote:
> >
> > "At last, for one message we may read twice: first we read the page and create the PageReference; second we re-query the message after its reference is removed."
> >
> > I just realized this was wrong: one message may be read many times. Think of this: when messages #1~#2000 are delivered, we need to depage #2001~#4000, reading the whole page; when #2001~#4000 are delivered, we need to depage #4001~#6000, reading the page again, and so on.
> >
> > One message may be read three times if we don't depage until all messages are delivered. For example, take 3 pages p1, p2, p3 and a message m1 in the top part of p2. In our case (max-size-bytes=51MB, a little bigger than the page size), the first depage round reads the bottom half of p1 and the top part of p2; the second depage round reads the bottom half of p2 and the top part of p3. Therefore p2 is read twice, and m1 may be read three times if re-queried.
> >
> > To be honest, I don't know how to fix the problem above with the decentralized approach. The point is not whether we rely on the OS cache; it's that we do it the wrong way: we shouldn't read a whole page (50MB) just for ~2000 messages. There is also no need to keep 51MB of PagedReferenceImpl in memory. When 100 queues occupy 5100MB of memory, the message references are very likely to be removed.
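> > What the index cache buys us is a positional read of just the message we need. A minimal sketch, assuming length-prefixed records (which is not necessarily the actual page file format; short-read handling omitted):
> >
> >     import java.io.IOException;
> >     import java.nio.ByteBuffer;
> >     import java.nio.channels.FileChannel;
> >
> >     final class SingleMessageReader {
> >         // Read one length-prefixed record at a known offset; the rest of
> >         // the 50MB page file is never touched.
> >         static ByteBuffer readOne(FileChannel page, long offset) throws IOException {
> >             ByteBuffer size = ByteBuffer.allocate(4);
> >             page.read(size, offset);             // positional read
> >             size.flip();
> >             ByteBuffer record = ByteBuffer.allocate(size.getInt());
> >             page.read(record, offset + 4);
> >             record.flip();
> >             return record;
> >         }
> >     }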
> > Francesco Nigro wrote on Thu, Jun 27, 2019 at 5:05 PM:
> >
> > > > which means the offset info is 100 times larger compared to the shared page index cache.
> > >
> > > I would check with the JOL plugin for exact numbers. I see with it that we would have an increase of 4 bytes for each PagedReferenceImpl with the totally decentralized approach vs a centralized one (the cache). In the economy of a fully loaded broker, if we care about scaling, we need to understand whether the memory trade-off is important enough to choose one of the two approaches.
> > > My point is that paging could be based entirely on the OS page cache if GC got in the middle, deleting any previous mechanism of page caching and simplifying the process as it is.
> > > Using a 2-level cache with such a centralized approach can work, but it will add a level of complexity that IMO could be saved...
> > > What do you think the benefit of the decentralized solution would be, compared with the one proposed in the PR?
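> > > To get exact numbers, something like this with JOL (just a sketch, assuming jol-core and the broker classes are on the classpath):
> > >
> > >     import org.openjdk.jol.info.ClassLayout;
> > >
> > >     public class RefSizeCheck {
> > >         public static void main(String[] args) {
> > >             // Prints field offsets, alignment padding and the shallow size,
> > >             // which is where an extra int per reference would show up.
> > >             System.out.println(ClassLayout.parseClass(PagedReferenceImpl.class).toPrintable());
> > >         }
> > >     }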
> > > On Thu, Jun 27, 2019 at 10:41 AM, yw yw wrote:
> > >
> > > > Sorry, I missed the PageReference part.
> > > >
> > > > The lifecycle of a PageReference is: depage (in intermediateMessageReferences) -> deliver (in messageReferences) -> waiting for ack (in deliveringRefs) -> removed. Every queue creates its own PageReference, which means the offset info is 100 times larger compared to the shared page index cache.
> > > > If we keep 51MB of PageReference in memory, as I said in the PR, "For multiple subscribers to the same address, just one executor is responsible for delivering, which means at any moment only one queue is delivering. Thus a queue may be stalled for a long time. We get queueMemorySize messages into memory, and when we deliver them after a long time, we probably need to query the message and read the page file again." So in the end one message may be read twice: first we read the page and create the PageReference; second we re-query the message after its reference is removed.
> > > >
> > > > With the shared page index cache design, each message only needs to be read from file once.
> > > >
> > > > Michael Pearce wrote on Thu, Jun 27, 2019 at 3:03 PM:
> > > >
> > > > > Hi,
> > > > >
> > > > > First of all, I think this is an excellent effort and could be a potentially massive positive change.
> > > > >
> > > > > Before making any change on such a scale, I do think we need to ensure we have sufficient benchmarks on a number of scenarios, not just one use case, and the benchmark tool used needs to be openly available so that others can verify the measurements and check them on their own setups.
> > > > >
> > > > > Some additional scenarios I would want/need covered are:
> > > > >
> > > > > - PageCache set to 5, and all consumers keeping up but lagging enough to be reading from the same 1st page cache; latency and throughput need to be measured for all.
> > > > > - PageCache set to 5, and all consumers but one keeping up (lagging enough to be reading from the same 1st page cache), with the one falling off the end and causing page cache swapping; measure latency and throughput of those keeping up in the 1st page cache, not caring about the one.
> > > > >
> > > > > As regards a solution, some alternative approaches to discuss:
> > > > >
> > > > > In your scenario, if I understand correctly, each subscriber effectively has its own queue (a 1-to-1 mapping), not shared. You mention Kafka and say multiple consumers don't read serially on the address, and this is true, but per-queue processing of messages (dispatch) is still serial, even with multiple shared consumers on a queue.
> > > > >
> > > > > What about keeping the existing mechanism but having each queue hold a reference to the page cache it is currently on, kept from GC (e.g. not soft)? That way the page cache isn't swapped around when you have queues (in your case subscribers) swapping page caches back and forth, avoiding the constant re-read issue.
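> > > > > Roughly along these lines (just a sketch; PageCache here is a stand-in for whatever type the cache really is):
> > > > >
> > > > >     // The provider keeps pages behind SoftReferences, so they can vanish
> > > > >     // under memory pressure; each queue additionally pins the page it is
> > > > >     // currently consuming with a strong reference.
> > > > >     final class QueuePagePin<PageCache> {
> > > > >         private PageCache currentPage; // strong ref while this queue reads the page
> > > > >
> > > > >         void moveTo(PageCache nextPage) {
> > > > >             currentPage = nextPage;    // the previous page becomes soft-only again
> > > > >         }
> > > > >
> > > > >         PageCache current() {
> > > > >             return currentPage;
> > > > >         }
> > > > >     }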
> > > > > Also, I think Franz had an excellent idea: do away with the page cache in its current form entirely, ensure the offset is kept with the reference, and rely on OS caching to keep hot blocks/data.
> > > > >
> > > > > Best,
> > > > > Michael
> > > > >
> > > > > On Thu, 27 Jun 2019 at 05:13, yw yw wrote:
> > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > This is the discussion about "ARTEMIS-2399 Fix performance degradation when there are a lot of subscribers".
> > > > > >
> > > > > > First, apologies that I didn't clarify our thoughts.
> > > > > >
> > > > > > As noted in the Environment section, page-max-cache-size is set to 1, meaning at most one page is allowed in the softValueCache. We have tested with the default page-max-cache-size of 5; it takes some time to see the performance degradation, since at the start the cursor positions of the 100 subscribers are similar and all message reads hit the softValueCache. But after some time the cursor positions diverge. When these positions span more than 5 pages, some pages are read back and forth. This can be seen in the trace log "adding pageCache pageNr=xxx into cursor = test-topic" in PageCursorProviderImpl, where some pages are read many times for the same subscriber. From that point on, the performance starts to degrade. So we set page-max-cache-size to 1 here just to make the test run faster; it doesn't change the final result.
> > > > > >
> > > > > > The softValueCache is cleared if memory is really low or the map size reaches capacity (default 5). In most cases the subscribers are tailing reads served by the softValueCache (no need to bother the disk), so we need to keep it. But when some subscribers fall behind, they need to read pages not in the softValueCache. After looking at the code, we found that one depage round follows at most MAX_SCHEDULED_RUNNERS deliver rounds in most situations; that is to say, at most MAX_DELIVERIES_IN_LOOP * MAX_SCHEDULED_RUNNERS messages will be depaged next. If you set the QueueImpl logger to debug level, you will see logs like "Queue Memory Size after depage on queue=sub4 is 53478769 with maxSize = 52428800. Depaged 68 messages, pendingDelivery=1002, intermediateMessageReferences=23162, queueDelivering=0". So to depage fewer than 2000 messages, each subscriber has to read a whole page, which is unnecessary and wasteful. In our test, where one page (50MB) contains ~40000 messages, one subscriber may read the page 40000/2000 = 20 times to finish delivering it if the softValueCache is evicted. This drastically slows down the process and burdens the disk. So we added the PageIndexCacheImpl and read one message at a time rather than all messages of a page. This way, each page is read only once per subscriber to finish delivering it.
> > > > > >
> > > > > > Having said that, the softValueCache is used for tailing reads. If it's evicted, it won't be reloaded, to prevent the issue illustrated above. Instead, the pageIndexCache is used.
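> > > > > > In code terms, the intended fallback is roughly this (hypothetical names, not the actual PR methods):
> > > > > >
> > > > > >     // Tailing subscribers hit the soft-value page cache; lagging
> > > > > >     // subscribers fall back to the index cache and read one message
> > > > > >     // from disk instead of reloading the whole 50MB page.
> > > > > >     PagedMessage getMessage(long pageNr, int messageNr) {
> > > > > >         PageCache cached = softValueCache.get(pageNr); // whole page in memory
> > > > > >         if (cached != null) {
> > > > > >             return cached.getMessage(messageNr);
> > > > > >         }
> > > > > >         int offset = pageIndexCache.offsetOf(pageNr, messageNr);
> > > > > >         return readSingleMessage(pageNr, offset);      // one record, not one page
> > > > > >     }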
> > > > > > Regarding implementation details, we noted that before delivering a page, a pageCursorInfo is constructed, which needs to read the whole page. We can take that opportunity to construct the pageIndexCache; it's very simple to code. We also thought of building an offset index file, but several concerns stemmed from the following:
> > > > > >
> > > > > > 1. When do we write and sync the index file? Would that have performance implications?
> > > > > > 2. If we have an index file, we can construct the pageCursorInfo from it (no need to read the page as before), but we need to write the total message number into it first. It seems a little weird to put this into the index file.
> > > > > > 3. After a hard crash, a recovery mechanism would be needed to recover the page and page index files, e.g. truncating to the valid size. So how do we know which files need to be sanity checked?
> > > > > > 4. A variant binary-search algorithm may be needed, see https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala (a sketch follows at the end of this mail).
> > > > > > 5. Unlike Kafka, where the user fetches many messages at once and the broker only needs to look up the start offset in the index file once, Artemis delivers messages one by one, which means we have to look up the index every time we deliver a message. Although the index file is probably in the page cache, there is still a chance we miss it.
> > > > > > 6. Compatibility with old files.
> > > > > >
> > > > > > To sum up, Kafka uses an mmapped index file and we use an index cache. Both are designed to find the physical file position from an offset (Kafka) or a message number (Artemis). We prefer the index cache because it's easier to understand and maintain.
> > > > > >
> > > > > > We also tested the single-subscriber case with the same setup.
> > > > > > The original code:
> > > > > > consumer tps (11000 msg/s) and latency: [image: orig_single_subscriber.png]
> > > > > > producer tps (30000 msg/s) and latency: [image: orig_single_producer.png]
> > > > > > The PR:
> > > > > > consumer tps (14000 msg/s) and latency: [image: pr_single_consumer.png]
> > > > > > producer tps (30000 msg/s) and latency: [image: pr_single_producer.png]
> > > > > > The result is similar, and even a little better, in the single-subscriber case.
> > > > > >
> > > > > > We used our internal test platform; I think JMeter can also be used to test against it.
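> > > > > > P.S. On concern 4, the variant lookup is a floor-style binary search like the one in Kafka's AbstractIndex (a minimal self-contained sketch, not Kafka's actual code):
> > > > > >
> > > > > >     // Find the largest indexed message number <= target, then scan
> > > > > >     // forward from its file offset to reach the exact record.
> > > > > >     static int floorEntry(int[] sortedKeys, int target) {
> > > > > >         int lo = 0, hi = sortedKeys.length - 1, found = -1;
> > > > > >         while (lo <= hi) {
> > > > > >             int mid = (lo + hi) >>> 1;
> > > > > >             if (sortedKeys[mid] <= target) {
> > > > > >                 found = mid;
> > > > > >                 lo = mid + 1;
> > > > > >             } else {
> > > > > >                 hi = mid - 1;
> > > > > >             }
> > > > > >         }
> > > > > >         return found; // -1 if target precedes the first entry
> > > > > >     }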
