+1 for having one per queue. Def a better idea than having to hold a cache.
On Fri, Jul 19, 2019 at 4:37 AM +0100, "Clebert Suconic" <[email protected]> wrote:

But the real problem here will be the number of open files. Each Page will have an open file, which will keep a lot of open files on the system. Correct?

I believe the impact of having the files move to the Subscription wouldn't be that much, and we would fix the problem. We wouldn't need a cache at all, as we just keep the file we need at the current cursor.

On Tue, Jul 16, 2019 at 10:40 PM, yw yw wrote:

I did consider the case where all pages are instantiated as PageReaders. That's really a problem.

The pro of the PR is that every page is read only once to build a PageReader, which is then shared by all the queues. The con is that many PageReaders may be instantiated if consumers make slow or no progress in several queues while being fast in others (I think that's the only cause leading to the corner case, right?). This means too many open files and too much memory.

The pro of a duplicated PageReader is that the number of PageReaders at any time is fixed, matching the number of queues. The con is that each queue has to read the page once to build its own PageReader if the page cache is evicted. I'm not sure how this will affect performance.

The point is that we need the number of messages in the page, which is used by PageCursorInfo and PageSubscription::internalGetNext, so we have to read the page file. How about we only cache the number of messages in each page instead of the PageReader, and build a PageReader in each queue? When we hit the corner case, only the (page, message count) pair data stays permanently in memory, which I assume is smaller than the completed PageCursorInfo data. This way we achieve the performance gain at a small price.

On Tue, Jul 16, 2019 at 10:18 PM, Clebert Suconic wrote:

I just came back after a well-deserved 2-week break and I was looking at this, and I can say: it's well done. Nice job! It's a lot simpler!

However, there's one question now, which is probably a further improvement: shouldn't the pageReader be instantiated at the PageSubscription?

That means, if there's no page cache, in case the page has been evicted, the Subscription would create a new Page/PageReader pair and dispose of it when it's done (meaning, when it has moved to a different page).

As you are solving the case with many subscriptions, wouldn't you hit a corner case where all Pages are instantiated as PageReaders?

I feel like it would be better to eventually duplicate a PageReader and close it when done.

Or did you already consider that possibility and still think it's best to keep this cache of PageReaders?

On Sat, Jul 13, 2019 at 12:15 AM:

Could a squashed PR be sent?

On Fri, Jul 12, 2019 at 2:23 PM +0100, "yw yw" wrote:

Hi,

I have finished work on the new implementation (not yet tests and configuration) as suggested by franz.

I put fileOffset in the PagePosition and added a new class PageReader, which is a wrapper around the page and implements the PageCache interface. The PageReader class is used to read the page file if the cache is evicted.

For details, see
https://github.com/wy96f/activemq-artemis/commit/3f388c2324738f01f53ce806b813220d28d40987

I deployed some tests; results below:
1. Running with 51MB page size and 1 page cache, with 100 multicast queues: https://filebin.net/wnyan7d2n1qgfsvg
2. Running with 5MB page size and 100 page caches, with 100 multicast queues: https://filebin.net/re0989vz7ib1c5mc
3. Running with 51MB page size and 1 page cache, with 1 queue: https://filebin.net/3qndct7f11qckrus

The results look good, similar to the implementation in the PR. Most importantly, the index cache data is removed, so no worry about extra overhead :)
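To make the idea concrete, here is a minimal sketch of what such a PageReader boils down to (names and record layout are illustrative assumptions; the real implementation is in the linked commit and wraps the actual Artemis Page/PageCache types):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch only: serve reads for an evicted page straight from the file,
// using the fileOffset now stored in each PagePosition.
class PageReaderSketch implements AutoCloseable {
    private final RandomAccessFile pageFile;

    PageReaderSketch(String path) throws IOException {
        pageFile = new RandomAccessFile(path, "r");
    }

    // Instead of keeping every decoded message in memory, seek to the
    // recorded offset and decode just the one message being delivered.
    byte[] getMessage(long fileOffset) throws IOException {
        pageFile.seek(fileOffset);
        int size = pageFile.readInt();   // assumes length-prefixed records
        byte[] data = new byte[size];
        pageFile.readFully(data);
        return data;
    }

    @Override
    public void close() throws IOException {
        pageFile.close();
    }
}
```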
On Thu, Jul 4, 2019 at 5:38 PM, yw yw wrote:

Hi, Michael,

Thanks for the advice. For the current PR, we can use two arrays, one recording the message number and the other the corresponding offset, to optimize memory usage. For franz's approach, we will also work on an early prototype implementation. After that, we will run some basic tests in different scenarios.

On Tue, Jul 2, 2019 at 7:08 AM:

The point, though, is that an extra index cache layer is needed. The overhead of that means the total paged capacity will be more limited, as that overhead isn't just an extra int per reference. E.g. in the PR the current impl isn't very memory-optimised; could an int array be used, or at worst an open-addressing primitive int-int hashmap?

This is why I really prefer franz's approach.

Also, whatever we do, we need the new behaviour to be configurable, so that a use case we haven't thought about won't be impacted. The change should not be a surprise; it should be something you toggle on.

On Mon, Jul 1, 2019 at 1:01 PM +0100, "yw yw" wrote:

Hi,

We ran a test against your configuration: 5MB page size, 100 page caches, 10MB max size.
The current code: 7000 msg/s sent and 18000 msg/s received.
PR code: 8200 msg/s sent and 16000 msg/s received.
As you said, performance improves for the current code by using a much smaller page file and holding many more of them.

I'm not sure what implications using a smaller page file would have: producer performance may drop since switching files is more frequent, and the number of file handles would increase?

While our consumer in this test just echoes, doing nothing after receiving a message, a consumer in the real world may be busy doing business logic. This means references and page caches reside in memory longer and may be evicted more easily while producers are sending all the time.

Since we don't know how many subscribers there are, it is not a scalable approach. We can't reduce the page file size without limit to fit the number of subscribers. The code should accommodate all kinds of configurations. We adjust configuration for trade-offs as needed; it shouldn't be a workaround, IMO.

In our company, ~200 queues (60% of them owned by a few addresses) are deployed on the broker. We can't set all of them to e.g. 100 page caches (too much memory), nor set different sizes per address pattern (hard for operations). In the multi-tenant cluster we prefer availability, so to avoid memory exhaustion we set pageSize to 30MB, max cache size to 1 and max size to 31MB. It's running well in one of our clusters now :)
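For reference, that setup corresponds roughly to address-settings like the following in broker.xml (values taken from the description above; the match pattern is illustrative):

```xml
<address-settings>
   <address-setting match="#">
      <!-- keep max-size-bytes slightly above page-size-bytes, as described -->
      <max-size-bytes>32505856</max-size-bytes>      <!-- 31MB -->
      <page-size-bytes>31457280</page-size-bytes>    <!-- 30MB -->
      <page-max-cache-size>1</page-max-cache-size>
      <address-full-policy>PAGE</address-full-policy>
   </address-setting>
</address-settings>
```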
On Sat, Jun 29, 2019 at 2:35 AM:

I think some of that is down to configuration. You could configure paging to have much smaller page files but hold many more of them. That way the reference sizes will be far smaller, and pages dropping in and out would cost less. E.g. if you expect 100 being read, make it 100, but make the page sizes smaller so the overhead is far less.

On Thu, Jun 27, 2019 at 11:10 AM +0100, "yw yw" wrote:

"At last for one message we maybe read twice: first we read the page and create the page reference; second we re-query the message after its reference is removed."

I just realized that was wrong. One message may be read many times. Think of this: when messages #1~#2000 are delivered, we need to depage #2001~#4000, reading the whole page; when #2001~#4000 are delivered, we need to depage #4001~#6000, reading the page again, and so on.

One message may be read three times if we don't depage until all messages are delivered. For example, we have three pages p1, p2, p3 and a message m1 in the top part of p2. In our case (max-size-bytes=51MB, a little bigger than the page size), the first depage round reads the bottom half of p1 and the top part of p2; the second depage round reads the bottom half of p2 and the top part of p3. Therefore p2 is read twice, and m1 may be read three times if re-queried.

To be honest, I don't know how to fix the problem above with the decentralized approach. The point is not how much we rely on the OS cache; it's that we do it the wrong way: we shouldn't read a whole page (50MB) just for ~2000 messages. Also, there is no need to keep 51MB of PagedReferenceImpl in memory. When 100 queues occupy 5100MB of memory, the message references are very likely to be removed.

On Thu, Jun 27, 2019 at 5:05 PM, Francesco Nigro wrote:

> which means the offset info is 100 times larger compared to the shared page index cache.

I would check with the JOL plugin for exact numbers. I see with it that we would have an increase of 4 bytes for each PagedReferenceImpl, totally decentralized, vs a centralized approach (the cache). In the economy of a fully loaded broker, if we care about scaling, we need to understand if the memory tradeoff is important enough to choose one of the two approaches.

My point is that paging could be made totally based on the OS page cache, with GC getting in the middle, deleting any previous mechanism of page caching and simplifying the process as it is. Using a two-level cache with such a centralized approach can work, but it will add a level of complexity that IMO could be saved.

What do you think the benefit of the decentralized solution would be, compared with the one proposed in the PR?
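A quick way to check such numbers with JOL, as suggested above (the nested class below is a hypothetical stand-in for PagedReferenceImpl, purely to show the usage; it is not the real class):

```java
// Requires the org.openjdk.jol:jol-core dependency.
import org.openjdk.jol.info.ClassLayout;

public class RefSizeCheck {
    // Hypothetical stand-in: a few fields plus the extra int offset under discussion.
    static class PagedRefExample {
        long pagePosition;
        int messageNr;
        int fileOffset; // the extra 4 bytes per reference
    }

    public static void main(String[] args) {
        // Prints the field layout, padding, and total instance size.
        System.out.println(ClassLayout.parseClass(PagedRefExample.class).toPrintable());
    }
}
```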
On Thu, Jun 27, 2019 at 10:41 AM, yw yw wrote:

Sorry, I missed the PageReference part.

The lifecycle of a PageReference is: depage (in intermediateMessageReferences) -> deliver (in messageReferences) -> waiting for ack (in deliveringRefs) -> removed. Every queue creates its own PageReference, which means the offset info is 100 times larger compared to the shared page index cache. If we keep 51MB of pageReferences in memory, as I said in the PR: "For multiple subscribers to the same address, just one executor is responsible for delivering, which means at the same moment only one queue is delivering. Thus a queue may be stalled for a long time. We get queueMemorySize messages into memory, and when we deliver these after a long time, we probably need to query the message and read the page file again." In the end, one message may be read twice: first we read the page and create the page reference; second we re-query the message after its reference is removed.

For the shared page index cache design, each message needs to be read from file only once.

On Thu, Jun 27, 2019 at 3:03 PM, Michael Pearce wrote:

Hi,

First of all, I think this is an excellent effort, and could be a potentially massive positive change.

Before making any change on such a scale, I do think we need to ensure we have sufficient benchmarks on a number of scenarios, not just one use case, and the benchmark tool used needs to be openly available so that others can verify the measurements and check them on their setups.

Some additional scenarios I would want/need covered are:

1. PageCache set to 5, and all consumers keeping up, but lagging enough to be reading from the same 1st page cache; latency and throughput need to be measured for all.
2. PageCache set to 5, and all consumers but one keeping up, lagging enough to be reading from the same 1st page cache, with the one falling off the end and causing page cache swapping; measure latency and throughput of those keeping up in the 1st page cache, disregarding the one.

Regarding the solution, some alternative approaches to discuss:

In your scenario, if I understand correctly, each subscriber effectively has its own queue (1-to-1 mapping), not shared. You mention Kafka and say multiple consumers don't read serially on the address, and this is true, but per-queue processing through messages (dispatch) is still serial, even with multiple shared consumers on a queue.

What about keeping the existing mechanism but having a queue hold a reference to the page cache that the queue is currently on, kept from GC (e.g. not soft)? That way the page cache isn't being swapped around when you have queues (in your case subscribers) swapping page caches back and forth, avoiding the constant re-read issue. (See the sketch after this message.)

Also, I think Franz had an excellent idea: do away with the page cache in its current form entirely, ensure the offset is kept with the reference, and rely on the OS cache keeping hot blocks/data.

Best,
Michael
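A minimal sketch of the pinning idea above, under the assumption that a queue only ever reads one page at a time (all names are illustrative, not Artemis API): the queue keeps a strong reference to its current page cache, so that page cannot be collected underneath it, and it only re-reads when crossing a page boundary.

```java
class PinnedPageCursor {
    interface PageCacheView {          // stand-in for one loaded, decoded page
        long pageNr();
        Object getMessage(int messageNr);
    }

    interface PageLoader {             // stand-in for the cursor provider
        PageCacheView load(long pageNr);
    }

    private final PageLoader loader;
    private PageCacheView pinned;      // strong ref: not soft, so GC can't clear it

    PinnedPageCursor(PageLoader loader) {
        this.loader = loader;
    }

    Object next(long pageNr, int messageNr) {
        if (pinned == null || pinned.pageNr() != pageNr) {
            pinned = loader.load(pageNr); // re-read only when moving to another page
        }
        return pinned.getMessage(messageNr);
    }
}
```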
On Thu, 27 Jun 2019 at 05:13, yw yw wrote:

Hi, folks,

This is the discussion about "ARTEMIS-2399 Fix performance degradation when there are a lot of subscribers".

First, apologies that I didn't clarify our thoughts.

As noted in the Environment section, page-max-cache-size is set to 1, meaning at most one page is allowed in the softValueCache. We have tested with the default page-max-cache-size of 5; it takes some time to see the performance degradation, since at the start the cursor positions of the 100 subscribers are similar and all message reads hit the softValueCache. But after some time, the cursor positions diverge. Once these positions span more than 5 pages, some pages are read back and forth. This can be seen in the trace log "adding pageCache pageNr=xxx into cursor = test-topic" in PageCursorProviderImpl, where some pages are read many times for the same subscriber. From that point on, performance starts to degrade. So we set page-max-cache-size to 1 here just to make the test run faster; it doesn't change the final result.
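To illustrate why a soft-value cache thrashes in this situation, here is a minimal sketch using plain java.lang.ref (an assumption-laden simplification, not the actual Artemis SoftValueHashMap): once a soft reference is cleared under memory pressure, every miss triggers a full page re-read from disk.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

class SoftPageCache<P> {
    private final Map<Long, SoftReference<P>> pages = new HashMap<>();

    P get(long pageNr, LongFunction<P> readFromDisk) {
        SoftReference<P> ref = pages.get(pageNr);
        P page = (ref == null) ? null : ref.get();
        if (page == null) {                        // never loaded, or cleared by GC
            page = readFromDisk.apply(pageNr);     // whole page re-read from disk
            pages.put(pageNr, new SoftReference<>(page));
        }
        return page;
    }
}
```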
The softValueCache is cleared if memory is really low, or when the map size exceeds its capacity (default 5). In most cases the subscribers are doing tailing reads, which are served by the softValueCache (no need to touch disk), so we need to keep it. But when some subscribers fall behind, they need to read pages no longer in the softValueCache. After looking at the code, we found that one depage round follows at most MAX_SCHEDULED_RUNNERS delivery rounds in most situations; that is to say, at most MAX_DELIVERIES_IN_LOOP * MAX_SCHEDULED_RUNNERS messages are depaged next. If you set the QueueImpl logger to debug level, you will see logs like "Queue Memory Size after depage on queue=sub4 is 53478769 with maxSize = 52428800. Depaged 68 messages, pendingDelivery=1002, intermediateMessageReferences=23162, queueDelivering=0". In order to depage fewer than 2000 messages, each subscriber has to read a whole page, which is unnecessary and wasteful. In our test, where one page (50MB) contains ~40000 messages, one subscriber may read the page 40000/2000 = 20 times if the softValueCache is evicted before it finishes delivering it. This has drastically slowed down the process and burdened the disk. So we added PageIndexCacheImpl and read one message at a time rather than all messages in the page. This way, for each subscriber, each page is read only once to finish delivering it.
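A minimal sketch of the message-number-to-offset idea behind PageIndexCacheImpl (illustrative names; it assumes length-prefixed records in the page file, and that the offset array is built while pageCursorInfo scans the page, as described below): instead of decoding the whole 50MB page, seek to the recorded offset and read one message.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class PageIndexSketch {
    private final int[] offsets;   // offsets[i] = file offset of message i in the page

    PageIndexSketch(int[] offsets) {
        this.offsets = offsets;
    }

    // Read a single message's bytes instead of the whole page.
    byte[] readMessage(RandomAccessFile pageFile, int messageNr) throws IOException {
        pageFile.seek(offsets[messageNr]);
        int size = pageFile.readInt();   // assumed record layout: [size][bytes]
        byte[] body = new byte[size];
        pageFile.readFully(body);
        return body;
    }
}
```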
Having said that, the softValueCache is used for tailing reads. If it's evicted, it won't be reloaded, to prevent the issue illustrated above; the pageIndexCache is used instead.

Regarding implementation details, we noted that before delivering a page, a pageCursorInfo is constructed, which needs to read the whole page. We can take that opportunity to construct the pageIndexCache; it's very simple to code. We also thought about building an offset index file, with some concerns stemming from the following:

1. When to write and sync the index file? Would that have performance implications?
2. If we have an index file, we can construct the pageCursorInfo through it (no need to read the page like before), but we need to write the total message number into it first. It seems a little weird to put this into the index file.
3. If we experience a hard crash, a recovery mechanism would be needed to recover the page and page index files, e.g. truncating to the valid size. So how do we know which files need to be sanity checked?
4. A variant binary search algorithm may be needed; see https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala.
5. Unlike Kafka, where the user fetches lots of messages at once and the broker just needs to look up the start offset in the index file once, Artemis delivers messages one by one, which means we have to look up the index every time we deliver a message. Although the index file is probably in the page cache, there are still chances we miss the cache.
6. Compatibility with old files.

To sum up: Kafka uses an mmapped index file and we use an index cache. Both are designed to find the physical file position from an offset (Kafka) or a message number (Artemis). We prefer the index cache because it's easier to understand and maintain.

We also tested the one-subscriber case with the same setup.
The original:
consumer tps (11000 msg/s) and latency: [image: orig_single_subscriber.png]
producer tps (30000 msg/s) and latency: [image: orig_single_producer.png]
The PR:
consumer tps (14000 msg/s) and latency: [image: pr_single_consumer.png]
producer tps (30000 msg/s) and latency: [image: pr_single_producer.png]
The results are similar, and even a little better in the single-subscriber case.

We used our internal test platform; I think JMeter could also be used to test it.
