Hi Giedrius,

thanks for looking at this! Before I submit the fix PRs (plural, see below), I'd like some feedback on the intended implementation. I don't see a simple fix that could be contained in Pushgateway alone: it seems that some changes to Prometheus code (prometheus/client_golang, to be specific) are required in addition to modifying Pushgateway.
The current logic for the duplicates check lives in client_golang, in the Gather() method:
https://github.com/prometheus/client_golang/blob/aea1a5996a9d8119592baea7310810c65dc598f5/prometheus/registry.go#L424

Unfortunately, this API can only take a whole set of metrics and answer whether that set is consistent. It does so by calculating hashes for every metric in the set. Pushgateway runs this check on every push, so each push rehashes all metrics already stored, and adding n metrics over time costs O(n^2) in total.

Pushgateway keeps a dynamic set of metrics and needs to keep track of its consistency, and that cannot be done efficiently with the current Gather() API. So my plan is to:

* factor out the consistency logic from client_golang's Gather() so that it also works for a dynamically changing set of metrics,
* basically, expose a data structure which keeps the set of metrics together with their hashes,
* use this logic in Pushgateway.

The alternative is a Pushgateway-only fix, but that would require duplicating the consistency-check logic, so it's probably worse than the fix above.

Does that sound sensible?

--
Rafał

On Mon, Jan 6, 2025 at 1:08 PM Giedrius Statkevičius <giedriusw...@gmail.com> wrote:

> Hello,
>
> I'm not "a Prometheus dev" but this is something that I am interested in.
> Could you open up a PR with the benchmark and the fix? I'll help out with
> reviewing.
>
> Thanks,
> Giedrius
>
> On Saturday, 4 January 2025 at 09:52:01 UTC+2 Rafał Dowgird wrote:
>
>> Dear and Esteemed Prometheus developers,
>>
>> I'd like to discuss with you a performance problem with Pushgateway,
>> namely that the complexity of adding n metrics might get quadratic
>> (O(n^2)). Details follow.
>>
>> We have a mixed push/scrape system where Pushgateway handles some of the
>> metrics which come from batch jobs. While migrating some jobs to
>> Pushgateway we hit a performance bottleneck. We worked around this by
>> sharding Pushgateway. Still, the sharded setup is more complex and the
>> amount of data wasn't that big, so we investigated the Pushgateway side of
>> things.
>>
>> It seems that the root of the problem is that every push operation causes
>> recalculation of hashes for all metrics already existing in the database.
>> This is how the consistency check logic works at present.
>>
>> I have created a simple benchmark to isolate/demonstrate the problem:
>> https://github.com/dowgird/pushgateway/commit/e0629ecb999c2f22cf098c87c78fc71cd0414733
>>
>> The output demonstrates that subsequent pushes of metrics get linearly
>> slower (the diff column, i.e. the time taken by the last 100 iterations,
>> keeps growing):
>>
>> I: 100 elapsed:220.379138ms diff:220.379138ms
>> I: 200 elapsed:505.576881ms diff:285.197743ms
>> I: 300 elapsed:841.153205ms diff:335.576324ms
>> .
>> .
>> .
>> I: 2700 elapsed:21.806380441s diff:1.391117119s
>> I: 2800 elapsed:23.229272852s diff:1.422892411s
>> I: 2900 elapsed:24.674250223s diff:1.444977371s
>>
>> A possible fix doesn't look very complicated algorithmically (memoizing
>> the hashes should work). Code-wise it's a bit more complex, which is part
>> of why I'm writing this message. I can contribute the fix, but it would
>> require some discussion of the client API.
>>
>> The other part is that I understand from the documentation and from
>> discussions on GitHub issues that Pushgateway is not meant to be
>> high-performance. That said, I still think it would be beneficial to remove
>> this particular performance bottleneck: other people seem to be hitting it
>> (https://github.com/prometheus/pushgateway/issues/643 might be caused by
>> this).
>>
>> Would you be open to accepting a fix for this issue?
>>
>> --
>> Rafał
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to prometheus-developers+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/prometheus-developers/b53b5c14-a4ea-4c21-8a53-aeb1cb0a6036n%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-developers/b53b5c14-a4ea-4c21-8a53-aeb1cb0a6036n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
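To make the plan above more concrete (keep the set of metrics together with their memoized hashes, so that a push only hashes what it adds), here is a minimal Go sketch. It is hypothetical throughout: Series, seriesHash, HashIndex and Replace are illustrative names, not existing client_golang or Pushgateway APIs. It assumes PUT-like semantics where a push replaces all series of one group, and it models only the duplicate-series part of the consistency check; the other checks Gather() performs (for example, consistent types within a metric family) are left out. The FNV-1a hashing over the name and sorted label pairs is an assumption chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Series is a hypothetical, simplified stand-in for one pushed series:
// a metric name plus its label pairs. It is not a client_golang type.
type Series struct {
	Name   string
	Labels map[string]string
}

// seriesHash hashes the metric name and the label pairs (sorted by label
// name) with FNV-1a, using 0xff as a field separator. This only resembles
// the idea behind Gather()'s duplicate check; it is not a copy of that code.
func seriesHash(s Series) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s.Name))
	h.Write([]byte{0xff})
	names := make([]string, 0, len(s.Labels))
	for n := range s.Labels {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		h.Write([]byte(n))
		h.Write([]byte{0xff})
		h.Write([]byte(s.Labels[n]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

// HashIndex is a hypothetical incremental duplicate checker. It memoizes the
// hash of every stored series and records which group owns it, so a push
// only hashes the series it carries, never the whole store.
type HashIndex struct {
	owner  map[uint64]string   // series hash -> grouping key that owns it
	groups map[string][]uint64 // grouping key -> hashes of its series
	Hashed int                 // seriesHash calls so far, for illustration
}

// NewHashIndex returns an empty index.
func NewHashIndex() *HashIndex {
	return &HashIndex{owner: map[uint64]string{}, groups: map[string][]uint64{}}
}

// Replace swaps in the series of one group (PUT-like semantics). It returns
// an error, leaving the stored series unchanged, if a series appears twice
// in the push or is already owned by a different group.
func (x *HashIndex) Replace(group string, series []Series) error {
	seen := make(map[uint64]bool, len(series))
	hashes := make([]uint64, 0, len(series))
	for _, s := range series {
		h := seriesHash(s)
		x.Hashed++
		if seen[h] {
			return fmt.Errorf("series %q appears twice in one push", s.Name)
		}
		if g, ok := x.owner[h]; ok && g != group {
			return fmt.Errorf("series %q already pushed by group %q", s.Name, g)
		}
		seen[h] = true
		hashes = append(hashes, h)
	}
	// Cost is proportional to this group's old and new size only.
	for _, h := range x.groups[group] {
		delete(x.owner, h)
	}
	for _, h := range hashes {
		x.owner[h] = group
	}
	x.groups[group] = hashes
	return nil
}

func main() {
	idx := NewHashIndex()
	for i := 0; i < 1000; i++ {
		group := fmt.Sprintf("job%d", i)
		s := Series{Name: "batch_duration_seconds", Labels: map[string]string{"job": group}}
		if err := idx.Replace(group, []Series{s}); err != nil {
			panic(err)
		}
	}
	// One hash per pushed series, instead of 1+2+...+1000 = 500500 if every
	// push rehashed all series already stored.
	fmt.Println("hashes computed:", idx.Hashed)
}
```

Running it prints "hashes computed: 1000", one hash per pushed series. Keeping an owner per hash is the key design choice here: a replacing push can drop its group's old entries in time proportional to that group's size instead of rebuilding the whole set. As with any hash-only check, a 64-bit collision would be reported as a false duplicate.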