On Tuesday, April 21, 2020 at 8:59:12 PM UTC+5:30, Brian Brazil wrote: > > On Tue, 21 Apr 2020 at 15:17, 'GANESH VERNEKAR' via Prometheus Developers < > [email protected] <javascript:>> wrote: > >> Hi All, >> >> I would like to propose a callback for the TSDB for series creation and >> deletion. This would help external projects like Cortex and Thanos in >> implementing limits on series while having no performance impact on >> Prometheus with minimal code addition. >> >> The callback would look something like this (names TBD) with no >> performance impact on TSDB. >> >> >> type SeriesLifecycleCallback interface { >> >> PreCreationCallback(labels.Labels) error >> >> PostCreationCallback(labels.Labels) >> >> PostDeletionCallback(labels.Labels) >> >> } >> >> *1. SeriesLifecycleCallback* >> >> It would be a part of DB.Options >> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/db.go#L73-L116> >> >> *2. PreCreationCallback* >> >> It would be called just before holding the lock here >> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1654>. >> >> And if the method returns an error, it means we don’t want to create a new >> series. >> >> Further the hashes would be checked as usual >> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1656-L1659>, >> >> and if we reach the step of adding a new hash >> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1660>, >> >> it would be either executed or skipped based on the result of >> `PreCreationCallback`. >> >> If the series creation was skipped, the error returned by >> `PreCreationCallback` would be propagated back into the call stack and the >> `Add` or `AddFast` method would finally return back the error. (This means >> additional return values for the methods in that call stack). >> >> *3. PostCreationCallback* >> >> This would be called after releasing the lock here >> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1661> >> >> letting the caller know that the series was created. >> >> *4. PostDeletionCallback * >> >> >> It would be called after deleting a series during garbage collection here >> (after unlocking) >> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1625> >> >> so that external projects can do the bookkeeping. >> >> ---- >> >> In the context of Prometheus, the interface would be a noop, hence no >> performance impact. >> > > It'd still cost a few cycles. >
But it would be very less to even notice. > >> >> *Some reasoning behind the design decisions and what guarantees it >> provides:* >> >> The idea was to avoid having the callbacks inside the locked section >> which could lead to performance degradation. Hence the Series creation >> callback was split into 2 methods. PreCreationCallback method tells >> whether to create the series if we need to create and the next method >> `PostCreationCallback` to let the user know if it was created, so that >> appropriate bookkeeping can be done. >> >> Since all the callbacks happen outside the locks (even if it was inside >> the lock, there could be concurrent calls for different shards of the >> stripe series), the user can expect some soft consistencies here. For >> example, the PreCreationCallback could be called concurrently by >> multiple series creation when you are at the edge of the series limit and >> all the series creation would be allowed, which might cause the limit to >> cross by a small number. Similarly during the deletion phase: some creation >> of series would fail as the deletion callback is called a little later. >> Hence the user is required to expect soft consistencies while using these >> callbacks. >> >> What do you all think about this addition in TSDB? >> > > This seems to require adding error handling for cases where Prometheus > cannot have errors, what happens to a scrape or rule evaluation when this > fails? > The Add <https://github.com/prometheus/prometheus/blob/4b5e7d4984a135abd3729b9a2a5904336d35bc21/tsdb/head.go#L931> and AddFast <https://github.com/prometheus/prometheus/blob/4b5e7d4984a135abd3729b9a2a5904336d35bc21/tsdb/head.go#L957> methods already returns an error. So failure in series creation would follow the same procedure. But thinking about it more, it will be a no-op for Prometheus, and I don't think the external users who would use this feature will have scraping in it (considering it as a "TSDB" package and not "Prometheus" package). > I'm also not sure it's considering that the semantics of how this is all > managed inside the tsdb is not what you might expect. For example a failed > scrape due to hitting the sample limit can still cause series to be > created, which usually doesn't come up but might now. > Yes, all the changes in TSDB are not totally explored yet, I will work on a draft to see what all needs changing. Additionally, series is still created currently if the appender Commit() fails (a failed scrape?). > -- > Brian Brazil > www.robustperception.io > -- You received this message because you are subscribed to the Google Groups "Prometheus Developers" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/47ae1780-e4f6-4321-93a5-01a0863c793a%40googlegroups.com.

