On Tue, 21 Apr 2020 at 16:48, 'GANESH VERNEKAR' via Prometheus Developers < [email protected]> wrote:
> > > On Tuesday, April 21, 2020 at 8:59:12 PM UTC+5:30, Brian Brazil wrote: >> >> On Tue, 21 Apr 2020 at 15:17, 'GANESH VERNEKAR' via Prometheus Developers >> <[email protected]> wrote: >> >>> Hi All, >>> >>> I would like to propose a callback for the TSDB for series creation and >>> deletion. This would help external projects like Cortex and Thanos in >>> implementing limits on series while having no performance impact on >>> Prometheus with minimal code addition. >>> >>> The callback would look something like this (names TBD) with no >>> performance impact on TSDB. >>> >>> >>> type SeriesLifecycleCallback interface { >>> >>> PreCreationCallback(labels.Labels) error >>> >>> PostCreationCallback(labels.Labels) >>> >>> PostDeletionCallback(labels.Labels) >>> >>> } >>> >>> *1. SeriesLifecycleCallback* >>> >>> It would be a part of DB.Options >>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/db.go#L73-L116> >>> >>> *2. PreCreationCallback* >>> >>> It would be called just before holding the lock here >>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1654>. >>> And if the method returns an error, it means we don’t want to create a new >>> series. >>> >>> Further the hashes would be checked as usual >>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1656-L1659>, >>> and if we reach the step of adding a new hash >>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1660>, >>> it would be either executed or skipped based on the result of >>> `PreCreationCallback`. >>> >>> If the series creation was skipped, the error returned by >>> `PreCreationCallback` would be propagated back into the call stack and the >>> `Add` or `AddFast` method would finally return back the error. (This means >>> additional return values for the methods in that call stack). >>> >>> *3. PostCreationCallback* >>> >>> This would be called after releasing the lock here >>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1661> >>> letting the caller know that the series was created. >>> >>> *4. PostDeletionCallback * >>> >>> >>> It would be called after deleting a series during garbage collection here >>> (after unlocking) >>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1625> >>> so that external projects can do the bookkeeping. >>> >>> ---- >>> >>> In the context of Prometheus, the interface would be a noop, hence no >>> performance impact. >>> >> >> It'd still cost a few cycles. >> > > But it would be very less to even notice. > > >> >>> >>> *Some reasoning behind the design decisions and what guarantees it >>> provides:* >>> >>> The idea was to avoid having the callbacks inside the locked section >>> which could lead to performance degradation. Hence the Series creation >>> callback was split into 2 methods. PreCreationCallback method tells >>> whether to create the series if we need to create and the next method >>> `PostCreationCallback` to let the user know if it was created, so that >>> appropriate bookkeeping can be done. >>> >>> Since all the callbacks happen outside the locks (even if it was inside >>> the lock, there could be concurrent calls for different shards of the >>> stripe series), the user can expect some soft consistencies here. For >>> example, the PreCreationCallback could be called concurrently by >>> multiple series creation when you are at the edge of the series limit and >>> all the series creation would be allowed, which might cause the limit to >>> cross by a small number. Similarly during the deletion phase: some creation >>> of series would fail as the deletion callback is called a little later. >>> Hence the user is required to expect soft consistencies while using these >>> callbacks. >>> >>> What do you all think about this addition in TSDB? >>> >> >> This seems to require adding error handling for cases where Prometheus >> cannot have errors, what happens to a scrape or rule evaluation when this >> fails? >> > > The Add > <https://github.com/prometheus/prometheus/blob/4b5e7d4984a135abd3729b9a2a5904336d35bc21/tsdb/head.go#L931> > and AddFast > <https://github.com/prometheus/prometheus/blob/4b5e7d4984a135abd3729b9a2a5904336d35bc21/tsdb/head.go#L957> > methods already returns an error. So failure in series creation would > follow the same procedure. But thinking about it more, it will be a no-op > for Prometheus, and I don't think the external users who would use this > feature will have scraping in it (considering it as a "TSDB" package and > not "Prometheus" package). > > >> I'm also not sure it's considering that the semantics of how this is all >> managed inside the tsdb is not what you might expect. For example a failed >> scrape due to hitting the sample limit can still cause series to be >> created, which usually doesn't come up but might now. >> > > Yes, all the changes in TSDB are not totally explored yet, I will work on > a draft to see what all needs changing. > > Additionally, series is still created currently if the appender Commit() > fails (a failed scrape?). > A failed scrape would be a Rollback. A WAL error is what would cause a Commit to fail, which is likely a full filesystem. -- Brian Brazil www.robustperception.io -- You received this message because you are subscribed to the Google Groups "Prometheus Developers" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/CAHJKeLo0p91ezU-yXE6S1%2BP1V%3D2ZSvUUryFbWgZRMRjhW0p06Q%40mail.gmail.com.

