On Tue, 21 Apr 2020 at 16:48, 'GANESH VERNEKAR' via Prometheus Developers <
[email protected]> wrote:

>
>
> On Tuesday, April 21, 2020 at 8:59:12 PM UTC+5:30, Brian Brazil wrote:
>>
>> On Tue, 21 Apr 2020 at 15:17, 'GANESH VERNEKAR' via Prometheus Developers
>> <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> I would like to propose a callback for the TSDB for series creation and
>>> deletion. This would help external projects like Cortex and Thanos in
>>> implementing limits on series while having no performance impact on
>>> Prometheus with minimal code addition.
>>>
>>> The callback would look something like this (names TBD) with no
>>> performance impact on TSDB.
>>>
>>>
>>> type SeriesLifecycleCallback interface {
>>>
>>> PreCreationCallback(labels.Labels) error
>>>
>>> PostCreationCallback(labels.Labels)
>>>
>>> PostDeletionCallback(labels.Labels)
>>>
>>> }
>>>
>>> *1. SeriesLifecycleCallback*
>>>
>>> It would be a part of DB.Options
>>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/db.go#L73-L116>
>>>
>>> *2. PreCreationCallback*
>>>
>>> It would be called just before holding the lock here
>>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1654>.
>>> And if the method returns an error, it means we don’t want to create a new
>>> series.
>>>
>>> Further the hashes would be checked as usual
>>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1656-L1659>,
>>> and if we reach the step of adding a new hash
>>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1660>,
>>> it would be either executed or skipped based on the result of
>>> `PreCreationCallback`.
>>>
>>> If the series creation was skipped, the error returned by
>>> `PreCreationCallback` would be propagated back into the call stack and the
>>> `Add` or `AddFast` method would finally return back the error. (This means
>>> additional return values for the methods in that call stack).
>>>
>>> *3. PostCreationCallback*
>>>
>>> This would be called after releasing the lock here
>>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1661>
>>> letting the caller know that the series was created.
>>>
>>> *4. PostDeletionCallback *
>>>
>>>
>>> It would be called after deleting a series during garbage collection here
>>> (after unlocking)
>>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1625>
>>> so that external projects can do the bookkeeping.
>>>
>>> ----
>>>
>>> In the context of Prometheus, the interface would be a noop, hence no
>>> performance impact.
>>>
>>
>> It'd still cost a few cycles.
>>
>
> But it would be very less to even notice.
>
>
>>
>>>
>>> *Some reasoning behind the design decisions and what guarantees it
>>> provides:*
>>>
>>> The idea was to avoid having the callbacks inside the locked section
>>> which could lead to performance degradation. Hence the Series creation
>>> callback was split into 2 methods. PreCreationCallback method tells
>>> whether to create the series if we need to create and the next method
>>> `PostCreationCallback` to let the user know if it was created, so that
>>> appropriate bookkeeping can be done.
>>>
>>> Since all the callbacks happen outside the locks (even if it was inside
>>> the lock, there could be concurrent calls for different shards of the
>>> stripe series), the user can expect some soft consistencies here. For
>>> example, the PreCreationCallback could be called concurrently by
>>> multiple series creation when you are at the edge of the series limit and
>>> all the series creation would be allowed, which might cause the limit to
>>> cross by a small number. Similarly during the deletion phase: some creation
>>> of series would fail as the deletion callback is called a little later.
>>> Hence the user is required to expect soft consistencies while using these
>>> callbacks.
>>>
>>> What do you all think about this addition in TSDB?
>>>
>>
>> This seems to require adding error handling for cases where Prometheus
>> cannot have errors, what happens to a scrape or rule evaluation when this
>> fails?
>>
>
> The Add
> <https://github.com/prometheus/prometheus/blob/4b5e7d4984a135abd3729b9a2a5904336d35bc21/tsdb/head.go#L931>
> and AddFast
> <https://github.com/prometheus/prometheus/blob/4b5e7d4984a135abd3729b9a2a5904336d35bc21/tsdb/head.go#L957>
> methods already returns an error. So failure in series creation would
> follow the same procedure. But thinking about it more, it will be a no-op
> for Prometheus, and I don't think the external users who would use this
> feature will have scraping in it (considering it as a "TSDB" package and
> not "Prometheus" package).
>
>
>> I'm also not sure it's considering that the semantics of how this is all
>> managed inside the tsdb is not what you might expect. For example a failed
>> scrape due to hitting the sample limit can still cause series to be
>> created, which usually doesn't come up but might now.
>>
>
> Yes, all the changes in TSDB are not totally explored yet, I will work on
> a draft to see what all needs changing.
>



> Additionally, series is still created currently if the appender Commit()
> fails (a failed scrape?).
>

A failed scrape would be a Rollback. A WAL error is what would cause a
Commit to fail, which is likely a full filesystem.

-- 
Brian Brazil
www.robustperception.io

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-developers/CAHJKeLo0p91ezU-yXE6S1%2BP1V%3D2ZSvUUryFbWgZRMRjhW0p06Q%40mail.gmail.com.

Reply via email to