On Tuesday, April 21, 2020 at 8:59:12 PM UTC+5:30, Brian Brazil wrote:
>
> On Tue, 21 Apr 2020 at 15:17, 'GANESH VERNEKAR' via Prometheus Developers <
> [email protected] <javascript:>> wrote:
>
>> Hi All,
>>
>> I would like to propose a callback for the TSDB for series creation and 
>> deletion. This would help external projects like Cortex and Thanos in 
>> implementing limits on series while having no performance impact on 
>> Prometheus with minimal code addition.
>>
>> The callback would look something like this (names TBD) with no 
>> performance impact on TSDB.
>>
>>
>> type SeriesLifecycleCallback interface {
>>
>> PreCreationCallback(labels.Labels) error
>>
>> PostCreationCallback(labels.Labels)
>>
>> PostDeletionCallback(labels.Labels)
>>
>> }
>>
>> *1. SeriesLifecycleCallback*
>>
>> It would be a part of DB.Options 
>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/db.go#L73-L116>
>>
>> *2. PreCreationCallback*
>>
>> It would be called just before holding the lock here 
>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1654>.
>>  
>> And if the method returns an error, it means we don’t want to create a new 
>> series. 
>>
>> Further the hashes would be checked as usual 
>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1656-L1659>,
>>  
>> and if we reach the step of adding a new hash 
>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1660>,
>>  
>> it would be either executed or skipped based on the result of 
>> `PreCreationCallback`.
>>
>> If the series creation was skipped, the error returned by 
>> `PreCreationCallback` would be propagated back into the call stack and the 
>> `Add` or `AddFast` method would finally return back the error. (This means 
>> additional return values for the methods in that call stack).
>>
>> *3. PostCreationCallback*
>>
>> This would be called after releasing the lock here 
>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1661>
>>  
>> letting the caller know that the series was created.
>>
>> *4. PostDeletionCallback *
>>
>>
>> It would be called after deleting a series during garbage collection here 
>> (after unlocking) 
>> <https://github.com/prometheus/prometheus/blob/fc3fb3265a0f0061d07d0bf5fb4103e2144565b2/tsdb/head.go#L1625>
>>  
>> so that external projects can do the bookkeeping.
>>
>> ----
>>
>> In the context of Prometheus, the interface would be a noop, hence no 
>> performance impact.
>>
>
> It'd still cost a few cycles.
>

But it would be very less to even notice.

 
>
>>
>> *Some reasoning behind the design decisions and what guarantees it 
>> provides:*
>>
>> The idea was to avoid having the callbacks inside the locked section 
>> which could lead to performance degradation. Hence the Series creation 
>> callback was split into 2 methods. PreCreationCallback method tells 
>> whether to create the series if we need to create and the next method 
>> `PostCreationCallback` to let the user know if it was created, so that 
>> appropriate bookkeeping can be done.
>>
>> Since all the callbacks happen outside the locks (even if it was inside 
>> the lock, there could be concurrent calls for different shards of the 
>> stripe series), the user can expect some soft consistencies here. For 
>> example, the PreCreationCallback could be called concurrently by 
>> multiple series creation when you are at the edge of the series limit and 
>> all the series creation would be allowed, which might cause the limit to 
>> cross by a small number. Similarly during the deletion phase: some creation 
>> of series would fail as the deletion callback is called a little later. 
>> Hence the user is required to expect soft consistencies while using these 
>> callbacks.
>>
>> What do you all think about this addition in TSDB?
>>
>
> This seems to require adding error handling for cases where Prometheus 
> cannot have errors, what happens to a scrape or rule evaluation when this 
> fails?
>

The Add 
<https://github.com/prometheus/prometheus/blob/4b5e7d4984a135abd3729b9a2a5904336d35bc21/tsdb/head.go#L931>
 
and AddFast 
<https://github.com/prometheus/prometheus/blob/4b5e7d4984a135abd3729b9a2a5904336d35bc21/tsdb/head.go#L957>
 
methods already returns an error. So failure in series creation would 
follow the same procedure. But thinking about it more, it will be a no-op 
for Prometheus, and I don't think the external users who would use this 
feature will have scraping in it (considering it as a "TSDB" package and 
not "Prometheus" package).
 

> I'm also not sure it's considering that the semantics of how this is all 
> managed inside the tsdb is not what you might expect. For example a failed 
> scrape due to hitting the sample limit can still cause series to be 
> created, which usually doesn't come up but might now. 
>

Yes, all the changes in TSDB are not totally explored yet, I will work on a 
draft to see what all needs changing. Additionally, series is still created 
currently if the appender Commit() fails (a failed scrape?).
 

> -- 
> Brian Brazil
> www.robustperception.io
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-developers/47ae1780-e4f6-4321-93a5-01a0863c793a%40googlegroups.com.

Reply via email to