Thank you everyone for the suggestions!

I agree with the age-based solutions, but such a solution mainly helps 
systems that have a limitation on time, and many don't. Given the scenario, 
can we have both? If users have a remote-storage system that respects time, 
they can use the time-based dropping logic. If their remote storage accepts 
a sample with any timestamp (past or future), they can use the retry-count 
method. This would avoid recurring errors, like the null-byte case.

We could have something like *LimitRetryPolicy* with values *time* or 
*retries*. If it is *time*, the input is the maximum age; if it is 
*retries*, the input is the maximum retry count. That way we solve both 
problems and leave the choice to the user, based on the storage system they 
are using.

Does that look good, or should we do just the age-based way?

Thank you

On Tuesday, March 2, 2021 at 6:32:02 AM UTC+5:30 [email protected] wrote:

> Harkishen, thank you very much for the design document!
>
> My initial thoughts are to agree with Stuart (as well as some users in the 
> linked github issue) that it makes the most sense to start with dropping 
> data that is older than some configured age. The default being to never 
> drop data. For most outage scenarios I think this is the easiest to 
> understand, and if there is an outage retrying old data x times still does 
> not help you much.
>
> There are a couple of use cases that an age-based solution doesn't solve 
> ideally:
> 1. An issue where bad data is causing the upstream system to break, e.g. I 
> have seen a system return a 5xx due to a null byte in a label value causing 
> some sort of panic. This blocks Prometheus from being able to process any 
> samples newer than that bad sample. Yes this is an issue with the remote 
> storage, but it sucks when it happens and it would be nice to have an easy 
> workaround while a fix goes into the remote system. In this scenario, only 
> dropping old data still means you wouldn't be sending anything new for 
> quite a while, and if the bad data is persistent you would likely just end 
> up 10 minutes to an hour behind permanently (whatever you set the age to be).
> 2. Retrying 429 errors, a new feature currently behind a flag, but it 
> could make sense to only retry 429s a couple of times (if you want to retry 
> them at all) but then drop the data so that non-rate limited requests can 
> proceed in the future.
>
> I think, to start with, the above limitations are fine and the age-based 
> system is probably the way to go. I also wonder if it is worth defining a 
> more generic "retry_policies" section of remote write that could contain 
> different options for 5xx vs 429.
>
> On Mon, Mar 1, 2021 at 3:32 AM Ben Kochie <[email protected]> wrote:
>
>> If a remote write receiver is unable to ingest, wouldn't this be 
>> something to fix on the receiver side? The receiver could have a policy 
>> where it drops data rather than returning an error.
>>
>> This way Prometheus sends, but doesn't need to know or deal with 
>> ingestion policies. It sends a bit more data over the wire, but that part 
>> is cheap compared to the ingestion costs.
>>
>
> I certainly see the argument that this could all be cast as a 
> receiver-side issue, but I have also personally experienced outages that 
> were much harder to recover from due to a thundering herd scenario once the 
> service was restored. E.g. cortex distributors (where an ingestion policy 
> would be implemented) effectively locking up or OOMing at a high enough 
> request rate. Also, an administrator may not be able to update whatever 
> remote storage solution they use. This becomes even more painful in a 
> resource constrained environment. The solution right now is to go restart 
> all of your Prometheus instances to indiscriminately drop data, I would 
> prefer to be intentional about what data is dropped.
>
> I would certainly be happy to jump on a call sometime with interested 
> parties if that would be more efficient :)
>
> Chris
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-developers/e9c28b1b-feb2-4f8f-8d98-a4a2a982956bn%40googlegroups.com.
