If a remote write receiver is unable to ingest, wouldn't this be something to fix on the receiver side? The receiver could have a policy where it drops data rather than returning an error.
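A minimal sketch of such a receiver-side policy, assuming a hypothetical `ingest` function and `Sample` type (neither is part of Prometheus or of any real receiver): samples older than a configured maximum age are dropped silently, and the request is still answered with a success status, so the sender has nothing to retry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample type, for illustration only."""
    metric: str
    timestamp_ms: int
    value: float


def ingest(samples, now_ms, max_age_ms):
    """Keep fresh samples, drop stale ones, and report success either way.

    Returns (http_status, accepted_samples, dropped_count).
    """
    cutoff_ms = now_ms - max_age_ms
    accepted = [s for s in samples if s.timestamp_ms >= cutoff_ms]
    dropped = len(samples) - len(accepted)
    # Answering 200 instead of an error means the sender treats the
    # whole batch as delivered and does not resend the dropped samples.
    return 200, accepted, dropped
```

Whether a receiver should count or log the dropped samples is a separate design choice; the sketch only returns the count.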
This way Prometheus sends the data but doesn't need to know about or deal with ingestion policies. It sends a bit more data over the wire, but that part is cheap compared to the ingestion costs.

On Mon, Mar 1, 2021 at 11:13 AM Stuart Clark <[email protected]> wrote:

> On 01/03/2021 07:25, Harkishen Singh wrote:
> > Hi Tom,
> >
> > I have tried to answer the comments. Please comment on their
> > satisfactoriness. I am happy for a call if required (or discussion
> > gets tough).
> >
> > I think, the lossless nature can be controlled by the user based on
> > the config (limit_retries), and let the users have more control, as to
> > whether they are happy to compromise a bit, if the retry is too much,
> > since as such, if the retrying happens forever, then I don't think
> > that is helpful (it will never be accepted by the remote storage).
> > Also as Chris mentioned, some users might prefer to have few gaps and
> > give more priority to recent data, like for alerting. So, I think this
> > approach gives more flexibility to the user, at the same time, making
> > it optional (or by setting the retry count high enough).
> >
> Under what situations would retries happen forever?
>
> If the receiver is available but cannot accept the data (for example due
> to metric size limits or age of the samples) I would expect it to reject
> with a 4XX code (permanent failure) which wouldn't trigger any retries.
>
> Alternatively if the receiver is either unavailable or broken it could
> result in "infinite" retries, but in that situation it feels like an age
> based limit instead of retry limit would be better - a short retry limit
> will reject samples that have just been scraped just as quickly as
> samples that are days old. Instead it sounds like an age based limit
> would be better - some systems have restrictions over what age can be
> ingested (e.g. Timestream) or administrators could decide older data has
> no usefulness (e.g. if the receiver is used for alerting or anomaly
> detection). While the system should still reject such old samples once it
> is working again a time based limit would at least reduce the network
> impact once the receiver is back online (no need to send tons of data
> that we know will be rejected).
>
> --
> Stuart Clark
>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-developers/cd97f615-e479-e4be-e85d-672b15c337d8%40Jahingo.com
> .
>

--
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/CABbyFmpOC8EPAnHsj0Zyh5JSworYLciDL6nCXyzSSnHAX981RA%40mail.gmail.com.

