nicusX commented on code in PR #766: URL: https://github.com/apache/flink-web/pull/766#discussion_r1860867485
##########
docs/content/posts/2024-11-26-introducing-new-prometheus-connector.md:
##########
@@ -0,0 +1,202 @@
---
title: "Introducing the new Prometheus connector"
date: "2024-11-26T00:00:00.000Z"
authors:
- nicusX:
  name: "Lorenzo Nicora"
---

We are excited to announce a new sink connector that enables writing data to Prometheus ([FLIP-312](https://cwiki.apache.org/confluence/display/FLINK/FLIP-312:+Prometheus+Sink+Connector)). This article introduces the main features of the connector and the reasoning behind its design decisions.

The connector allows writing data to Prometheus using the [Remote-Write](https://prometheus.io/docs/specs/remote_write_spec/) push interface, which lets you write time-series data to Prometheus at scale.

## Motivations for a Prometheus connector

Prometheus is an efficient time-series database optimized for building real-time dashboards and alerts, typically in combination with Grafana or other visualization tools.

Prometheus is commonly used to monitor compute resources, IT infrastructure, Kubernetes clusters, applications, and cloud resources. It can also be used to observe your Flink cluster and Flink jobs: Flink's existing [Metric Reporters](https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/) serve this purpose.

So, why do we need a connector?

Prometheus can serve as a general-purpose observability time-series database, beyond traditional infrastructure monitoring. For example, it can be used to monitor IoT devices, sensors, connected cars, media streaming devices, and any resource that streams events or measurements continuously.

Observability data from these use cases differs from the metrics generated by compute resources and presents additional challenges:
* **Out-of-order events**: Devices may be connected via mobile networks or even Bluetooth. Events from different devices may follow different paths and arrive at very different times. **Stateful, event-time logic** can be used to reorder them.
* **High frequency** and **high cardinality**: There may be a huge number of devices, each emitting signals multiple times per second. **Aggregating over time** and **over dimensions** can reduce frequency and cardinality, making the volume of data easier to analyze efficiently.
* **Lack of contextual information**: Raw events sent by devices often lack the contextual information needed for meaningful analysis. **Enriching** raw events, by looking up a reference dataset, can add dimensions that are useful for the analysis.
* **Noise**: Sensor measurements may contain noise, for example when a GPS tracker loses connection and reports spurious positions. These obvious outliers can be **filtered** out to simplify visualization and analysis.

Flink can be used as a pre-processor to address all of the above.

You can implement a sink from scratch or use AsyncIO to call the Prometheus Remote-Write endpoint. However, there are non-trivial details to implementing an efficient Remote-Write client:
* There is no high-level client for Prometheus Remote-Write. You would need to build on top of a low-level HTTP client.
* Remote-Write can be inefficient unless write requests are batched and parallelized.
* Error handling can be complex, and the specifications demand strict behaviors (see [Strict specifications, lenient implementations](#strict-specifications-lenient-implementations)).

The new Prometheus connector manages all of this for you.
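
To give a sense of what using the connector looks like, here is a minimal sketch of a job writing a single time-series to Prometheus. The element type, builder methods, and package names shown are illustrative assumptions rather than the definitive API; the rest of this post describes the actual features and behavior.

```
// Assumed connector classes; the package and builder method names
// are illustrative, not the definitive API.
import org.apache.flink.connector.prometheus.sink.PrometheusSink;
import org.apache.flink.connector.prometheus.sink.PrometheusTimeSeries;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PrometheusSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each element carries the labels and samples of one time-series.
        DataStream<PrometheusTimeSeries> samples = env.fromElements(
                PrometheusTimeSeries.builder()
                        .withMetricName("car_speed")                 // becomes the __name__ label
                        .addLabel("model", "Speedster-500")          // an additional dimension
                        .addSample(85.0, System.currentTimeMillis()) // value, timestamp
                        .build());

        // Builder-style configuration, in line with other AsyncSinkBase-based connectors.
        samples.sinkTo(
                PrometheusSink.builder()
                        .setPrometheusRemoteWriteUrl("https://example.com/api/v1/remote_write")
                        .setMaxBatchSizeInSamples(500) // write batching is configurable
                        .build());

        env.execute("Prometheus sink example");
    }
}
```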

## Key features

Version `1.0.0` of the Prometheus connector has the following features:

* DataStream API, Java Sink, based on AsyncSinkBase.
* Configurable write batching.
* Order is retained on retries.
* At-most-once guarantees (we'll explain the reason for this later).
* Supports parallelism greater than 1.
* Configurable retry strategy for retryable server responses (e.g., throttling from the Prometheus backend), and configurable behavior when the retry limit is exceeded.

### Authentication

Authentication is [explicitly out of scope](https://prometheus.io/docs/specs/remote_write_spec/#out-of-scope) for the Prometheus Remote-Write specifications.
The connector provides a generic interface, `PrometheusRequestSigner`, to manipulate requests and add headers. This allows the implementation of any authentication scheme that requires adding headers to the request, such as API keys, authorization, or signature tokens.

In release `1.0.0`, an implementation for Amazon Managed Service for Prometheus (AMP) is provided as a separate, optional dependency.
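
For illustration, this is a minimal sketch of a signer that attaches a static API-key header to every write request. The exact shape of the interface (method name and parameters) is an assumption here; refer to the connector's Javadoc for the actual contract.

```
import java.util.Map;

// A hypothetical signer adding a static API key to every Remote-Write
// request. The method signature is an assumption for illustration.
public class ApiKeyRequestSigner implements PrometheusRequestSigner {

    private final String apiKey;

    public ApiKeyRequestSigner(String apiKey) {
        this.apiKey = apiKey;
    }

    @Override
    public void addSignatureHeaders(Map<String, String> requestHeaders, byte[] requestBody) {
        // Add whatever headers the backend's authentication scheme requires.
        requestHeaders.put("X-Api-Key", apiKey);
    }
}
```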

## Designing the connector

A sink to Prometheus differs from a sink to most other datastores or time-series databases. The wire interface is the easiest part; the main challenges arise from the differing design goals of Prometheus and Flink, as well as from the [Remote-Write specifications](https://prometheus.io/docs/specs/remote_write_spec/). Let's go through some of these aspects to better understand the behavior of this connector.

### Data model and wire protocol

Prometheus is a dimensional time-series database. Each time-series is identified by a unique set of Labels (key/value pairs) and is represented by a series of Samples (timestamp and value). Labels represent the dimensions, with one label, named `__name__`, playing the special role of metric name. Records (Samples) are simply double values in order of timestamp (a long, as UNIX time since the Epoch).

<center>
<br/>
<img src="/img/blog/2024-11-26-introducing-new-prometheus-connector/prometheus-data-model.png" width="60%"/>
<br/>
Fig. 1 - Logical data model of the Prometheus TSDB.
</center>

The wire-protocol payload defined by the [Remote-Write specifications](https://prometheus.io/docs/specs/remote_write_spec/) allows batching, in a single write request, samples from multiple time-series, and multiple samples for each time-series. This is the Protobuf definition of the single method exposed by the interface:

```
func Send(WriteRequest)

message WriteRequest {
  repeated TimeSeries timeseries = 1;
  // [...]
}

message TimeSeries {
  repeated Label labels = 1;
  repeated Sample samples = 2;
}

message Label {
  string name = 1;
  string value = 2;
}

message Sample {
  double value = 1;
  int64 timestamp = 2;
}
```

The POST payload is serialized as Protobuf 3, compressed with [Snappy](https://github.com/google/snappy).

### Specifications constraints

The specifications do not impose any constraints on repeating `TimeSeries` with the same set of `Label` entries, but they do require strict ordering of the `Sample` entries belonging to the same time-series (by `timestamp`) and of the `Label` entries within a single `TimeSeries` (by name).

{{< hint info >}}
The term "time-series" is overloaded, referring to both:
1. A unique series of samples in the datastore, identified by a unique set of labels, and
2. `TimeSeries` as a block of the `WriteRequest`.

The two concepts are obviously related, but a `WriteRequest` may contain multiple `TimeSeries` elements referring to the same datastore *time-series*.
{{< /hint >}}

### Prometheus design principles vs Flink design principles

Flink is designed with a consistency-first approach. By default, any unexpected error causes the streaming job to fail and restart from the last checkpoint.

In contrast, Prometheus is designed with an availability-first approach, prioritizing fast ingestion over strict consistency. When a write request contains malformed entries, the entire request must be discarded and not retried. Additionally, samples belonging to the same time-series (with the same dimensions, or labels) must be written in strict timestamp order.

You may have already spotted the issue: any malformed or out-of-order samples can act as a "poison pill" unless you drop the offending request and proceed.

### Strict specifications, lenient implementations

The [Remote-Write specifications](https://prometheus.io/docs/specs/remote_write_spec/) are very strict regarding the [ordering of `Labels`](https://prometheus.io/docs/specs/remote_write_spec/#labels) within a request, the [ordering of `Samples`](https://prometheus.io/docs/specs/remote_write_spec/#ordering) belonging to the same time-series, and the prohibition on [retrying rejected requests](https://prometheus.io/docs/specs/remote_write_spec/#retries-backoff).

However, many Prometheus backend implementations are more lenient. They may support a limited, configurable out-of-order time window and may not enforce strict label ordering, as long as other rules are followed.

As a result, a connector that strictly enforces the Remote-Write specifications may limit the user's flexibility compared to what their specific Prometheus implementation allows.

### Error-handling and guarantees

As we have seen, the Remote-Write specifications require that certain rejected writes, specifically those related to malformed and out-of-order samples (typically 400 responses), must not be retried. The connector has no choice but to discard the entire offending request and continue with the next record. The specifications do not define a standard response format, and there is no reliable way to automatically identify and selectively discard the offending samples.

Review Comment:
   Let me clarify this