This is an automated email from the ASF dual-hosted git repository.

dannycranmer pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/flink-web.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new f95c24752  Update blogpost for Prometheus connector
f95c24752 is described below

commit f95c247521c5a28b8ea1a55f181f274ff9ede0eb
Author: Anthony Pounds-Cornish <an...@amazon.co.uk>
AuthorDate: Fri Dec 6 17:18:23 2024 +0000

    Update blogpost for Prometheus connector
---
 ...024-12-05-introducing-new-prometheus-connector.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/content/posts/2024-12-05-introducing-new-prometheus-connector.md b/docs/content/posts/2024-12-05-introducing-new-prometheus-connector.md
index fa1fe6e5d..a529fce1e 100644
--- a/docs/content/posts/2024-12-05-introducing-new-prometheus-connector.md
+++ b/docs/content/posts/2024-12-05-introducing-new-prometheus-connector.md
@@ -15,7 +15,7 @@ This connector allows writing data to Prometheus using the [Remote-Write](https:
 Prometheus is an efficient time-series database optimized for building real-time dashboards and alerts, typically in combination with Grafana or other visualization tools.
-Prometheus is commonly used for observability, to monitor compute resources, Kubernetes clusters, and applications. It can also be used to observe Flink clusters an jobs. The Flink [Metric Reporters](https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#prometheus) has exactly this purpose.
+Prometheus is commonly used for observability, to monitor compute resources, Kubernetes clusters, and applications. It can also be used to observe Flink clusters and jobs. The Flink [Metric Reporter](https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#prometheus) has exactly this purpose.
 So, why do we need a connector?
@@ -27,11 +27,11 @@ Observability data from these use cases differs from metrics generated by comput
 * **Out-of-order events**: Devices may be connected via mobile networks or even Bluetooth. Events from different devices may follow different paths and arrive at very different times. A **stateful, event-time logic** can be used to reorder them.
 * **High frequency** and **high cardinality**: You can have a sheer number of devices, each emitting signals multiple times per second. **Aggregating over time** and **over dimensions** can reduce frequency and cardinality and make the volume of data more efficiently analysable.
 * **Lack of contextual information**: Raw events sent by the devices often lack of contextual information for a meaningful analysis. **Enrichment** of raw events, looking up some reference dataset, can be used to add dimensions useful for the analysis.
-* **Noise**: sensor measurement may contain noise. For example when a GPS tracker lose connection and reports spurious positions. These obvious outliers can be **filtered** out to simplify visualization and analysis.
+* **Noise**: sensor measurement may contain noise. For example when a GPS tracker loses connection and reports spurious positions. These obvious outliers can be **filtered** out to simplify visualization and analysis.
 Flink can be used as a pre-processor to address all the above.
-You can implement a sink from scratch or use AsyncIO to call the Prometheus Remote-Write endpoint. However, there are not-trivial details to implement an efficient Remote-Write client:
+You can implement a sink from scratch or use AsyncIO to call the Prometheus Remote-Write endpoint. However, there are non-trivial details to implement an efficient Remote-Write client:
 * There is no high-level client for Prometheus Remote-Write. You would need to build on top of a low-level HTTP client.
 * Remote-Write can be inefficient unless write requests are batched and parallelized.
 * Error handling can be complex, and specifications demand strict behaviors (see [Strict Specifications, Lenient Implementations](#strict-specifications-lenient-implementations)).
@@ -111,7 +111,7 @@ Specifications do not impose any constraints on repeating `TimeSeries` with the
 {{< hint info >}}
 The term "time-series" is overloaded, referring to both:
-1. A unique series of samples in the datastore, identified by unique set of labels,
+1. A unique series of samples in the datastore, identified by a unique set of labels,
 and
 2. `TimeSeries` as a block of the `WriteRequest`.
@@ -122,7 +122,7 @@ The two concepts are obviously related, but a `WriteRequest` may contain multipl
 Flink is designed with a consistency-first approach. By default, any unexpected error causes the streaming job to fail and restart from the last checkpoint.
-In contrast, Prometheus is designed with an availability-first approach, prioritizing fast ingestion over strict consistency. When a write request contains malformed entries, the entire request must be discarded and not retried. If you retry, Prometheus will keep rejecting your write, so there is no point of doing it.
+In contrast, Prometheus is designed with an availability-first approach, prioritizing fast ingestion over strict consistency. When a write request contains malformed entries, the entire request must be discarded and not retried. If you retry, Prometheus will keep rejecting your write, so there is no point doing it.
 Additionally, samples belonging to the same time series (with the same dimensions or labels) must be written in strict timestamp order. You may have already spotted the issue: any malformed or out-of-order samples can act as a “poison pill” unless you drop the offending request and proceed.
@@ -178,14 +178,14 @@ In particular, the unhappy scenarios that have been tested include:
 * Malformed (label names violating the specifications), and out of order writes.
 * Restart from savepoint and from checkpoint.
 * Behavior under backpressure, simulated exceeding the destination Prometheus quota, letting the connector retrying forever after being throttled.
-* Maximum number of retries is exceeded, also simulated via Prometheus throttling, but with a low maximum reties. The connector behavior in this case is configurable, so both “fail” (job fails and restart) and “discard and continue” behaviors have been tested.
+* Maximum number of retries is exceeded, also simulated via Prometheus throttling, but with a low maximum retries value. The connector behavior in this case is configurable, so both “fail” (job fails and restart) and “discard and continue” behaviors have been tested.
 * Remote-write endpoint is not reachable.
 * Remote-write authentication fails.
 * Inconsistent configuration, such as (numeric) parameters outside the expected range.
-To facilitate testing we created a data generator, a separate Flink application, capable of generating semi-random data and of introducing specific errors, like out-of-order samples. The generator can also produce a high volume of data for load and stress testing. One aspect of the test harness requiring special attention was not loosing ordering. For example, we used Kafka between the generator and the writer application. Kinesis was not an option due to its lack of strict ordering gua [...]
+To facilitate testing we created a data generator, a separate Flink application, capable of generating semi-random data and of introducing specific errors, like out-of-order samples. The generator can also produce a high volume of data for load and stress testing. One aspect of the test harness requiring special attention was not losing ordering. For example, we used Kafka between the generator and the writer application. Kinesis was not an option due to its lack of strict ordering guarantees.
-Finally, the connector was also stress-tested, writing up to 1,000,000 samples per second with 1,000,000 cardinality (distinct time-series). 1 million is where we stopped testing, not a limit of the connector itself. This throughput has been achieved with parallelism between 24 and 32, depending on the number of samples per input record. Obviously, your mileage may vary, depending on your Prometheus backend, the samples per record,and the number of dimensions.
+Finally, the connector was also stress-tested, writing up to 1,000,000 samples per second with 1,000,000 cardinality (distinct time-series). 1 million is where we stopped testing, not a limit of the connector itself. This throughput has been achieved with parallelism between 24 and 32, depending on the number of samples per input record. Obviously, your mileage may vary, depending on your Prometheus backend, the samples per record, and the number of dimensions.
 ## Future improvements
@@ -193,7 +193,7 @@ There are a couple of obvious improvements for future releases:
 1. Table API/SQL support
 2. Optional validation of input data
-Both these features have been actually considered in the first release, but excluded due to the challenges that would pose. Let's go through some of these considerations.
+Both these features were considered for the first release, but excluded due to the challenges they would pose. Let's go through some of these considerations.
 ### Consideration about Table API /SQL interface
@@ -207,7 +207,7 @@ A user-friendly Table API implementation would require flattening this structure
 Data validation would be a convenient feature, allowing to discard or send to a "dead letter queue" invalid records, before the sink attempts the write request that would be rejected by Prometheus. This would reduce data loss in case of malformed data, because Prometheus rejects the entire write request (the batch) regardless of the number of offending records.
-However, validating well-formed input would come with a significant performance cost. It would require checking every `Label` with a regular expressions, and check the ordering of the list of `Labels`, on every single input records.
+However, validating well-formed input would come with a significant performance cost. It would require checking every `Label` with a regular expression, and checking the ordering of the list of `Labels`, on every single input record.
 Additionally, checking `Sample` ordering in the sink would not allow reordering, unless you introduce some form of longer windowing that would inevitably increase latency. If latency is not a problem, some form of reordering can be implemented by the user, upstream of the connector.
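
For context on the per-record validation cost discussed in the final hunk: it amounts to a regular-expression check on every label name plus an ordering check over the label list, for every input record. The sketch below illustrates roughly what that involves. The `TimeSeries` and `Label` records and the `isValid` helper are simplified stand-ins for illustration, not the connector's actual classes; the label-name pattern and the sorted-labels requirement come from the Prometheus data model and Remote-Write specification.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: roughly what per-record validation against the
// Prometheus label rules would involve. These records are simplified
// stand-ins, not the connector's actual classes.
public class LabelValidationSketch {

    // Label name format defined by the Prometheus data model.
    private static final Pattern LABEL_NAME = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    record Label(String name, String value) {}

    record TimeSeries(List<Label> labels) {}

    // Returns true only if every label name is well-formed and the labels are
    // sorted by name, as Remote-Write expects within a single TimeSeries.
    static boolean isValid(TimeSeries series) {
        String previous = null;
        for (Label label : series.labels()) {
            // Regex check on every label of every input record:
            // this is the per-record cost the post refers to.
            if (!LABEL_NAME.matcher(label.name()).matches()) {
                return false;
            }
            // Ordering check over the label list.
            if (previous != null && previous.compareTo(label.name()) > 0) {
                return false;
            }
            previous = label.name();
        }
        return true;
    }

    public static void main(String[] args) {
        TimeSeries ok = new TimeSeries(List.of(
                new Label("__name__", "car_speed"), new Label("car_id", "DMC-1981")));
        TimeSeries bad = new TimeSeries(List.of(
                new Label("car id", "DMC-1981"))); // space is not allowed in a label name
        System.out.println(isValid(ok));  // true
        System.out.println(isValid(bad)); // false
    }
}
```

Running a check like this on every record of a high-frequency stream is what makes optional validation the non-trivial performance trade-off described in the post.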