[ 
https://issues.apache.org/jira/browse/FLINK-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985965#comment-16985965
 ] 

Theo Diefenthal commented on FLINK-13418:
-----------------------------------------

I think that the major concern here is the bridge between Flink and InfluxDB 
and the problem in the end comes down to the reason why we use metrics at all:

We usually use metrics to find and explain problems when or after they occur. 
It is thus especially important for a metric system to remain stable when the 
application crashes or misbehaves.

Currently, if my job breaks for some reason and restarts very often, it will 
soon crash InfluxDB, as already explained above. We _could_ limit the number 
of restarts somehow, but I really want my jobs to keep trying to restart, as I 
usually expect some partner system to be down rather than a failure in my 
application code to be causing the continuous restarts.

So the other option is that InfluxDB memory requirements should not grow 
indefinitely across restarts, which means that we need to keep the tag 
cardinality constant. (BTW, thanks [~yunta] for pointing me to tsi1, which 
reduced our problems a lot, but not completely.) In my case, with task names 
and ids properly assigned and Flink running on YARN, I observe the following 
problematic tags, i.e. tags with high cardinality that grows on 
restart/reschedule, ordered by cardinality descending:
{code:java}
task_attempt_id
tm_id
job_id
task_attempt_num
{code}
For those tags, it would be great if we could disable them or store them as 
fields, ideally in a configurable way. I know that storing them as fields 
would cause much more storage overhead and lose the index, but we could 
compute the storage capacity beforehand and plan our resources. Tags, in 
contrast, can explode unexpectedly on application crash without any resource 
limit, bounded only by how fast the application restarts.
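
To make the idea concrete, here is a minimal sketch of how such a configurable 
split could look. This is not the existing InfluxdbReporter code: the class 
{{VariableSplitter}} and the two sets {{asFields}} and {{dropped}} are 
hypothetical names of mine. I assume the variables arrive as a map with 
bracketed keys such as {{<job_id>}}, as returned by {{#getAllVariables()}}.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Flink's InfluxdbReporter.
final class VariableSplitter {
    // Variables that stay indexed tags (should have bounded cardinality).
    final Map<String, String> tags = new HashMap<>();
    // Variables stored as unindexed fields (do not create new series).
    final Map<String, Object> fields = new HashMap<>();

    // variables: scope variables, assumed to use keys like "<job_id>".
    // asFields:  normalized names to store as fields instead of tags.
    // dropped:   normalized names to omit entirely.
    static VariableSplitter split(Map<String, String> variables,
                                  Set<String> asFields,
                                  Set<String> dropped) {
        VariableSplitter result = new VariableSplitter();
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            String name = normalize(entry.getKey());
            if (dropped.contains(name)) {
                continue; // disabled by configuration
            }
            if (asFields.contains(name)) {
                result.fields.put(name, entry.getValue());
            } else {
                result.tags.put(name, entry.getValue());
            }
        }
        return result;
    }

    // Strips surrounding angle brackets: "<job_id>" becomes "job_id".
    static String normalize(String key) {
        if (key.length() >= 2 && key.startsWith("<") && key.endsWith(">")) {
            return key.substring(1, key.length() - 1);
        }
        return key;
    }
}
```

With {{task_attempt_id}} and {{task_attempt_num}} configured as fields and 
{{tm_id}} dropped, a restart would add new field values but no new series, so 
the index size would no longer depend on how often the job restarts.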

> Avoid InfluxdbReporter to report unnecessary tags
> -------------------------------------------------
>
>                 Key: FLINK-13418
>                 URL: https://issues.apache.org/jira/browse/FLINK-13418
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics
>            Reporter: Yun Tang
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently, when building measurement info within {{InfluxdbReporter}}, it 
> includes all variables as tags (please see code 
> [here|https://github.com/apache/flink/blob/d57741cef9d4773cc487418baa961254d0d47524/flink-metrics/flink-metrics-influxdb/src/main/java/org/apache/flink/metrics/influxdb/MeasurementInfoProvider.java#L54]).
>  However, users can adjust their own scope format to drop unnecessary 
> scopes, while {{InfluxdbReporter}} still reports all the scopes as tags to 
> InfluxDB.
> This is because the current {{MetricGroup}} lacks any method to get only the 
> necessary scopes; it offers only {{#getScopeComponents()}} and 
> {{#getAllVariables()}}. In other words, InfluxDB needs a tag-key and a 
> tag-value to compose its tags, while we can only get all variables (without 
> any filtering according to the scope format) or only scopeComponents (which 
> could be treated as tag-values). I think that's why the previous 
> implementation has to report all tags.
> From our experience with InfluxDB, as the number of tags contributes to the 
> overall series count in InfluxDB, it is never a good idea to include too many 
> tags, not to mention that the [default value of series per 
> database|https://docs.influxdata.com/influxdb/v1.7/troubleshooting/errors/#error-max-series-per-database-exceeded]
>  is only one million.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)