Re: [METRICS] Metrics names inconsistent between executions

2019-05-07 Thread Stavros Kontopoulos
Hi,

With jmx_exporter and
Prometheus you can always rewrite the metric patterns on the fly. Btw, if
you use Grafana it's easy to filter things even without the rewrite.
If this is a custom dashboard you can always group metrics based on the
spark.app.id as a prefix, no? Also, I think sometimes it's good to know if
some executor failed and why, and to report specific execution metrics, for
example if you have skewed data and that caused JVM issues.
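
To make the rewrite idea concrete, here is a minimal Python sketch of the kind
of regex rewrite meant above: it splits a raw metric name into a stable name
plus labels. It is not jmx_exporter configuration. The name shape is taken from
the example in the original mail, and the sample app ID, the label names, and
the assumption that neither ID contains an underscore are all hypothetical.

```python
import re

# Assumed raw name shape, taken from the example in this thread:
#   <app id>_driver_<metric>   or   <app id>_<executor id>_<metric>
# Assumes neither ID contains an underscore; real IDs may need a stricter pattern.
RAW_NAME = re.compile(r"^(?P<app>[^_]+)_(?P<instance>[^_]+)_(?P<metric>.+)$")


def rewrite(raw_name):
    """Split a raw metric name into a stable name plus a dict of labels."""
    match = RAW_NAME.match(raw_name)
    if match is None:
        # Leave names we do not recognise untouched.
        return raw_name, {}
    instance = match.group("instance")
    role = "driver" if instance == "driver" else "executor"
    labels = {"app_id": match.group("app"), "role": role}
    if role == "executor":
        labels["executor_id"] = instance
    return match.group("metric"), labels


# Hypothetical app ID, for illustration only.
print(rewrite("app-20190506232900-0001_driver_jvm_heap_used"))
print(rewrite("app-20190506232900-0001_7_jvm_heap_used"))
```

Both calls give the same metric name, jvm_heap_used, so a dashboard can query
one name and filter or group by the labels.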

Stavros
On Mon, May 6, 2019 at 11:29 PM Anton Kirillov wrote:

> Hi everyone!
>
> We are currently working on building a unified monitoring/alerting
> solution for Spark and would like to rely on Spark's own metrics to avoid
> divergence from the upstream. One of the challenges is to support metrics
> coming from multiple Spark applications running on a cluster: scheduled
> jobs, long-running streaming applications etc.
>
> Original problem:
> Spark assigns metrics names using *spark.app.id*
> and *spark.executor.id* as a part of them.
> Thus the number of metrics is continuously growing because those IDs are
> unique between executions whereas the metrics themselves report the same
> thing. Another issue which arises here is how to use constantly changing
> metric names in dashboards.
>
> For example, *jvm_heap_used* reported by all Spark instances (components):
> - <spark.app.id>_driver_jvm_heap_used (Driver)
> - <spark.app.id>_<spark.executor.id>_jvm_heap_used (Executors)
>
> While *spark.app.id* can be overridden with
> *spark.metrics.namespace*, there's no such option for *spark.executor.id*,
> which makes it impossible to build a reusable
> dashboard because (given the uniqueness of IDs) differently named metrics
> are emitted for each execution.
>
> One of the possible solutions would be to make executor metrics names
> follow the driver's metrics name pattern, e.g.:
> - <spark.app.id>_driver_jvm_heap_used (Driver)
> - <spark.app.id>_executor_jvm_heap_used (Executors)
>
> and distinguish executors based on tags (tags should be configured in
> metric reporters in this case). Not sure if this could potentially break
> Driver UI though.
>
> I'd really appreciate any feedback on this issue and would be happy to
> create a Jira issue/PR if this change looks sane for the community.
>
> Thanks in advance.
>
> --
> *Anton Kirillov*
> Senior Software Engineer, Mesosphere
>


[METRICS] Metrics names inconsistent between executions

2019-05-06 Thread Anton Kirillov
Hi everyone!

We are currently working on building a unified monitoring/alerting solution
for Spark and would like to rely on Spark's own metrics to avoid divergence
from the upstream. One of the challenges is to support metrics coming from
multiple Spark applications running on a cluster: scheduled jobs,
long-running streaming applications etc.

Original problem:
Spark assigns metrics names using *spark.app.id* and *spark.executor.id*
as a part of them. Thus the number of metrics
is continuously growing because those IDs are unique between executions
whereas the metrics themselves report the same thing. Another issue which
arises here is how to use constantly changing metric names in dashboards.

For example, *jvm_heap_used* reported by all Spark instances (components):
- <spark.app.id>_driver_jvm_heap_used (Driver)
- <spark.app.id>_<spark.executor.id>_jvm_heap_used (Executors)

While *spark.app.id* can be overridden with
*spark.metrics.namespace*, there's no such option for *spark.executor.id*,
which makes it impossible to build a reusable
dashboard because (given the uniqueness of IDs) differently named metrics
are emitted for each execution.

One of the possible solutions would be to make executor metrics names
follow the driver's metrics name pattern, e.g.:
- <spark.app.id>_driver_jvm_heap_used (Driver)
- <spark.app.id>_executor_jvm_heap_used (Executors)

and distinguish executors based on tags (tags should be configured in
metric reporters in this case). Not sure if this could potentially break
Driver UI though.
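
To illustrate the difference, here is a small Python sketch comparing the two
naming schemes. It is only a model of the idea, not Spark code: the function
names, the "my_app" namespace and the "executor_id" tag key are hypothetical.

```python
def current_name(namespace, instance_id, metric):
    # Scheme described above: the instance ID is part of the metric name.
    return f"{namespace}_{instance_id}_{metric}"


def proposed_name(namespace, instance_id, metric):
    # Proposed scheme: a fixed "driver"/"executor" segment in the name,
    # with the executor ID moved into a tag for the reporter to attach.
    if instance_id == "driver":
        return f"{namespace}_driver_{metric}", {}
    return f"{namespace}_executor_{metric}", {"executor_id": instance_id}


executors = ["1", "2", "3"]
old = {current_name("my_app", e, "jvm_heap_used") for e in executors}
new = {proposed_name("my_app", e, "jvm_heap_used")[0] for e in executors}
print(len(old), len(new))  # prints "3 1"
```

With three executors the current scheme yields three distinct metric names,
while the proposed one yields a single name and keeps the executor ID
available as a tag.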

I'd really appreciate any feedback on this issue and would be happy to
create a Jira issue/PR if this change looks sane for the community.

Thanks in advance.

-- 
*Anton Kirillov*
Senior Software Engineer, Mesosphere