[
https://issues.apache.org/jira/browse/SPARK-35502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mati updated SPARK-35502:
-------------------------
Description:
Recently we enabled the prometheusServlet configuration in order to get Spark
master, worker, driver and executor metrics.
We can see and use the Spark master, worker and driver metrics, but we cannot
see the Spark executor metrics.
We are running a Spark Streaming standalone cluster (version 3.0.1) on
physical servers.
We took one of our jobs and added the following parameters to its
configuration, but still could not see executor metrics when curling both the
driver and the executor workers of this job.
These are the parameters:
--conf spark.ui.prometheus.enabled=true \
--conf spark.executor.processTreeMetrics.enabled=true
Curl commands:
[00764f](root@sparktest-40005-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[00764f](root@sparktest-40005-prod-chidc2:~)#
Driver of this job - sparktest-40004:
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
Our UI port is 4050.
I understand that the executor Prometheus endpoint is still experimental,
which may explain the inconsistent behaviour we see, but is there a plan to
fix it?
Are there any known issues regarding this?
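The empty curl responses above can also be checked programmatically. Below is a minimal Python sketch (an illustration, not part of the original report) that fetches the same endpoint and keeps only actual Prometheus sample lines; an empty result reproduces the symptom. The URL assumes the driver UI is reachable on port 4050, as in the curl commands.

```python
import urllib.request

# Assumed endpoint, matching the curl commands in this report
EXECUTOR_METRICS_URL = "http://localhost:4050/metrics/executors/prometheus"

def sample_lines(payload: str) -> list[str]:
    """Keep only Prometheus sample lines (drop blanks and '#' comment lines)."""
    return [line for line in payload.splitlines()
            if line.strip() and not line.startswith("#")]

def probe(url: str = EXECUTOR_METRICS_URL, timeout: float = 5.0) -> list[str]:
    """Fetch the executor metrics endpoint; an empty list reproduces the bug."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return sample_lines(resp.read().decode("utf-8"))
```

Calling `probe()` against a healthy driver should return at least one sample line; here it returns nothing.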
Environment:
metrics.properties
{code:java}
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
## Below may be removed after finalizing the native Prometheus implementation
#
# Enable Prometheus for driver
#
driver.sink.prometheus_chidc2.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
driver.sink.prometheus_chidc2.report-instance-id=false
# Prometheus pushgateway address
driver.sink.prometheus_chidc2.pushgateway-address-protocol=http
driver.sink.prometheus_chidc2.pushgateway-address=pushgateway-spark-master-staging-pod.service.chidc2.consul:9091
driver.sink.prometheus_chidc2.period=60
driver.sink.prometheus_chidc2.pushgateway-enable-timestamp=false
driver.sink.prometheus_chidc2.labels=cluster_name=apache-test-v3,datacenter=chidc2
driver.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001-prod-chidc2.chidc2.outbrain.com
driver.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
driver.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
driver.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops
driver.sink.prometheus_nydc1.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
driver.sink.prometheus_nydc1.report-instance-id=false
# Prometheus pushgateway address
driver.sink.prometheus_nydc1.pushgateway-address-protocol=http
driver.sink.prometheus_nydc1.pushgateway-address=pushgateway ...
driver.sink.prometheus_nydc1.period=60
driver.sink.prometheus_nydc1.pushgateway-enable-timestamp=false
driver.sink.prometheus_nydc1.labels=cluster_name=apache-test-v3,datacenter=chidc2
driver.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001
driver.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
driver.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
driver.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops
#
# Enable Prometheus for executor
#
executor.sink.prometheus_chidc2.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
executor.sink.prometheus_chidc2.report-instance-id=false
# Prometheus pushgateway address
executor.sink.prometheus_chidc2.pushgateway-address-protocol=http
executor.sink.prometheus_chidc2.pushgateway-address=pushgateway-spark-master-staging-pod.service.chidc2.consul:9091
executor.sink.prometheus_chidc2.period=60
executor.sink.prometheus_chidc2.pushgateway-enable-timestamp=false
executor.sink.prometheus_chidc2.labels=cluster_name=apache-test-v3,datacenter=chidc2
executor.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001
executor.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
executor.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
executor.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops
executor.sink.prometheus_nydc1.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
executor.sink.prometheus_nydc1.report-instance-id=false
# Prometheus pushgateway address
executor.sink.prometheus_nydc1.pushgateway-address-protocol=http
executor.sink.prometheus_nydc1.pushgateway-address=pushgateway-spark-master
executor.sink.prometheus_nydc1.period=60
executor.sink.prometheus_nydc1.pushgateway-enable-timestamp=false
executor.sink.prometheus_nydc1.labels=cluster_name=apache-test-v3,datacenter=chidc2
executor.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001
executor.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
executor.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
executor.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops
{code}
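The `metrics-name-capture-regex` / `metrics-name-replacement` pairs above rewrite metric names before they are pushed. As a rough illustration of what the third pair does (this is a Python sketch, not the sink's actual code; the sink uses Java regex, where `$1` corresponds to Python's `\1`, and the metric name below is hypothetical):

```python
import re

# Third capture/replacement pair from the sink configuration above:
#   capture: (.+)_[0-9]+_executor_(.+)
#   replace: __name__=$1_executor_$2   (strip the numeric application id)
EXECUTOR_NAME = re.compile(r"(.+)_[0-9]+_executor_(.+)")

def rename(metric: str) -> str:
    """Drop the per-application numeric id embedded in an executor metric name."""
    return EXECUTOR_NAME.sub(r"\1_executor_\2", metric)

# Hypothetical metric name, for illustration only:
print(rename("myapp_1621845600_executor_jvm_heap_used"))
# -> myapp_executor_jvm_heap_used
```

Names that do not contain the `_<digits>_executor_` pattern pass through unchanged, which is why the separate `metrics-exclude-regex` is still needed to drop metrics like the `CodeGenerator_*` ones.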
spark-default.conf
{code:java}
# Configured by Chef via recipe: ob-spark-hadoop::install_v3
#
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Log effective Spark configuration at startup on INFO level
spark.logConf true
spark.ui.port 4050

# spark-extras
#spark.driver.extraClassPath /opt/spark-extras
#spark.executor.extraClassPath /opt/spark-extras
spark.metrics.namespace ${spark.app.name}

# Enable event logs for HistoryServer
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///apps/spark
spark.eventLog.compress true
spark.history.fs.logDirectory hdfs:///apps/spark
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d

spark.master.rest.enabled true
spark.master spark://spark-apache-master-v3-test.service.consul:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.compress true
spark.shuffle.spill.compress true
spark.shuffle.service.enabled true
spark.executor.memory 1g

# Spark streaming tunings
spark.streaming.blockInterval 200ms
spark.streaming.kafka.maxRetries 2

# Cleanups
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval 3600

# Spark executors
spark.executor.logs.rolling.enableCompression true
spark.executor.logs.rolling.maxRetainedFiles 5
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 100000

# Spark HA configurations
spark.deploy.recoveryMode=ZOOKEEPER
test
spark.deploy.zookeeper.dir=/spark

# Spark Prometheus settings for executors
spark.ui.prometheus.enabled true
spark.executor.processTreeMetrics.enabled true
spark.local.dir=/outbrain/Prod/spark/c,/outbrain/Prod/spark/d
{code}
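Note that the file above mixes both property styles Spark accepts: whitespace-separated (`spark.ui.port 4050`) and `=`-separated (`spark.deploy.recoveryMode=ZOOKEEPER`). A small Python sketch of parsing such a file, useful for sanity-checking which settings are actually in effect (the function name and sample are illustrative, not from the report):

```python
def parse_spark_defaults(text: str) -> dict[str, str]:
    """Parse spark-defaults-style lines: 'key value' or 'key=value' pairs."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "=" in line and " " not in line.split("=", 1)[0]:
            key, value = line.split("=", 1)   # key=value style
        else:
            key, value = line.split(None, 1)  # key value style
        conf[key] = value.strip()
    return conf

sample = """\
spark.ui.port 4050
spark.ui.prometheus.enabled true
spark.deploy.recoveryMode=ZOOKEEPER
"""
print(parse_spark_defaults(sample)["spark.ui.port"])
# -> 4050
```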
> Spark Executor metrics are not produced/showed
> ----------------------------------------------
>
> Key: SPARK-35502
> URL: https://issues.apache.org/jira/browse/SPARK-35502
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 3.0.1
> Environment: the metrics.properties and spark-default.conf files shown above
>
> Reporter: Mati
> Priority: Major
>
>
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)