I'd like to use the SparkListenerInterface to capture some metrics for
monitoring/logging/metadata purposes. The first ones I'm interested in
hooking into are recordsWritten and bytesWritten, as a measure of
throughput. I'm using PySpark to write Parquet files from DataFrames.
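
For reference, here is roughly how I'm registering the listener from
PySpark. This is a sketch rather than a supported API: it goes through
py4j's callback server and PySpark's private _gateway/_jsc handles (which
can change between versions), and the JVM will log errors for any
SparkListenerInterface callbacks that aren't stubbed out - only onTaskEnd
is shown here.

    from pyspark.sql import SparkSession

    class TaskEndListener(object):
        # Invoked by the JVM through py4j's callback server whenever a
        # task finishes; other SparkListenerInterface callbacks would
        # need no-op stubs to keep the JVM-side logs quiet.
        def onTaskEnd(self, taskEnd):
            out = taskEnd.taskMetrics().outputMetrics()
            print("recordsWritten=%d bytesWritten=%d"
                  % (out.recordsWritten(), out.bytesWritten()))

        class Java:
            implements = ["org.apache.spark.scheduler.SparkListenerInterface"]

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    # The callback server is what lets the JVM call back into Python;
    # the exact setup varies a bit across py4j/PySpark versions.
    sc._gateway.start_callback_server()
    sc._jsc.sc().addSparkListener(TaskEndListener())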

I'm able to extract a rich set of metrics this way, but for some reason the
two that I want are always 0. This mirrors what I see in the Spark
Application Master, where the "# records written" field is always missing.

I've filed a JIRA already for this issue:
https://issues.apache.org/jira/browse/SPARK-22605

I _think_ how this works is that inside ResultTask.runTask, the
rdd.iterator call increments the bytes read and records read metrics via
RDD.getOrCompute. Where is the equivalent for the records written metrics?

These metrics are populated properly if I save the data as an RDD via
df.rdd.saveAsTextFile, so the code path exists somewhere. Any hints as to
where I should be looking?
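
A minimal repro of the two paths, with the listener above registered (the
output paths are just examples):

    df = spark.range(100000)

    # DataFrame Parquet writer: the listener reports
    # recordsWritten == 0 and bytesWritten == 0.
    df.write.mode("overwrite").parquet("/tmp/listener_parquet")

    # Same data through the RDD API: both metrics come through.
    # (saveAsTextFile fails if the target directory already exists.)
    df.rdd.saveAsTextFile("/tmp/listener_text")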


