snmvaughan opened a new pull request, #46188:
URL: https://github.com/apache/spark/pull/46188
We currently capture per-task write metrics: the number of files, bytes, and
rows written, along with the set of updated partitions.
This change additionally captures metrics for each updated partition, reporting
the partition sub-paths along with the number of files, bytes, and rows per
partition for each task.
### What changes were proposed in this pull request?
1. Update the `WriteTaskStatsTracker` implementation to associate a
partition with each file during writing, and to track the number of rows written
to each file. The final stats now include a map of partitions to their
associated partition stats.
2. Update the `WriteJobStatsTracker` implementation to capture the partition
sub-paths and to publish a new event to the listener bus. The processed stats
aggregate the statistics for each partition.
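
As a rough illustration of the two steps above, the sketch below models the per-partition bookkeeping with plain Scala collections. `PartitionStats`, `TaskStats`, and `mergeTaskStats` are hypothetical names chosen for this example and are not the classes or methods added by this PR; it only shows the general idea of each task reporting a map keyed by partition sub-path, which the job side then folds together.

```scala
// Hypothetical model of per-partition write statistics; these names are
// illustrative only and are not the classes introduced by this PR.
final case class PartitionStats(numFiles: Int, numBytes: Long, numRows: Long) {
  // Combine the stats that two tasks reported for the same partition.
  def merge(other: PartitionStats): PartitionStats =
    PartitionStats(
      numFiles + other.numFiles,
      numBytes + other.numBytes,
      numRows + other.numRows)
}

// What a single task might report: a map from partition sub-path to its stats.
final case class TaskStats(partitions: Map[String, PartitionStats])

object PartitionStatsSketch {
  // Job-side aggregation: fold every task's map into one entry per partition.
  def mergeTaskStats(tasks: Seq[TaskStats]): Map[String, PartitionStats] =
    tasks.foldLeft(Map.empty[String, PartitionStats]) { (acc, task) =>
      task.partitions.foldLeft(acc) { case (m, (path, stats)) =>
        m.updated(path, m.get(path).map(_.merge(stats)).getOrElse(stats))
      }
    }
}
```

In this sketch, two tasks that both wrote to a made-up sub-path such as `dt=2024-01-01` would end up as a single map entry whose file, byte, and row counts are the sums of what each task reported.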
### Why are the changes needed?
Per-partition statistics show how each task affects the individual partitions
of a dataset, which improves our understanding of the data being written.
### Does this PR introduce _any_ user-facing change?
Yes. Partition-level statistics are now accessible through a new event published to the listener bus.
### How was this patch tested?
In addition to the new unit tests, this was run in a Kubernetes environment
writing tables with differing partitioning strategies and validating the
reported stats. Unit tests using both `InsertIntoHadoopFsRelationCommand` and
`InsertIntoHiveTable` now verify partition stats when dynamic partitioning is
enabled. We also verified that the aggregated partition metrics matched the
existing metrics for number of files, bytes, and rows.
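
The consistency check described in the last sentence can be sketched as follows. This is a minimal illustration rather than the test code in this PR: `PartitionStats`, `JobTotals`, and `matchesJobMetrics` are hypothetical names, and the check assumes that every written file belongs to exactly one partition.

```scala
// Hypothetical types for illustration only; not the classes used by this PR.
final case class PartitionStats(numFiles: Int, numBytes: Long, numRows: Long)
final case class JobTotals(numFiles: Int, numBytes: Long, numRows: Long)

object PartitionStatsCheck {
  // Sum the per-partition stats into job-level totals.
  def totalsOf(partitions: Map[String, PartitionStats]): JobTotals =
    partitions.values.foldLeft(JobTotals(0, 0L, 0L)) { (t, p) =>
      JobTotals(t.numFiles + p.numFiles, t.numBytes + p.numBytes, t.numRows + p.numRows)
    }

  // True when the per-partition stats add up to the existing job-level metrics.
  def matchesJobMetrics(
      partitions: Map[String, PartitionStats],
      expected: JobTotals): Boolean =
    totalsOf(partitions) == expected
}
```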
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]