GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/18159
[SPARK-20703][SQL][WIP] Associate metrics with data writes onto
DataFrameWriter operations
## What changes were proposed in this pull request?
Since SPARK-20213, the UI shows the operations that write data out. However,
there is no way to associate metrics with those data writes. We should show
the relevant metrics on the operations.
### The Approach
We have several paths for writing data out through some commands.
#### File-based: `InsertIntoHadoopFsRelationCommand`, `InsertIntoHiveTable`
Both commands use `FileFormatWriter` to write out data files. This patch
records some metrics in `FileFormatWriter` and passes them to a callback
function that updates the metrics in `SparkPlan`. The recorded metrics are:
* number of written files
* number of dynamic partitions
* bytes of written files
* number of output rows
* writing data out time (ms)
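
The callback flow described above could be sketched roughly as follows. This is a minimal, Spark-free illustration and not the code in this patch: `WriteSummary`, `PlanMetrics`, `WriterSketch` and the metric names are hypothetical stand-ins. It assumes each write task returns a summary, and that dynamic partitions are counted as the union across tasks so a partition written by two tasks is counted once.

```scala
import scala.collection.mutable

// Hypothetical per-task summary of what one write task produced.
final case class WriteSummary(
    files: Int,
    partitions: Set[String],
    bytes: Long,
    rows: Long,
    writeTimeMs: Long)

// Hypothetical stand-in for the metrics owned by a physical plan node.
final class PlanMetrics {
  private val values = mutable.Map.empty[String, Long]
  def add(name: String, delta: Long): Unit =
    values.update(name, values.getOrElse(name, 0L) + delta)
  def apply(name: String): Long = values.getOrElse(name, 0L)
}

object WriterSketch {
  // Stand-in for the writer: run every task, then hand all summaries
  // to the callback supplied by the command.
  def write(
      tasks: Seq[() => WriteSummary],
      refresh: Seq[WriteSummary] => Unit): Unit = {
    val summaries = tasks.map(task => task())
    refresh(summaries)
  }

  // Builds the callback that folds task summaries into the plan metrics.
  def updateMetrics(metrics: PlanMetrics): Seq[WriteSummary] => Unit =
    summaries => {
      metrics.add("numFiles", summaries.map(_.files.toLong).sum)
      // Union of partitions across tasks (an assumption of this sketch).
      metrics.add("numParts", summaries.flatMap(_.partitions).distinct.size.toLong)
      metrics.add("numBytes", summaries.map(_.bytes).sum)
      metrics.add("numOutputRows", summaries.map(_.rows).sum)
      metrics.add("writeTimeMs", summaries.map(_.writeTimeMs).sum)
    }
}
```

In this sketch the command would call `WriterSketch.write(tasks, WriterSketch.updateMetrics(metrics))`, so the writer itself never needs to know which plan node owns the metrics.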
#### Other datasources: `InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`
For other datasource relations, the logic of writing data out is delegated
to the datasource implementations, e.g., `InsertableRelation.insert` and
`CreatableRelationProvider.createRelation`. These methods do not report any
write statistics back to the caller, so we can't obtain metrics from them
for now.
#### `CreateDataSourceTableAsSelectCommand` and `CreateHiveTableAsSelectCommand`
These two commands create and invoke other commands
(`InsertIntoHadoopFsRelationCommand`, `InsertIntoHiveTable`). Although we
support recording metrics for the invoked commands, the metrics are recorded
in the invoked `SparkPlan` rather than in the commands that invoke them. So
we can't show metrics for these two commands for now.
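
To illustrate the limitation, here is a small Spark-free sketch. `Node`, `InsertNode` and `CtasNode` are hypothetical stand-ins rather than Spark classes; the point is only that the inner node updates its own metrics, so the outer node that created it has nothing to display.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a plan node that owns its own metrics.
trait Node {
  val metrics: mutable.Map[String, Long] = mutable.Map.empty
  def run(rows: Seq[String]): Unit
}

// Plays the role of the invoked insert command: it records its own metrics.
final class InsertNode extends Node {
  def run(rows: Seq[String]): Unit =
    metrics.update("numOutputRows",
      metrics.getOrElse("numOutputRows", 0L) + rows.size.toLong)
}

// Plays the role of a create-table-as-select command: it creates and runs
// an inner node but never touches its own metrics.
final class CtasNode extends Node {
  // Kept only so the example can inspect the inner node afterwards.
  var lastInner: Option[InsertNode] = None

  def run(rows: Seq[String]): Unit = {
    val inner = new InsertNode
    inner.run(rows)
    lastInner = Some(inner)
  }
}
```

After `new CtasNode().run(...)`, the outer node's `metrics` map is still empty while the inner node holds the row count, which mirrors why the UI has nothing to show for the outer command.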
## How was this patch tested?
Updated unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-20703-2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18159.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18159
----
commit a6efa7380591498a000ee89eff8460e9e6463f9d
Author: Liang-Chi Hsieh
Date: 2017-05-31T08:03:08Z
Support to show data writing metrics.
----