GitHub user adrian-ionescu opened a pull request:

    https://github.com/apache/spark/pull/18884

    [SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs

    ## What changes were proposed in this pull request?
    
    This patch introduces an internal interface for tracking metrics and
    statistics on data on the fly, as it is being written to disk during a
    `FileFormatWriter` job, and partially reimplements SPARK-20703 in terms
    of it.
    
    The interface consists of three traits:
    - `WriteTaskStats`: a marker trait for classes that represent statistics
      collected during a `WriteTask`. Its only constraint is that the class
      be `Serializable`, since instances are collected on the driver from all
      executors at the end of the `WriteJob`.
    - `WriteTaskStatsTracker`: a trait for classes that compute statistics
      from the tuples processed by a given `WriteTask` and eventually produce
      a `WriteTaskStats` instance.
    - `WriteJobStatsTracker`: a trait for classes that hold the `Serializable`
      state needed to instantiate a `WriteTaskStatsTracker` on the executors,
      and that process the resulting collection of `WriteTaskStats` once it
      has been gathered back on the driver.
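
    The relationship between the three traits can be sketched roughly as
    follows. This is a simplified, standalone illustration, not the code in
    the patch: rows are modelled as `Seq[Any]` rather than Spark's
    `InternalRow`, the method names and signatures are assumptions, and the
    `RowCount*` classes and `WriteStatsDemo` are hypothetical.

```scala
// Simplified stand-ins for the three traits described above.
// Method names and signatures are illustrative assumptions.
trait WriteTaskStats extends Serializable

trait WriteTaskStatsTracker {
  def newRow(row: Seq[Any]): Unit      // called for each tuple the task writes
  def getFinalStats(): WriteTaskStats  // called once, when the task finishes
}

trait WriteJobStatsTracker extends Serializable {
  def newTaskInstance(): WriteTaskStatsTracker        // invoked on executors
  def processStats(stats: Seq[WriteTaskStats]): Unit  // invoked on the driver
}

// Hypothetical example: count the rows written by each task.
final case class RowCountStats(numRows: Long) extends WriteTaskStats

final class RowCountTaskTracker extends WriteTaskStatsTracker {
  private var numRows = 0L
  override def newRow(row: Seq[Any]): Unit = numRows += 1
  override def getFinalStats(): WriteTaskStats = RowCountStats(numRows)
}

final class RowCountJobTracker extends WriteJobStatsTracker {
  var totalRows: Long = 0L
  override def newTaskInstance(): WriteTaskStatsTracker = new RowCountTaskTracker
  override def processStats(stats: Seq[WriteTaskStats]): Unit =
    totalRows = stats.collect { case RowCountStats(n) => n }.sum
}

// Local simulation of a write job: each inner Seq plays the role of one task.
object WriteStatsDemo {
  def simulate(): Long = {
    val job = new RowCountJobTracker
    val tasks: Seq[Seq[Seq[Any]]] = Seq(
      Seq(Seq[Any](1, "a"), Seq[Any](2, "b")),
      Seq(Seq[Any](3, "c"))
    )
    val perTaskStats = tasks.map { rows =>
      val tracker = job.newTaskInstance()
      rows.foreach(tracker.newRow)
      tracker.getFinalStats()
    }
    job.processStats(perTaskStats)
    job.totalRows
  }
}
```

    In this sketch each task gets its own tracker, so no per-task state is
    shared. Only the small `WriteTaskStats` results travel back to the job
    tracker, which aggregates them in one place.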
    
    One potential future use of this interface is maintaining CBO
    (cost-based optimizer) statistics during `INSERT INTO table ...`
    operations.
    
    ## How was this patch tested?
    Existing tests for SPARK-20703 exercise the new code:
    `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`, etc.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adrian-ionescu/apache-spark write-stats-tracker-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18884.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18884
    
----
commit 67e333e7abfd96b8f80bb0a128088d70f995d864
Author: Adrian Ionescu <adr...@databricks.com>
Date:   2017-08-07T12:57:02Z

    initial

commit 176726e7139121d0ffc9d0817b256b831a8c4fc8
Author: Adrian Ionescu <adr...@databricks.com>
Date:   2017-08-07T14:22:24Z

    tests pass; missing docs

commit 6f402468f72fcbdacc680dcae0fafb9fd340ad9f
Author: Adrian Ionescu <adr...@databricks.com>
Date:   2017-08-07T19:14:49Z

    newPartition() takes InternalRow instead of String

commit e6ab459501d70180d53a41dff69bdc13157df5a5
Author: Adrian Ionescu <adr...@databricks.com>
Date:   2017-08-08T12:56:54Z

    bug fix + docs

commit 3665f2fb4331012a022e9ae70cbe3d480ab8dcd3
Author: Adrian Ionescu <adr...@databricks.com>
Date:   2017-08-08T14:51:36Z

    minor

----


