GitHub user zsxwing opened a pull request:
https://github.com/apache/spark/pull/7774
[SPARK-8862][SQL] Add basic instrumentation to each SparkPlan operator and add a new SQL tab
This PR includes the following changes:
### SPARK-8862: Add basic instrumentation to each SparkPlan operator
A SparkPlan can override `def accumulators: Map[String, Accumulator[_]]` to
expose the metrics that should be displayed in the UI. The UI uses them to
track the updates and shows them on the web page in real time.
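As a rough standalone illustration of this hook (the operator and metric names below are made up for the sketch, not taken from this PR):
```scala
import org.apache.spark.{Accumulator, SparkContext}

// Illustrative operator-like class: it registers a named accumulator and
// exposes it through the `accumulators` hook so a UI layer can poll it.
class MyOperator(sc: SparkContext) {
  private val numOutputRows: Accumulator[Long] =
    sc.accumulator(0L, "number of output rows")

  // The hook described above: metric name -> accumulator.
  def accumulators: Map[String, Accumulator[_]] =
    Map("numOutputRows" -> numOutputRows)

  // Each processed row bumps the accumulator, which the UI sees in real time.
  def processPartition(rows: Iterator[String]): Iterator[String] =
    rows.map { row => numOutputRows += 1L; row }
}
```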
### SparkSQLExecution and SQLSparkListener
`SparkSQLExecution.withNewExecution` will set `spark.sql.execution.id` in
the local properties so that we can use it to track all jobs that belong to the
same query.
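A rough sketch of this tagging mechanism in a Spark shell session (the UUID-based id and the `df` DataFrame are illustrative; the real implementation may assign ids differently):
```scala
// Local properties set on the driver thread are attached to every job that
// the thread submits, so a listener can later group those jobs by query.
val executionId = java.util.UUID.randomUUID().toString  // illustrative id
sc.setLocalProperty("spark.sql.execution.id", executionId)
try {
  df.collect()  // every job triggered here carries spark.sql.execution.id
} finally {
  sc.setLocalProperty("spark.sql.execution.id", null)  // clear for this thread
}
```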
`SQLSparkListener` is a listener that tracks all accumulator updates of all
tasks for a query. It receives them from heartbeats so that the UI can query
them in real time.
When a query starts running, `SQLSparkListener.onExecutionStart` will be
called; when it finishes, `SQLSparkListener.onExecutionEnd` will be called. The
Spark jobs carrying the same execution id are tracked and stored with that
query.
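A hedged sketch of how a listener can group jobs by that execution id (class and field names are made up; this is not the actual `SQLSparkListener`):
```scala
import java.util.Properties
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Illustrative listener: record which job ids belong to which execution id,
// read from the local properties attached to each submitted job.
class ExecutionGroupingListener extends SparkListener {
  private val jobsByExecution = mutable.Map.empty[String, Vector[Int]]

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = synchronized {
    val props: Properties = jobStart.properties
    val execId = if (props == null) null else props.getProperty("spark.sql.execution.id")
    if (execId != null) {
      jobsByExecution(execId) =
        jobsByExecution.getOrElse(execId, Vector.empty) :+ jobStart.jobId
    }
  }
}
```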
`SQLSparkListener` has to store the accumulator updates of each task
separately. When a task fails and is retried, we need to drop the old
accumulator updates. Because we cannot revert changes to an accumulator, we
have to maintain these accumulator updates ourselves so that we can discard
the updates from a failed task.
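A minimal sketch of that bookkeeping (the `recordUpdate` entry point is a placeholder for wherever the heartbeat updates actually arrive):
```scala
import scala.collection.mutable
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Illustrative per-task bookkeeping: keep the latest accumulator values per
// task id, overwriting rather than summing, so the partial updates from a
// failed attempt can simply be dropped before the retry reports new values.
class TaskAccumulatorListener extends SparkListener {
  // taskId -> (accumulatorId -> latest value)
  private val updatesByTask = mutable.Map.empty[Long, Map[Long, Any]]

  // Placeholder entry point; the real listener consumes heartbeat updates.
  def recordUpdate(taskId: Long, updates: Map[Long, Any]): Unit = synchronized {
    updatesByTask(taskId) = updates  // overwrite: values are cumulative per task
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    if (taskEnd.reason != Success) {
      updatesByTask.remove(taskEnd.taskInfo.taskId)  // discard failed attempt
    }
  }
}
```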
TODO
- [ ] Add unit tests for `SQLSparkListener`
### SPARK-8862: A new SQL tab
Includes two pages:
#### A page for all DataFrame/SQL queries
It shows the running, completed, and failed queries in three tables. Each row
also displays the jobs belonging to that query, with links to them.
#### A detail page for a DataFrame/SQL query
This page also shows the SparkPlan metrics in real time. Run a long-running
query, such as
```
// in spark-shell, where toDF() and $ come from sqlContext.implicits._
val testData = sc.parallelize((1 to 1000000).map(i => (i, i.toString))).toDF()
testData.select($"_1").filter($"_1" < 1000).foreach(_ => Thread.sleep(60))
```
and you will see the metrics keep updating in real time.
TODO
- [ ] Add unit tests for new SQL pages
This PR doesn't finish all of the work for SPARK-8862. I'm working on the
visualization of the physical plan and will send another PR for it.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zsxwing/spark sql-ui
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7774.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7774
----
commit 23abf73cafac3af0363486bdae91d737e235a197
Author: zsxwing <[email protected]>
Date: 2015-07-28T13:49:10Z
[SPARK-8862][SQL] Add basic instrumentation to each SparkPlan operator and add a new SQL tab
----