[ 
https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26221:
--------------------------------
    Description: 
This is an umbrella ticket for various small improvements for better metrics 
and instrumentation. Some thoughts:

 

Differentiate query plans that write data out vs. plans that return data to the 
driver
 * I.e. ETL & report generation vs. interactive analysis
 * This is related to the data sink item below. We need to be able to tell from 
the query plan what a query is doing
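
As a rough illustration of the distinction above, a plan root could be tagged by kind so the UI and metrics layer can tell the two apart. This is a hypothetical sketch; the class and function names are illustrative, not actual Spark classes.

```python
# Hypothetical sketch (not actual Spark classes): tag the root of a physical
# plan so we can tell ETL-style writes apart from interactive collects.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataSinkNode:       # writes data out (ETL / report generation)
    fmt: str

@dataclass
class CollectNode:        # returns rows to the driver (interactive analysis)
    limit: Optional[int] = None

def is_write_query(root) -> bool:
    return isinstance(root, DataSinkNode)
```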

Data sink: Have an operator for data sink, with metrics that can tell us:
 * Write time
 * Number of records written
 * Size of output written
 * Number of partitions modified
 * Metastore update time
 * Also track number of records for collect / limit
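
One way to carry the metrics listed above is a small per-task bundle that merges by addition into a per-query total. A minimal sketch with assumed field names (the actual operator and metric names would be decided during implementation):

```python
# Hypothetical metric bundle a data-sink operator could report; per-task
# metrics merge into a per-query total by simple addition.
from dataclasses import dataclass

@dataclass
class SinkMetrics:
    write_time_ms: int = 0
    records_written: int = 0
    bytes_written: int = 0
    partitions_modified: int = 0
    metastore_update_ms: int = 0

    def merge(self, other: "SinkMetrics") -> "SinkMetrics":
        return SinkMetrics(
            self.write_time_ms + other.write_time_ms,
            self.records_written + other.records_written,
            self.bytes_written + other.bytes_written,
            self.partitions_modified + other.partitions_modified,
            self.metastore_update_ms + other.metastore_update_ms,
        )
```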

Scan
 * Track file listing time (start and end so we can construct timeline, not 
just duration)
 * Track metastore operation time
 * Track IO decoding time for row-based input sources; we need to make sure the 
overhead is low
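
The "start and end, not just duration" point above could look like the following sketch: each phase keeps both timestamps so a timeline can be reconstructed afterwards. Names are illustrative only.

```python
# Sketch: record start and end timestamps (not just a duration) so a timeline
# of file-listing and metastore operations can be reconstructed afterwards.
import time
from dataclasses import dataclass

@dataclass
class TimedPhase:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

def timed(name, fn):
    """Run fn(), returning its result plus a TimedPhase with both endpoints."""
    start = time.monotonic() * 1000
    result = fn()
    end = time.monotonic() * 1000
    return result, TimedPhase(name, start, end)
```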

Shuffle
 * Track read time and write time
 * Decide if we can measure serialization and deserialization
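
On whether (de)serialization can be measured cheaply: one option worth evaluating is sampling, i.e. timing only every Nth record and scaling the sample back up, since a clock call per record may itself be too much overhead. A hypothetical sketch:

```python
# Sketch: time (de)serialization only on every Nth record, since a clock call
# per record may itself add too much overhead; scale the sampled total back up
# to estimate the full cost.
import time

class SampledTimer:
    def __init__(self, sample_every: int):
        self.sample_every = sample_every
        self.count = 0
        self.sampled_ns = 0

    def measure(self, fn):
        """Run fn(); on every Nth call, also record how long it took."""
        self.count += 1
        if self.count % self.sample_every == 0:
            t0 = time.perf_counter_ns()
            out = fn()
            self.sampled_ns += time.perf_counter_ns() - t0
            return out
        return fn()

    def estimated_total_ns(self) -> int:
        return self.sampled_ns * self.sample_every
```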

Client fetch time
 * Sometimes a query takes a long time to run because it is blocked on the 
client fetching results (e.g. using a result iterator). Record the time blocked 
on the client so we can exclude it when measuring query execution time.
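
The subtraction described above could be tracked with something like the following; the class and method names are illustrative, not an actual Spark API.

```python
# Sketch: accumulate the intervals during which the server is blocked on the
# client fetching results, then subtract them from wall-clock time to get a
# fairer measure of actual query execution time.
class ClientBlockedTracker:
    def __init__(self):
        self.blocked_ms = 0

    def record_blocked(self, start_ms: int, end_ms: int) -> None:
        self.blocked_ms += end_ms - start_ms

    def execution_time_ms(self, wall_clock_ms: int) -> int:
        return wall_clock_ms - self.blocked_ms
```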

Make it easy to correlate a query with the jobs, stages, and tasks belonging to 
it

Better logging:
 * Enable logging the query execution id and TID in executor logs, and query 
execution id in driver logs.
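
One plausible shape for such log lines is a common prefix carrying the query execution id (plus the TID on executors), so lines can be grepped back to a single query. A minimal sketch with assumed field names:

```python
# Sketch: a log-line prefix carrying the query execution id and, on executors,
# the task id (TID), so log lines can be correlated back to a single query.
# Field names are illustrative.
from typing import Optional

def log_prefix(query_execution_id: int, tid: Optional[int] = None) -> str:
    if tid is None:
        return f"[queryId={query_execution_id}]"             # driver-side lines
    return f"[queryId={query_execution_id}, TID={tid}]"      # executor-side lines
```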


> Improve Spark SQL instrumentation and metrics
> ---------------------------------------------
>
>                 Key: SPARK-26221
>                 URL: https://issues.apache.org/jira/browse/SPARK-26221
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
