[ 
https://issues.apache.org/jira/browse/IMPALA-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-6828:
----------------------------------
    Component/s: Catalog

> Expose more detailed info in profile for REFRESH
> ------------------------------------------------
>
>                 Key: IMPALA-6828
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6828
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: John Russell
>            Priority: Major
>              Labels: supportability
>
> Sometimes REFRESH is an expensive operation, and and I've heard anecdotally 
> that it is sometimes run more often than necessary. I.e. a developer has no 
> real way to know when it's worth doing a REFRESH, so they do it before every 
> query, or in a loop for every table, or every N minutes.
> I suggest beefing up the info displayed in the profile after a REFRESH is 
> done. The idea being that if there was quantifiable info about how expensive 
> the operation was, and whether a particular REFRESH was needed or was a 
> no-op, that information could be used by cluster management software - e.g. 
> the management software could have charts showing how much namenode load, 
> network traffic, etc. was due to REFRESH; show what percentage of REFRESH was 
> wasted effort because there was no new data; display a dashboard of tables 
> showing which ones needed to be REFRESHed and letting that be handled by an 
> administrator rather than hardcoded into SQL scripts or Impala applications.
> Here's what I get if I side-load some HDFS files into a table, REFRESH, and 
> look at the profile:
> {code}
> [localhost:21000] > select count(*) from t1;
> +----------+
> | count(*) |
> +----------+
> | 0        |
> +----------+
> [localhost:21000] > refresh t1;
> [localhost:21000] > summary;
> Summary not available
> [localhost:21000] > profile;
> Query Runtime Profile:
> Query (id=c0451cde0d827b96:e43381400000000):
>   Summary:
>     Session ID: 4f498d223ade18b1:893649710434fdb0
>     Session Type: BEESWAX
>     Start Time: 2018-04-09 15:17:48.116499000
>     End Time: 2018-04-09 15:17:48.209089000
>     Query Type: DDL
>     Query State: FINISHED
>     Query Status: OK
>     Impala Version: impalad version 2.7.0-cdh5.9.0 RELEASE (build 
> 4b4cf1936bd6cdf34fda5e2f32827e7d60c07a9c)
>     User: jrussell
>     Connected User: jrussell
>     Delegated User: 
>     Network Address: ::1:44170
>     Default Db: scratch
>     Sql Statement: refresh t1
>     Coordinator: mumblemumble:22000
>     Query Options (non default): 
>     DDL Type: RESET_METADATA
>     : 0.000ns
>     Query Timeline: 94.081ms
>        - Start execution: 278.360us (278.360us)
>        - Planning finished: 1.209ms (931.528us)
>        - Request finished: 90.333ms (89.123ms)
>        - First row fetched: 91.762ms (1.429ms)
>        - Unregister query: 92.510ms (747.664us)
>   ImpalaServer:
>      - CatalogOpExecTimer: 79.113ms
>      - ClientFetchWaitTimer: 2.125ms
>      - RowMaterializationTimer: 0.000ns
> [localhost:21000] > select count(*) from t1;
> +----------+
> | count(*) |
> +----------+
> | 16777216 |
> +----------+
> {code}
> Instead of just the times for phases of execution, I suggest:
> * How much memory used on each host.
> * How many requests sent to namenode and/or Impala executors.
> * Overall network traffic for the whole operation.
> * Any other statistics about significant resources involved in doing REFRESH.
> * Number of new, changed, or removed files detected. For the whole table, and 
> per partition.
> * Some obvious indication if the operation was a complete no-op, or if it was 
> a no-op for N out of M partitions. I.e. to provide feedback that the 
> operation wasn't needed at all, or could have been targeted better at only 
> changed partitions.
> cc: [~alex.behm], [~dtsirogiannis]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to