John Russell created IMPALA-6828:

             Summary: Expose more detailed info in profile for REFRESH
                 Key: IMPALA-6828
             Project: IMPALA
          Issue Type: Improvement
            Reporter: John Russell

Sometimes REFRESH is an expensive operation, and and I've heard anecdotally 
that it is sometimes run more often than necessary. I.e. a developer has no 
real way to know when it's worth doing a REFRESH, so they do it before every 
query, or in a loop for every table, or every N minutes.

I suggest beefing up the info displayed in the profile after a REFRESH is done. 
The idea being that if there was quantifiable info about how expensive the 
operation was, and whether a particular REFRESH was needed or was a no-op, that 
information could be used by cluster management software - e.g. the management 
software could have charts showing how much namenode load, network traffic, 
etc. was due to REFRESH; show what percentage of REFRESH was wasted effort 
because there was no new data; display a dashboard of tables showing which ones 
needed to be REFRESHed and letting that be handled by an administrator rather 
than hardcoded into SQL scripts or Impala applications.

Here's what I get if I side-load some HDFS files into a table, REFRESH, and 
look at the profile:

[localhost:21000] > select count(*) from t1;
| count(*) |
| 0        |
[localhost:21000] > refresh t1;

[localhost:21000] > summary;
Summary not available
[localhost:21000] > profile;
Query Runtime Profile:
Query (id=c0451cde0d827b96:e43381400000000):
    Session ID: 4f498d223ade18b1:893649710434fdb0
    Session Type: BEESWAX
    Start Time: 2018-04-09 15:17:48.116499000
    End Time: 2018-04-09 15:17:48.209089000
    Query Type: DDL
    Query State: FINISHED
    Query Status: OK
    Impala Version: impalad version 2.7.0-cdh5.9.0 RELEASE (build 
    User: jrussell
    Connected User: jrussell
    Delegated User: 
    Network Address: ::1:44170
    Default Db: scratch
    Sql Statement: refresh t1
    Coordinator: mumblemumble:22000
    Query Options (non default): 
    : 0.000ns
    Query Timeline: 94.081ms
       - Start execution: 278.360us (278.360us)
       - Planning finished: 1.209ms (931.528us)
       - Request finished: 90.333ms (89.123ms)
       - First row fetched: 91.762ms (1.429ms)
       - Unregister query: 92.510ms (747.664us)
     - CatalogOpExecTimer: 79.113ms
     - ClientFetchWaitTimer: 2.125ms
     - RowMaterializationTimer: 0.000ns

[localhost:21000] > select count(*) from t1;
| count(*) |
| 16777216 |

Instead of just the times for phases of execution, I suggest:

* How much memory used on each host.
* How many requests sent to namenode and/or Impala executors.
* Overall network traffic for the whole operation.
* Any other statistics about significant resources involved in doing REFRESH.
* Number of new, changed, or removed files detected. For the whole table, and 
per partition.
* Some obvious indication if the operation was a complete no-op, or if it was a 
no-op for N out of M partitions. I.e. to provide feedback that the operation 
wasn't needed at all, or could have been targeted better at only changed 

cc: [~alex.behm], [~dtsirogiannis]

This message was sent by Atlassian JIRA

Reply via email to