John Russell created IMPALA-6828: ------------------------------------ Summary: Expose more detailed info in profile for REFRESH Key: IMPALA-6828 URL: https://issues.apache.org/jira/browse/IMPALA-6828 Project: IMPALA Issue Type: Improvement Reporter: John Russell
Sometimes REFRESH is an expensive operation, and and I've heard anecdotally that it is sometimes run more often than necessary. I.e. a developer has no real way to know when it's worth doing a REFRESH, so they do it before every query, or in a loop for every table, or every N minutes. I suggest beefing up the info displayed in the profile after a REFRESH is done. The idea being that if there was quantifiable info about how expensive the operation was, and whether a particular REFRESH was needed or was a no-op, that information could be used by cluster management software - e.g. the management software could have charts showing how much namenode load, network traffic, etc. was due to REFRESH; show what percentage of REFRESH was wasted effort because there was no new data; display a dashboard of tables showing which ones needed to be REFRESHed and letting that be handled by an administrator rather than hardcoded into SQL scripts or Impala applications. Here's what I get if I side-load some HDFS files into a table, REFRESH, and look at the profile: {code} [localhost:21000] > select count(*) from t1; +----------+ | count(*) | +----------+ | 0 | +----------+ [localhost:21000] > refresh t1; [localhost:21000] > summary; Summary not available [localhost:21000] > profile; Query Runtime Profile: Query (id=c0451cde0d827b96:e43381400000000): Summary: Session ID: 4f498d223ade18b1:893649710434fdb0 Session Type: BEESWAX Start Time: 2018-04-09 15:17:48.116499000 End Time: 2018-04-09 15:17:48.209089000 Query Type: DDL Query State: FINISHED Query Status: OK Impala Version: impalad version 2.7.0-cdh5.9.0 RELEASE (build 4b4cf1936bd6cdf34fda5e2f32827e7d60c07a9c) User: jrussell Connected User: jrussell Delegated User: Network Address: ::1:44170 Default Db: scratch Sql Statement: refresh t1 Coordinator: mumblemumble:22000 Query Options (non default): DDL Type: RESET_METADATA : 0.000ns Query Timeline: 94.081ms - Start execution: 278.360us (278.360us) - Planning finished: 1.209ms (931.528us) - Request finished: 90.333ms (89.123ms) - First row fetched: 91.762ms (1.429ms) - Unregister query: 92.510ms (747.664us) ImpalaServer: - CatalogOpExecTimer: 79.113ms - ClientFetchWaitTimer: 2.125ms - RowMaterializationTimer: 0.000ns [localhost:21000] > select count(*) from t1; +----------+ | count(*) | +----------+ | 16777216 | +----------+ {code} Instead of just the times for phases of execution, I suggest: * How much memory used on each host. * How many requests sent to namenode and/or Impala executors. * Overall network traffic for the whole operation. * Any other statistics about significant resources involved in doing REFRESH. * Number of new, changed, or removed files detected. For the whole table, and per partition. * Some obvious indication if the operation was a complete no-op, or if it was a no-op for N out of M partitions. I.e. to provide feedback that the operation wasn't needed at all, or could have been targeted better at only changed partitions. cc: [~alex.behm], [~dtsirogiannis] -- This message was sent by Atlassian JIRA (v7.6.3#76005)