Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/7999 )
Change subject: [DOCS] Tighten up advice about first COMPUTE INCREMENTAL STATS
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7999/1/docs/shared/impala_common.xml
File docs/shared/impala_common.xml:

http://gerrit.cloudera.org:8080/#/c/7999/1/docs/shared/impala_common.xml@1226
PS1, Line 1226: and the statistics are computed again from the beginning. Therefore, expect a one-time
from scratch


http://gerrit.cloudera.org:8080/#/c/7999/1/docs/shared/impala_common.xml@1241
PS1, Line 1241: -- by -1 under #Rows and false under Incremental stats.
I suggest you leave out the "-1 under #Rows" part since it may be confusing. The reason is that DROP INCREMENTAL STATS will *not* modify the #Rows.

Here's how you can think about incremental stats:
COMPUTE INCREMENTAL STATS populates the same "regular" stats that COMPUTE STATS does, such as the #Rows and the column NDVs, and in addition stores "incremental stats" to speed up the next COMPUTE INCREMENTAL STATS. So the "incremental" part is really this extra information, which you can drop separately from the "regular" stats.

One nice thing is that you can safely DROP INCREMENTAL STATS everywhere to reduce the size of the table metadata without impacting query plans, because the "regular" stats are preserved.


http://gerrit.cloudera.org:8080/#/c/7999/1/docs/topics/impala_partitioning.xml
File docs/topics/impala_partitioning.xml:

http://gerrit.cloudera.org:8080/#/c/7999/1/docs/topics/impala_partitioning.xml@611
PS1, Line 611: Because the <codeph>COMPUTE STATS</codeph> statement can be resource-intensive to run frequently
This advice isn't prescriptive enough for my taste. We should state very clearly that you should use either COMPUTE STATS xor COMPUTE INCREMENTAL STATS, but never both.

Switching during the lifetime of a table is *not* recommended, but if you really must do so, then we recommend that you first drop all stats before the switch (using DROP STATS and DROP INCREMENTAL STATS).


http://gerrit.cloudera.org:8080/#/c/7999/1/docs/topics/impala_partitioning.xml@613
PS1, Line 613: that is optimized for processing partitioned tables.
I wouldn't say that incremental stats are "optimized" for partitioned tables. Foremost, incremental stats allow you to compute stats in a partition-by-partition fashion, which might be a better fit for a user's data ingestion pattern.

However, we should be very clear about the cost of incremental stats. Incremental stats need ~400 bytes per column per partition in the table metadata (which gets disseminated and cached everywhere), so incremental stats are not a good fit for tables with a huge number of columns and partitions.

If you have a partitioned table and only a few of the partitions are "active", then you can compute incremental stats for new partitions coming in and drop incremental stats for those partitions being "phased out" to limit your exposure to the metadata size problems.

You can even state that the huge table metadata can crash the catalog and/or impalads due to the Java 2GB array size limit. (We're working on fixing that.)

Basically, I want to be sure that users understand the cost of incremental stats and the impact (a crash) when they go overboard with incremental stats. There is no graceful degradation here.
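
To put rough numbers on the cost described above, here is a back-of-the-envelope sketch. It only multiplies out the "~400 bytes per column per partition" figure and compares the result with the approximate 2GB Java array limit. The function names and the sample table shapes are made up for illustration, and the real metadata footprint depends on what else the catalog stores for the table, so treat the output as an order-of-magnitude estimate rather than a prediction.

```python
# Back-of-the-envelope estimate of incremental stats metadata size.
# The 400-byte figure is the approximate per-column, per-partition cost
# quoted in the review comment; it is not an exact constant.
BYTES_PER_COLUMN_PER_PARTITION = 400

# Approximate upper bound on a Java byte array (about 2GB). Used here only
# as a rough yardstick; the real failure point may differ.
JAVA_MAX_ARRAY_BYTES = 2**31 - 1


def incremental_stats_bytes(num_columns, num_partitions):
    """Estimated bytes of incremental stats for one table (hypothetical helper)."""
    return BYTES_PER_COLUMN_PER_PARTITION * num_columns * num_partitions


def fraction_of_array_limit(num_columns, num_partitions):
    """Estimated size as a fraction of the ~2GB limit (hypothetical helper)."""
    return incremental_stats_bytes(num_columns, num_partitions) / JAVA_MAX_ARRAY_BYTES


if __name__ == "__main__":
    # Sample table shapes, chosen only for illustration.
    for columns, partitions in [(50, 1_000), (500, 10_000), (1_000, 20_000)]:
        size = incremental_stats_bytes(columns, partitions)
        share = fraction_of_array_limit(columns, partitions)
        print(f"{columns} columns x {partitions} partitions: "
              f"{size / 1e6:,.0f} MB ({share:.0%} of the ~2GB limit)")
```

Under these assumptions, a table with 50 columns and 1,000 partitions carries about 20 MB of incremental stats, while 500 columns and 10,000 partitions already comes to about 2 GB. That is why dropping incremental stats for partitions that are no longer active matters on wide, heavily partitioned tables.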
--
To view, visit http://gerrit.cloudera.org:8080/7999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia53a6518ce5541e5c9a2cd896856ce042a599b03
Gerrit-Change-Number: 7999
Gerrit-PatchSet: 1
Gerrit-Owner: John Russell <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Greg Rahn <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Silvius Rus <[email protected]>
Gerrit-Comment-Date: Fri, 06 Oct 2017 04:20:20 +0000
Gerrit-HasComments: Yes
