Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7999 )

Change subject: [DOCS] Tighten up advice about first COMPUTE INCREMENTAL STATS
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7999/1/docs/shared/impala_common.xml
File docs/shared/impala_common.xml:

http://gerrit.cloudera.org:8080/#/c/7999/1/docs/shared/impala_common.xml@1226
PS1, Line 1226:         and the statistics are computed again from the 
beginning. Therefore, expect a one-time
from scratch


http://gerrit.cloudera.org:8080/#/c/7999/1/docs/shared/impala_common.xml@1241
PS1, Line 1241: -- by -1 under #Rows and false under Incremental stats.
I suggest you leave out the -1 under #Rows part since that may be confusing. 
The reason is that DROP INCREMENTAL STATS will *not* modify the #Rows.

Here's how you can think about incremental stats:
COMPUTE INCREMENTAL STATS populates the same "regular" stats that COMPUTE 
STATS does, such as the #rows and column NDVs, but in addition it stores 
"incremental stats" to speed up the next COMPUTE INCREMENTAL STATS. So the 
"incremental" part is really this extra information, which you can drop 
separately from the "regular" stats.

One nice thing is that you can safely DROP INCREMENTAL STATS everywhere to 
reduce the size of table metadata without impacting query plans because the 
"regular" stats are preserved.

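To make the distinction concrete, here is a toy in-memory model in Python. It 
is only a sketch of the behavior described above, not Impala's catalog code: 
the class, field, and function names are hypothetical, and the -1 default for 
the row count simply mirrors the "-1 under #Rows" convention quoted from the 
patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionStats:
    """Toy model of the stats kept for one partition (hypothetical names)."""
    num_rows: int = -1                        # "regular" stat; -1 means not computed
    incremental_blob: Optional[bytes] = None  # extra data reused by the next incremental run

    @property
    def has_incremental(self) -> bool:
        return self.incremental_blob is not None


def compute_incremental_stats(p: PartitionStats, row_count: int) -> None:
    # Populates the "regular" stats AND stores the extra incremental information.
    p.num_rows = row_count
    p.incremental_blob = b"intermediate-state"  # placeholder payload


def drop_incremental_stats(p: PartitionStats) -> None:
    # Removes only the extra information; the "regular" stats are left untouched.
    p.incremental_blob = None


if __name__ == "__main__":
    part = PartitionStats()
    compute_incremental_stats(part, 1000)
    drop_incremental_stats(part)
    print(part.num_rows, part.has_incremental)
```

In this model the row count survives the drop, which is why mentioning 
"-1 under #Rows" after DROP INCREMENTAL STATS would be misleading.
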

http://gerrit.cloudera.org:8080/#/c/7999/1/docs/topics/impala_partitioning.xml
File docs/topics/impala_partitioning.xml:

http://gerrit.cloudera.org:8080/#/c/7999/1/docs/topics/impala_partitioning.xml@611
PS1, Line 611:         Because the <codeph>COMPUTE STATS</codeph> statement can 
be resource-intensive to run frequently
This advice isn't prescriptive enough for my taste. We should state very 
clearly that you should use either COMPUTE STATS or COMPUTE INCREMENTAL STATS 
on a given table, but never both. Switching during the lifetime of a table is 
*not* recommended, but if you really must do so, then we recommend that you 
first drop all stats (using DROP STATS and DROP INCREMENTAL STATS) before the 
switch.
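
As a rough illustration of the "drop everything first, then switch" order, the 
Python sketch below only builds the list of statements to run; it does not 
connect to Impala. The helper name, the table name "sales", and the partition 
specs are hypothetical, and the per-partition form of DROP INCREMENTAL STATS 
is assumed from the statement's documented PARTITION clause.

```python
from typing import List


def statements_to_switch_to_incremental(table: str, partition_specs: List[str]) -> List[str]:
    """Return the statements, in order, for moving a table to incremental stats."""
    stmts = []
    # Drop any per-partition incremental stats left over from earlier runs.
    for spec in partition_specs:
        stmts.append(f"DROP INCREMENTAL STATS {table} PARTITION ({spec});")
    # Drop the "regular" table and column stats.
    stmts.append(f"DROP STATS {table};")
    # Only now start using the other flavor of the statement.
    stmts.append(f"COMPUTE INCREMENTAL STATS {table};")
    return stmts


if __name__ == "__main__":
    for stmt in statements_to_switch_to_incremental("sales", ["year=2016", "year=2017"]):
        print(stmt)
```

Expect the first COMPUTE INCREMENTAL STATS after the drop to scan the whole 
table, since nothing is left to reuse.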


http://gerrit.cloudera.org:8080/#/c/7999/1/docs/topics/impala_partitioning.xml@613
PS1, Line 613:         that is optimized for processing partitioned tables.
I wouldn't say that incremental stats are "optimized" for partitioned tables. 
First and foremost, incremental stats allow you to compute stats in a 
partition-by-partition fashion, which might be a better fit for a user's data 
ingestion pattern. However, we should be very clear about the cost of 
incremental stats. Incremental stats need ~400 bytes per column per partition 
in the table metadata (which gets disseminated and cached everywhere), so 
incremental stats are not a good fit for tables with a huge number of columns 
and partitions. If you have a partitioned table and only a few of the 
partitions are "active", then you can compute incremental stats for new 
partitions coming in and drop incremental stats for partitions that have been 
"phased out" to limit your exposure to the metadata size problems.

You can even state that huge table metadata can crash the catalog and/or 
impalads due to the Java 2GB array size limit. (We're working on fixing that.)
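
A back-of-the-envelope calculation shows how quickly this adds up. The Python 
sketch below multiplies the ~400 bytes figure by the column and partition 
counts and compares the result with 2**31 - 1 bytes. Treating that value as 
the ceiling for the total incremental stats size is an assumption made for 
illustration; the real serialized metadata contains more than incremental 
stats, and the function names are hypothetical.

```python
# Approximate per-entry cost quoted in the comment above.
BYTES_PER_COLUMN_PER_PARTITION = 400
# Maximum Java array length, used here as a rough byte ceiling (assumption).
JAVA_ARRAY_LIMIT_BYTES = 2**31 - 1


def incremental_stats_bytes(num_columns: int, num_partitions: int) -> int:
    """Estimated metadata size of incremental stats for one table."""
    return num_columns * num_partitions * BYTES_PER_COLUMN_PER_PARTITION


def max_partitions_with_incremental_stats(num_columns: int) -> int:
    """Partitions that fit under the ceiling for a given column count."""
    return JAVA_ARRAY_LIMIT_BYTES // (num_columns * BYTES_PER_COLUMN_PER_PARTITION)


if __name__ == "__main__":
    for cols, parts in [(100, 10_000), (500, 20_000)]:
        size = incremental_stats_bytes(cols, parts)
        over = size > JAVA_ARRAY_LIMIT_BYTES
        print(f"{cols} columns x {parts} partitions: ~{size / 1e6:.0f} MB, over limit: {over}")
```

Under these assumptions, 100 columns and 10,000 partitions come to about 
400 MB, while 500 columns and 20,000 partitions come to about 4 GB, well past 
the limit.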

Basically, I want to be sure that users understand the cost of incremental 
stats and the impact (a crash) when they go overboard with them. There is no 
graceful degradation here.



--
To view, visit http://gerrit.cloudera.org:8080/7999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia53a6518ce5541e5c9a2cd896856ce042a599b03
Gerrit-Change-Number: 7999
Gerrit-PatchSet: 1
Gerrit-Owner: John Russell <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Greg Rahn <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Silvius Rus <[email protected]>
Gerrit-Comment-Date: Fri, 06 Oct 2017 04:20:20 +0000
Gerrit-HasComments: Yes
