[Impala-ASF-CR] IMPALA-5310: Add COMPUTE STATS TABLESAMPLE.

Dimitris Tsirogiannis (Code Review) Tue, 26 Sep 2017 16:49:12 -0700

Dimitris Tsirogiannis has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8136 )


Change subject: IMPALA-5310: Add COMPUTE STATS TABLESAMPLE.
......................................................................


Patch Set 1:

(26 comments)

First pass. High level approach seems reasonable to me. Will take a look at the 
tests next.

http://gerrit.cloudera.org:8080/#/c/8136/1/be/src/exec/catalog-op-executor.cc
File be/src/exec/catalog-op-executor.cc:

http://gerrit.cloudera.org:8080/#/c/8136/1/be/src/exec/catalog-op-executor.cc@218
PS1, Line 218: // The first column is the COUNT(*) expr of the original query.
Can you put a similar comment above L206. I had a hard time figuring out what 
rows[0].colVals[0] represented in L208.


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@58
PS1, Line 58: Existing partition-level row counts are not modified.
Mention what happens to the partitions that don't have the row count set.


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@60
PS1, Line 60: unmodified
remove


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@61
PS1, Line 61: extrapolated
Maybe add some details about how extrapolation is computed. I haven't looked at 
the entire patch so you may mention it elsewhere. If so, you can just point to 
the file that provides that description.


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@62
PS1, Line 62:  so as not to confuse other engines like Hive/SparkSQL which may 
rely on
            :  *   the shared HMS fields being accurate.
That part may be confusing to someone with no background knowledge. For 
instance, the fact that other engines require "accurate" statistics which we 
update using sampling and extrapolation :), kind of oxymoron. I think it is ok 
to just say that we store the extrapolated stats and remove that part. Thoughts?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@64
PS1, Line 64: Independently, row-count extrapolation is also done during 
planning based on the
            :  *   numRows / totalSize ratio because the table contents may 
have changed since the
            :  *   last time COMPUTE STATS was run.
Remove and put it near the relevant code (planning).


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@77
PS1, Line 77: Not supported for now to keep the logic/code simple.
Ha, this is the opposite in the COMPUTE STATS case.


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@106
PS1, Line 106: totalFileBytes_
Can't you use the HdfsTable.totalFileBytes_ instead?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@108
PS1, Line 108: Only set when
             :   // TABLESAMPLE was specified. Set to -1 for non-HDFS tables
Maybe "Set to -1 for non-HDFS tables or when TABLESAMPLE is not specified"?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@305
PS1, Line 305: expectAllPartitions_ = !updateTableStatsOnly()
So if I do not update table stats only I should expect to compute stats on all 
partitions? Essentially, is isIncremental check blended in this function? I 
think it may make sense to extract that check.


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@344
PS1, Line 344: expectAllPartitions_ = false;
Why is this needed? Isn't L305 sufficient?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@440
PS1, Line 440: // Only add group by clause for HdfsTables.
Comment doesn't seem relevant to the code that follows.


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@442
PS1, Line 442: !updateTableStatsOnly()
expectAllPartitions_?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@452
PS1, Line 452: !updateTableStatsOnly()
expectAllPartitions_?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@506
PS1, Line 506: sets 'expectAllPartitions_' to false
Hm, I don't see that in the code.


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@511
PS1, Line 511: base
remove?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@34
PS1, Line 34: import org.apache.impala.service.BackendConfig;
I don't think this is needed.


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1877
PS1, Line 1877:   public Map<Long, List<FileDescriptor>> 
getFilesSample(Collection<HdfsPartition> inputParts,
nit: long line


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@184
PS1, Line 184: partition
nit: partitions


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@185
PS1, Line 185: numPartitionsWithNumRows
nit: alternatively numPartitionsWithRowCount_


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@691
PS1, Line 691: trace
I wonder if trace is the right logging level. Maybe info? It seems kind of 
important to log this and don't think it will result in too many log messages. 
Thoughts?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@699
PS1, Line 699: try {
Do we need the nested try here?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@735
PS1, Line 735: numUpdatedColumns.setRef(Long.valueOf(0));
             :     if (colStats != null) {
             :       
numUpdatedColumns.setRef(Long.valueOf(colStats.getStatsObjSize()));
             :     }
Move after L705?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@857
PS1, Line 857: sampleFileBytes
Could that be 0?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@862
PS1, Line 862: Long.MAX_VALUE
Hm, that's a really big number :) Are you sure about that?


http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@866
PS1, Line 866:   private static ColumnStatisticsData 
createHiveColStatsData(TAlterTableUpdateStatsParams params,
nit: long line



--
To view, visit http://gerrit.cloudera.org:8080/8136
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7
Gerrit-Change-Number: 8136
Gerrit-PatchSet: 1
Gerrit-Owner: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Sep 2017 23:48:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Add COMPUTE STATS TABLESAMPLE.

Reply via email to