Dimitris Tsirogiannis has posted comments on this change. ( http://gerrit.cloudera.org:8080/8136 )
Change subject: IMPALA-5310: Add COMPUTE STATS TABLESAMPLE. ...................................................................... Patch Set 1: (26 comments) First pass. High level approach seems reasonable to me. Will take a look at the tests next. http://gerrit.cloudera.org:8080/#/c/8136/1/be/src/exec/catalog-op-executor.cc File be/src/exec/catalog-op-executor.cc: http://gerrit.cloudera.org:8080/#/c/8136/1/be/src/exec/catalog-op-executor.cc@218 PS1, Line 218: // The first column is the COUNT(*) expr of the original query. Can you put a similar comment above L206. I had a hard time figuring out what rows[0].colVals[0] represented in L208. http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@58 PS1, Line 58: Existing partition-level row counts are not modified. Mention what happens to the partitions that don't have the row count set. http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@60 PS1, Line 60: unmodified remove http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@61 PS1, Line 61: extrapolated Maybe add some details about how extrapolation is computed. I haven't looked at the entire patch so you may mention it elsewhere. If so, you can just point to the file that provides that description. http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@62 PS1, Line 62: so as not to confuse other engines like Hive/SparkSQL which may rely on : * the shared HMS fields being accurate. That part may be confusing to someone with no background knowledge. For instance, the fact that other engines require "accurate" statistics which we update using sampling and extrapolation :), kind of oxymoron. I think it is ok to just say that we store the extrapolated stats and remove that part. Thoughts? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@64 PS1, Line 64: Independently, row-count extrapolation is also done during planning based on the : * numRows / totalSize ratio because the table contents may have changed since the : * last time COMPUTE STATS was run. Remove and put it near the relevant code (planning). http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@77 PS1, Line 77: Not supported for now to keep the logic/code simple. Ha, this is the opposite in the COMPUTE STATS case. http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@106 PS1, Line 106: totalFileBytes_ Can't you use the HdfsTable.totalFileBytes_ instead? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@108 PS1, Line 108: Only set when : // TABLESAMPLE was specified. Set to -1 for non-HDFS tables Maybe "Set to -1 for non-HDFS tables or when TABLESAMPLE is not specified"? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@305 PS1, Line 305: expectAllPartitions_ = !updateTableStatsOnly() So if I do not update table stats only I should expect to compute stats on all partitions? Essentially, is isIncremental check blended in this function? I think it may make sense to extract that check. http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@344 PS1, Line 344: expectAllPartitions_ = false; Why is this needed? Isn't L305 sufficient? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@440 PS1, Line 440: // Only add group by clause for HdfsTables. Comment doesn't seem relevant to the code that follows. http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@442 PS1, Line 442: !updateTableStatsOnly() expectAllPartitions_? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@452 PS1, Line 452: !updateTableStatsOnly() expectAllPartitions_? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@506 PS1, Line 506: sets 'expectAllPartitions_' to false Hm, I don't see that in the code. http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@511 PS1, Line 511: base remove? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java: http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@34 PS1, Line 34: import org.apache.impala.service.BackendConfig; I don't think this is needed. http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1877 PS1, Line 1877: public Map<Long, List<FileDescriptor>> getFilesSample(Collection<HdfsPartition> inputParts, nit: long line http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@184 PS1, Line 184: partition nit: partitions http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@185 PS1, Line 185: numPartitionsWithNumRows nit: alternatively numPartitionsWithRowCount_ http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@691 PS1, Line 691: trace I wonder if trace is the right logging level. Maybe info? It seems kind of important to log this and don't think it will result in too many log messages. Thoughts? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@699 PS1, Line 699: try { Do we need the nested try here? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@735 PS1, Line 735: numUpdatedColumns.setRef(Long.valueOf(0)); : if (colStats != null) { : numUpdatedColumns.setRef(Long.valueOf(colStats.getStatsObjSize())); : } Move after L705? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@857 PS1, Line 857: sampleFileBytes Could that be 0? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@862 PS1, Line 862: Long.MAX_VALUE Hm, that's a really big number :) Are you sure about that? http://gerrit.cloudera.org:8080/#/c/8136/1/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@866 PS1, Line 866: private static ColumnStatisticsData createHiveColStatsData(TAlterTableUpdateStatsParams params, nit: long line -- To view, visit http://gerrit.cloudera.org:8080/8136 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7 Gerrit-Change-Number: 8136 Gerrit-PatchSet: 1 Gerrit-Owner: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Comment-Date: Tue, 26 Sep 2017 23:48:52 +0000 Gerrit-HasComments: Yes