Alex Behm has posted comments on this change.

Change subject: IMPALA-2373: Extrapolate row counts for HDFS tables.
......................................................................


Patch Set 1:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/6840/1/be/src/common/global-flags.cc
File be/src/common/global-flags.cc:

PS1, Line 137: tables
> What does this mean? I thought we only extrapolate row count of partitions.
Removed the "table" part. I had it there because of unpartitioned tables for 
which we do extrapolate.


http://gerrit.cloudera.org:8080/#/c/6840/1/be/src/util/backend-gflag-util.cc
File be/src/util/backend-gflag-util.cc:

PS1, Line 36: DECLARE_bool
> nit: they seem to be grouped by type
Done


http://gerrit.cloudera.org:8080/#/c/6840/1/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

PS1, Line 140: HDFS
> How about S3? Does it make sense to rename the total_hdfs_bytes into total_
1. We have this problem everywhere (e.g. HdfsScanNode, HdfsTable, 
hdfs-scanner.cc, etc.). These are really "FilesystemTables" or something like 
that.
2. It may be inaccurate, but consistent, i.e., I really mean this is only set 
for TTableType HDFS_TABLE.
3. Adding inconsistency seems worse to me. It may mislead readers to think it 
can be set for HBase, Kudu and whatnot.

Renamed to total_file_bytes and modified comment.


http://gerrit.cloudera.org:8080/#/c/6840/1/common/thrift/JniCatalog.thrift
File common/thrift/JniCatalog.thrift:

PS1, Line 493: !is_incremental
> nit: blending code in comments :) "is_incremental is false"
Done


Line 494:   9: optional i64 total_hdfs_bytes
> why is this a parameter/an input of compute stats?
We need to pass this value from the impalad FE to the impala BE and then to the 
catalogd. We currently set this value in ComputeStatsStmt.toThrift() which 
seems to me like the simplest solution.

How else would we pass this value?
- invent a special Expr which we can put in the SQL of the table-stats child 
query
- compute it based on the table-stats child query query profile after the query 
completed (profile counter currently does not exist, and is not easy to deduce, 
I checked)
- other ideas?


http://gerrit.cloudera.org:8080/#/c/6840/1/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

PS1, Line 587: hdfsTable
> nit: inline
Done


http://gerrit.cloudera.org:8080/#/c/6840/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

PS1, Line 185: private long totalHdfsBytes_;
> Do we still need this?
Yes. This is the actual total number of bytes based on the current set of file 
descriptors. We cannot replace it by the value stored in the stats just like we 
cannot replace a count(*) query with the row count from the stats because the 
set of files may have changed since the last stats computation.


PS1, Line 1328: tableStats_.num_rows
> getNumRows()?
Done


PS1, Line 1912: getExtrapolatedNumRows(p.getSize())
> Why do we need to compute this if we already have p.getNumRows() and is >= 
We follow the same policy here as with computing the scan cardinality. We 
always extrapolate because files could have been added/dropped from a partition 
since that partition's last stats computation. It's a case we explicitly wanted 
to handle.

Added comment.


PS1, Line 1958: tableStats_.num_rows
> getNumRows()?
Done


http://gerrit.cloudera.org:8080/#/c/6840/1/fe/src/main/java/org/apache/impala/catalog/Table.java
File fe/src/main/java/org/apache/impala/catalog/Table.java:

PS1, Line 76: // estimated number of rows in table; -1: unknown.
> update comment
Done


Line 492:     Preconditions.checkState(this instanceof HdfsTable);
> why have this function live here and not in hdfstable?
Good point. Done.


http://gerrit.cloudera.org:8080/#/c/6840/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

PS1, Line 616: computeStats(Analyzer analyzer) {
> The logic in this function is now very hard to follow. It's not clear when 
Added comments and spacing to separate logically different blocks. Hopefully, 
this helps clarify.


http://gerrit.cloudera.org:8080/#/c/6840/1/testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test:

Line 114:    stats-rows=7300 extrapolated-rows=7300
> to reduce verbosity, print the extrapolated count only when it differs from
We're using up a line already, and I think it's easier for users to see that 
the two numbers are the same as opposed to knowing that the code omits one of 
the numbers if it is he same as the other one (explicit is easier to reason 
about vs. implied absence).


http://gerrit.cloudera.org:8080/#/c/6840/1/testdata/workloads/functional-query/queries/QueryTest/alter-table.test
File testdata/workloads/functional-query/queries/QueryTest/alter-table.test:

Line 641: YEAR, MONTH, #ROWS, EXTRAP #ROWS, #FILES, SIZE, BYTES CACHED, CACHE 
REPLICATION, FORMAT, INCREMENTAL STATS, LOCATION
> extrap is a bit weird, and we don't use abbreviations elsewhere here. spell
The reason why I did not spell it out is because it wastes a lot of space in 
the SHOW TABLE STATS result table, but either way is fine with me.

So "EXTRAPOLATED #ROWS"? Want be be sure because it's very time consuming to 
change all these tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/6840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-HasComments: Yes

Reply via email to