[email protected] has posted comments on this change. ( http://gerrit.cloudera.org:8080/14666 )
Change subject: IMPALA-8205: Support number of true and false statistics for boolean column ...................................................................... Patch Set 6: (11 comments) > (8 comments) > > does this include computing stats for boolean columns in Kudu as > well? (if not thats okay, just wondering) No, I didn't consider kudu in this patch currently; Do you think it is valuable or necessary if I make this feature also support kudu in another patch? http://gerrit.cloudera.org:8080/#/c/14666/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14666/1//COMMIT_MSG@9 PS1, Line 9: > Please wrap the commit message at 72 characters per line and replace ";" wi Done http://gerrit.cloudera.org:8080/#/c/14666/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14666/3//COMMIT_MSG@11 PS3, Line 11: . > nit: . Done http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/catalog-op-executor.cc File be/src/exec/catalog-op-executor.cc: http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/catalog-op-executor.cc@286 PS1, Line 286: , > nit: . Done http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/incr-stats-util.cc File be/src/exec/incr-stats-util.cc: http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/incr-stats-util.cc@164 PS1, Line 164: num_trues > num_new_trues? Done http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/incr-stats-util.cc@165 PS1, Line 165: num_falses > num_new_falses? Done http://gerrit.cloudera.org:8080/#/c/14666/1/common/thrift/CatalogObjects.thrift File common/thrift/CatalogObjects.thrift: http://gerrit.cloudera.org:8080/#/c/14666/1/common/thrift/CatalogObjects.thrift@172 PS1, Line 172: required > Shoule this be optional? It's just set for boolean column. Even for non-boolean columns, we will return -1 to indicate the statistics is missing; I will double check it; http://gerrit.cloudera.org:8080/#/c/14666/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/14666/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@293 PS6, Line 293: "COUNT(CASE WHEN " + colRefSql + " = TRUE THEN 1 ELSE NULL END)"); > why not just use "sum(" + colRefSql ")");? I'm not sure if one way is bette Hi Sahil, I do conduct a test to compare their performance; Conclusion: I didn't find obvious performance difference; Below is my test detail: I made test based on parquet and textfile tables; The queries are as below I added a boolean column for a tpch table which is previously used for tpch test which 60 billion lines; Query 1: select sum(case when bool_test = true then 1 else 0 end) as numTrue, sum(case when bool_test = false then 1 else 0 end) as numFalse from [TEST TABLE]; Query 2: select sum(bool_test) as numTrue, count(bool_test) - sum(bool_test) from lineitem_boolean_text; Query: select sum(bool_test) as numTrue, count(bool_test) - sum(bool_test) from [TEST TABLE] In my impala cluster, both above queries' runtime varies from 4s ~ 10s, no obvious difference occured; Somtimes the Query 1 is faster than Query 2, and somtimes the opposite. http://gerrit.cloudera.org:8080/#/c/14666/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@297 PS6, Line 297: columnStatsSelectList.add("NULL"); : columnStatsSelectList.add("NULL"); > what is this for? The null is used to mark that this column is not a boolean column so the numTrue and numFalse stats is meaningless; You know that numTrue and numFalse is very different from other columns because it definitely depends on the column type. For non-boolean column, we use NULL to mark it. http://gerrit.cloudera.org:8080/#/c/14666/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java: http://gerrit.cloudera.org:8080/#/c/14666/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@512 PS1, Line 512: ing(); > nit: numTrues_ Done http://gerrit.cloudera.org:8080/#/c/14666/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@513 PS1, Line 513: > nit: numFalses_ Done http://gerrit.cloudera.org:8080/#/c/14666/3/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java: http://gerrit.cloudera.org:8080/#/c/14666/3/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@39 PS3, Line 39: /** > Unused imports? Done -- To view, visit http://gerrit.cloudera.org:8080/14666 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I991bee8e7fdc644d908289f5fe2ee8032cc2c431 Gerrit-Change-Number: 14666 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Comment-Date: Thu, 27 Feb 2020 04:01:15 +0000 Gerrit-HasComments: Yes
