Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15997 )
Change subject: IMPALA-2658: Extend the NDV function to accept a precision ...................................................................... Patch Set 27: (34 comments) http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@13 PS27, Line 13: 1. NDV([DISTINCT | ALL] expression) : 2. NDV([DISTINCT | ALL] expression, <abstract_value>) this doesn't look correct, take a look at the syntax for group_concat: https://impala.apache.org/docs/build/html/topics/impala_group_concat.html http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@14 PS27, Line 14: <abstract_value> what does "abstract value" mean? it should have a more descriptive name, like "precision_scale" http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@31 PS27, Line 31: , nit: delete http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@34 PS27, Line 34: Hll nit: should be "HLL". same applies to other references to "hll" in this commit message. http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@39 PS27, Line 39: The following two new error messages can be raised. : : 1. Second parameter of NDV() must be an integer literal. : 2. Second parameter of NDV() must be an integer literal in [1,10]. don't think this section is necessary http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@46 PS27, Line 46: tempalte typo http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@51 PS27, Line 51: HDV typo: NDV() http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@53 PS27, Line 53: Testing this section + the performance section should be written in the past tense because it refers to testing that you already did + performance runs that already complete http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@57 PS27, Line 57: select ndv(c_name, 1) "one", ndv(c_name, 2) two, ndv(c_name, 3) three, : ndv(c_name, 4) as four, ndv(c_name, 5) as five, ndv(c_name, 6) as six, : ndv(c_name, 7) as seven, ndv(c_name, 8) as eight, ndv(c_name, 9) as nine, : ndv(c_name, 10) as ten : from tpch.customer; : : select ndv(ss_sold_time_sk, 1) "one", ndv(ss_sold_time_sk, 2) two, : ndv(ss_sold_time_sk, 3) three, ndv(ss_sold_time_sk, 4) as four, : ndv(ss_sold_time_sk, 5) as five, ndv(ss_sold_time_sk, 6) as six, : ndv(ss_sold_time_sk, 7) as seven, ndv(ss_sold_time_sk, 8) as eight, : ndv(ss_sold_time_sk, 9) as nine, ndv(ss_sold_time_sk, 10) as ten : from tpcds.store_sales; I don't think u need to explicitly include these http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@71 PS27, Line 71: s nit: typo http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@72 PS27, Line 72: ; nit: delete http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@78 PS27, Line 78: - 5 sets with 10 million unique strings : - 5 sets with 10 million unique integers : - 5 sets with 100 million unique strings : - 5 sets with 97 million unique integers : - 1 set with 499 million unique strings : - 1 set with 450 million unique integers I don't think you really need to list these out, you don't list out the error for each experiment anyway, so I'm not sure what value listing out each experiment adds. http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@99 PS27, Line 99: pretty a no-op nit: revise this part of the sentence http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/common/logging.h File be/src/common/logging.h: http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/common/logging.h@48 PS27, Line 48: #define DCHECK_IN_RANGE(x, low, high) \ : { \ : DCHECK_GE(x, low); \ : DCHECK_LE(x, high); \ : } I don't think this need to be inside this 'IR_COMPILE' if statement, it can be moved to later in the class, perhaps before DCHECK_ENUM_EQ. this function should have some comments describing what it does. for example, mention that the "in-range" check is inclusive of the "low" and "high" value. http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1468 PS13, Line 1468: IntVal* int_val = reinterpret_cast<IntVal*>(ctx->GetConstantArg(1)); > I improved it a little bit, from log2(x) to BitUtil::CountTrailingZeros((un I think this is possible using the 'GetFunctionState()' and 'SetFunctionState' methods in FunctionContext. I think any memory allocated that is shared needs to be allocated / freed using Allocate/Free methods in FunctionContext. http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@1445 PS27, Line 1445: #ifndef NDEBUG u don't need to surround this in a NDEBUG block since this is only called inside another NDEBUG block. so in a release build, this will already be dead code. http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@1446 PS27, Line 1446: AbstractValue I think "AbstractValue" is an unclear way of referring to the 2nd parameter of the NDV function. something like "PrecisionScale" would be much clearer. this would require re-factoring the name here and elsewhere. http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@1576 PS27, Line 1576: hll_len should be 'hll_len^2'? http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@2487 PS27, Line 2487: // HLL version that accepts two arguments. I'm not really sure this is more descriptive than the previous version. the point I'm trying to make here is that the comment should describe what the two arguments represent. same comment applies for the description of the "single argument" version above. http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@196 PS27, Line 196: // default precision you can delete this now, since the name makes it clear it is the default precision http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@199 PS27, Line 199: <c> what does "<c>" mean? http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@199 PS27, Line 199: 2nd form I'm not sure referring to the new NDV function as a "2nd form" really makes sense. it would just be clearer to refer to the parameter name itself since logically NDV(expression) still has a default precision value. http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@214 PS27, Line 214: This version updates for the single input version of NDV() revise to something like: "Update function for single input version of NDV()" http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@219 PS27, Line 219: double input what do you mean by "double input"? http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@196 PS13, Line 196: DEFAULT_HLL_PRECISION = 10; // default p > Done. yeah generally fine to rename variables that span classes since it is just re-factoring. http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@21 PS27, Line 21: import java.math.BigDecimal; is this used anywhere? http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@479 PS27, Line 479: protected this can be a private function http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@59 PS27, Line 59: Hyperloglog nit: HyperLogLog http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@59 PS27, Line 59: ndv nit: let's stick to the upper-version of function methods, so this should be 'NDV()' http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@64 PS27, Line 64: hash nit: remove this. description of an instance variable should refer to the base type, not the concrete type. http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@894 PS27, Line 894: static private should be 'private static' http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1368 PS27, Line 1368: getArgs() is it always guaranteed that 'getArgs()' returns an array of at least length 1? It would be good to add Preconditions check that validates this. http://gerrit.cloudera.org:8080/#/c/15997/13/tests/query_test/test_aggregation.py File tests/query_test/test_aggregation.py: http://gerrit.cloudera.org:8080/#/c/15997/13/tests/query_test/test_aggregation.py@86 PS13, Line 86: ndv_results since this only used in the test_ndv function it can be scoped to the method itself rather than defined globally. http://gerrit.cloudera.org:8080/#/c/15997/13/tests/query_test/test_aggregation.py@293 PS13, Line 293: # Test two argument version of NDV() against different : # column data types. The 2nd argument is an integer in range [1, 10]. this should be converted to a Python method comment. e.g. under the 'def test_ndv(self):' there should a block like this: """Test two argument version of NDV() against different column data types. The 2nd argument is an integer in range [1, 10].""" -- To view, visit http://gerrit.cloudera.org:8080/15997 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58 Gerrit-Change-Number: 15997 Gerrit-PatchSet: 27 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Comment-Date: Fri, 12 Jun 2020 16:56:59 +0000 Gerrit-HasComments: Yes
