[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Sahil Takiar (Code Review) Fri, 12 Jun 2020 09:57:40 -0700

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )


Change subject: IMPALA-2658: Extend the NDV function to accept a precision
......................................................................


Patch Set 27:

(34 comments)

http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@13
PS27, Line 13: 1. NDV([DISTINCT | ALL] expression)
             : 2. NDV([DISTINCT | ALL] expression, <abstract_value>)
this doesn't look correct, take a look at the syntax for group_concat: 
https://impala.apache.org/docs/build/html/topics/impala_group_concat.html


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@14
PS27, Line 14: <abstract_value>
what does "abstract value" mean? it should have a more descriptive name, like 
"precision_scale"


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@31
PS27, Line 31: ,
nit: delete


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@34
PS27, Line 34: Hll
nit: should be "HLL". same applies to other references to "hll" in this commit 
message.


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@39
PS27, Line 39: The following two new error messages can be raised.
             :
             : 1. Second parameter of NDV() must be an integer literal.
             : 2. Second parameter of NDV() must be an integer literal in 
[1,10].
don't think this section is necessary


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@46
PS27, Line 46: tempalte
typo


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@51
PS27, Line 51: HDV
typo: NDV()


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@53
PS27, Line 53: Testing
this section + the performance section should be written in the past tense 
because it refers to testing that you already did + performance runs that 
already complete


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@57
PS27, Line 57:    select ndv(c_name, 1) "one", ndv(c_name, 2) two, ndv(c_name, 
3) three,
             :    ndv(c_name, 4) as four, ndv(c_name, 5) as five, ndv(c_name, 
6) as six,
             :    ndv(c_name, 7) as seven, ndv(c_name, 8) as eight, ndv(c_name, 
9) as nine,
             :    ndv(c_name, 10)  as ten
             :    from tpch.customer;
             :
             :    select ndv(ss_sold_time_sk, 1) "one", ndv(ss_sold_time_sk, 2) 
two,
             :    ndv(ss_sold_time_sk, 3) three, ndv(ss_sold_time_sk, 4) as 
four,
             :    ndv(ss_sold_time_sk, 5) as five, ndv(ss_sold_time_sk, 6) as 
six,
             :    ndv(ss_sold_time_sk, 7) as seven, ndv(ss_sold_time_sk, 8) as 
eight,
             :    ndv(ss_sold_time_sk, 9) as nine, ndv(ss_sold_time_sk, 10) as 
ten
             :    from tpcds.store_sales;
I don't think u need to explicitly include these


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@71
PS27, Line 71: s
nit: typo


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@72
PS27, Line 72: ;
nit: delete


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@78
PS27, Line 78:     - 5 sets with 10 million unique strings
             :     - 5 sets with 10 million unique integers
             :     - 5 sets with 100 million unique strings
             :     - 5 sets with 97 million unique integers
             :     - 1 set with 499 million unique strings
             :     - 1 set with 450 million unique integers
I don't think you really need to list these out, you don't list out the error 
for each experiment anyway, so I'm not sure what value listing out each 
experiment adds.


http://gerrit.cloudera.org:8080/#/c/15997/27//COMMIT_MSG@99
PS27, Line 99: pretty a no-op
nit: revise this part of the sentence


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/common/logging.h
File be/src/common/logging.h:

http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/common/logging.h@48
PS27, Line 48: #define DCHECK_IN_RANGE(x, low, high) \
             :   {                                   \
             :     DCHECK_GE(x, low);                \
             :     DCHECK_LE(x, high);               \
             :   }
I don't think this need to be inside this 'IR_COMPILE' if statement, it can be 
moved to later in the class, perhaps before DCHECK_ENUM_EQ.

this function should have some comments describing what it does. for example, 
mention that the "in-range" check is inclusive of the "low" and "high" value.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1468
PS13, Line 1468:     IntVal* int_val = 
reinterpret_cast<IntVal*>(ctx->GetConstantArg(1));
> I improved it a little bit, from log2(x) to BitUtil::CountTrailingZeros((un
I think this is possible using the 'GetFunctionState()' and 'SetFunctionState' 
methods in FunctionContext. I think any memory allocated that is shared needs 
to be allocated / freed using Allocate/Free methods in FunctionContext.


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@1445
PS27, Line 1445: #ifndef NDEBUG
u don't need to surround this in a NDEBUG block since this is only called 
inside another NDEBUG block. so in a release build, this will already be dead 
code.


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@1446
PS27, Line 1446: AbstractValue
I think "AbstractValue" is an unclear way of referring to the 2nd parameter of 
the NDV function. something like "PrecisionScale" would be much clearer. this 
would require re-factoring the name here and elsewhere.


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@1576
PS27, Line 1576: hll_len
should be 'hll_len^2'?


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions-ir.cc@2487
PS27, Line 2487: // HLL version that accepts two arguments.
I'm not really sure this is more descriptive than the previous version. the 
point I'm trying to make here is that the comment should describe what the two 
arguments represent.

same comment applies for the description of the "single argument" version above.


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@196
PS27, Line 196: // default precision
you can delete this now, since the name makes it clear it is the default 
precision


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@199
PS27, Line 199: <c>
what does "<c>" mean?


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@199
PS27, Line 199: 2nd form
I'm not sure referring to the new NDV function as a "2nd form" really makes 
sense. it would just be clearer to refer to the parameter name itself since 
logically NDV(expression) still has a default precision value.


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@214
PS27, Line 214: This version updates for the single input version of NDV()
revise to something like: "Update function for single input version of NDV()"


http://gerrit.cloudera.org:8080/#/c/15997/27/be/src/exprs/aggregate-functions.h@219
PS27, Line 219:  double input
what do you mean by "double input"?


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@196
PS13, Line 196: DEFAULT_HLL_PRECISION = 10; // default p
> Done.
yeah generally fine to rename variables that span classes since it is just 
re-factoring.


http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@21
PS27, Line 21: import java.math.BigDecimal;
is this used anywhere?


http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@479
PS27, Line 479: protected
this can be a private function


http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@59
PS27, Line 59: Hyperloglog
nit: HyperLogLog


http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@59
PS27, Line 59: ndv
nit: let's stick to the upper-version of function methods, so this should be 
'NDV()'


http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@64
PS27, Line 64: hash
nit: remove this. description of an instance variable should refer to the base 
type, not the concrete type.


http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@894
PS27, Line 894: static private
should be 'private static'


http://gerrit.cloudera.org:8080/#/c/15997/27/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1368
PS27, Line 1368: getArgs()
is it always guaranteed that 'getArgs()' returns an array of at least length 1? 
It would be good to add Preconditions check that validates this.


http://gerrit.cloudera.org:8080/#/c/15997/13/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/15997/13/tests/query_test/test_aggregation.py@86
PS13, Line 86: ndv_results
since this only used in the test_ndv function it can be scoped to the method 
itself rather than defined globally.


http://gerrit.cloudera.org:8080/#/c/15997/13/tests/query_test/test_aggregation.py@293
PS13, Line 293:   # Test two argument version of NDV() against different
              :   # column data types. The 2nd argument is an integer in range 
[1, 10].
this should be converted to a Python method comment. e.g. under the 'def 
test_ndv(self):' there should a block like this:

 """Test two argument version of NDV() against different
 column data types. The 2nd argument is an integer in range [1, 10]."""



--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Comment-Date: Fri, 12 Jun 2020 16:56:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2658: Extend the NDV function to accept a precision

Reply via email to