Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
......................................................................


Patch Set 13:

(21 comments)

Done and please review one more time.

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1441
PS13, Line 1441: ComputeSizeOfIntermediateTypeForNDV
> this needs documentation.
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1452
PS13, Line 1452:   if (ctx->GetNumArgs() == 2) {
> it seems this whole code block is only used to validate the input parameter
Wrapped the entire block in #ifndef NDEBUG.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1457
PS13, Line 1457: if (int_val)
> is this ever possible? can int_val be null if GetNumArgs() == 2?
Yes. In a parallel plan, this method can be called for the merge() part of the 
expression evaluation, where the actual number of arguments is 1 (the 
intermediate result).


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1466
PS13, Line 1466:   int hll_len = dst->len;
> this copies the int value from dst->len to hll_len, is this necessary? if n
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1468
PS13, Line 1468:   int precision = log2(hll_len);
> is there a reason this needs to be re-computed during each update?
I improved it a little bit, from log2(x) to 
BitUtil::CountTrailingZeros((unsigned int)size, sizeof(size) * CHAR_BIT).

We could cache the precision value during HllInit() call to a replace and 
reuse. But I have not found the place to do so. If we use ctx, does it mean we 
need to add a new data member?


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1481
PS13, Line 1481: // Two input argument version, which calls the single input 
argument version
> same comment as elsewhere
Same answer as above.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1510
PS13, Line 1510: // Two input argument version which calls the single input 
argument version.
> should be more descriptive that just "Two input argument version"
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1523
PS13, Line 1523:   DCHECK_EQ(dst->len, src.len);
> is there a reason the order of these got switched?
Made the src.len the first argument in the DCHECK_EQ() which reads more natural.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1557
PS13, Line 1557: actual_harmonic_mean
> it doesn't look like this variable exists anywhere, did you mean 'harmonic_
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1575
PS13, Line 1575:
> nit: extra new line
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@2467
PS13, Line 2467: // Two input argument version.
> this could be more descriptive. what does the two input argument version do
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@192
PS13, Line 192:   /// 2) HyperLogLog in Practice (paper from google with some 
improvements)
> probably worth mentioning either here or elsewhere that these functions tak
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@196
PS13, Line 196: HLL_PRECISION = 10; // default precision
> probably worth just renaming this to DEFAULT_HLL_PRECISION to keep it consi
Done.

I hesitated a little bit last on making the change, as this will affect several 
other modules (e.g., sampled_ndv, and test  modules).


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@203
PS13, Line 203: HLL_LEN
> same here, rename to DEFAULT_HLL_LEN
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65
PS13, Line 65: HashMap
> should be Map instead of HashMap
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65
PS13, Line 65: ArrayList
> should be List instead of ArrayList: https://stackoverflow.com/questions/22
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@999
PS13, Line 999: fakedHllIntermediateType
> it might be cleaner to just use the defaultHllIntermediateType here instead
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1009
PS13, Line 1009: ArrayList
> should be declared as a "List".
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1013
PS13, Line 1013:         ndvList.add(AggregateFunction.createBuiltin(db, "ndv",
               :             Lists.newArrayList(t, Type.INT), Type.BIGINT, 
hllIntermediateType,
               :             prefix + 
"7HllInitEPN10impala_udf15FunctionContextEPNS1_9StringValE",
               :             prefix + HLL_UPDATE_SYMBOL_TWO_ARGS.get(t),
               :             prefix + 
"8HllMergeEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_",
               :             null,
               :             prefix + 
"11HllFinalizeEPN10impala_udf15FunctionContextERKNS1_9StringValE",
               :             true, false, true));
> it looks like this is pretty much a duplicate of the code on lines 998-1005
Done.

Good point.


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1022
PS13, Line 1022:       builtinNDVs_.put(t, ndvList);
> isn't of adding all of these into a HashMap, what about just creating the A
According to Aman, this init() method is called only once during the init of 
the FE. Once it is done, the builtInDB is immutable and used throughout the 
entire life span of the compiler.

I would think it probably is better to pre-create these small objects in 
advance and reuse them for any number of NDV() in any queries processed. There 
is also the benefit of saving memory in JVM.

Would it be OK to keep the current logic?


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1365
PS13, Line 1365: ArrayList
> nit: should be List
Done



--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Comment-Date: Mon, 08 Jun 2020 22:41:04 +0000
Gerrit-HasComments: Yes

Reply via email to