Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/15997 )
Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision ...................................................................... Patch Set 13: (21 comments) Done and please review one more time. http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1441 PS13, Line 1441: ComputeSizeOfIntermediateTypeForNDV > this needs documentation. Done http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1452 PS13, Line 1452: if (ctx->GetNumArgs() == 2) { > it seems this whole code block is only used to validate the input parameter Wrapped the entire block in #ifndef NDEBUG. http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1457 PS13, Line 1457: if (int_val) > is this ever possible? can int_val be null if GetNumArgs() == 2? Yes. In a parallel plan, this method can be called for the merge() part of the expression evaluation, where the actual number of arguments is 1 (the intermediate result). http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1466 PS13, Line 1466: int hll_len = dst->len; > this copies the int value from dst->len to hll_len, is this necessary? if n Done http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1468 PS13, Line 1468: int precision = log2(hll_len); > is there a reason this needs to be re-computed during each update? I improved it a little bit, from log2(x) to BitUtil::CountTrailingZeros((unsigned int)size, sizeof(size) * CHAR_BIT). We could cache the precision value during HllInit() call to a replace and reuse. But I have not found the place to do so. If we use ctx, does it mean we need to add a new data member? http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1481 PS13, Line 1481: // Two input argument version, which calls the single input argument version > same comment as elsewhere Same answer as above. http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1510 PS13, Line 1510: // Two input argument version which calls the single input argument version. > should be more descriptive that just "Two input argument version" Done http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1523 PS13, Line 1523: DCHECK_EQ(dst->len, src.len); > is there a reason the order of these got switched? Made the src.len the first argument in the DCHECK_EQ() which reads more natural. http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1557 PS13, Line 1557: actual_harmonic_mean > it doesn't look like this variable exists anywhere, did you mean 'harmonic_ Done http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1575 PS13, Line 1575: > nit: extra new line Done http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@2467 PS13, Line 2467: // Two input argument version. > this could be more descriptive. what does the two input argument version do Done http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@192 PS13, Line 192: /// 2) HyperLogLog in Practice (paper from google with some improvements) > probably worth mentioning either here or elsewhere that these functions tak Done http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@196 PS13, Line 196: HLL_PRECISION = 10; // default precision > probably worth just renaming this to DEFAULT_HLL_PRECISION to keep it consi Done. I hesitated a little bit last on making the change, as this will affect several other modules (e.g., sampled_ndv, and test modules). http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@203 PS13, Line 203: HLL_LEN > same here, rename to DEFAULT_HLL_LEN Done http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65 PS13, Line 65: HashMap > should be Map instead of HashMap Done http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65 PS13, Line 65: ArrayList > should be List instead of ArrayList: https://stackoverflow.com/questions/22 Done http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@999 PS13, Line 999: fakedHllIntermediateType > it might be cleaner to just use the defaultHllIntermediateType here instead Done http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1009 PS13, Line 1009: ArrayList > should be declared as a "List". Done http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1013 PS13, Line 1013: ndvList.add(AggregateFunction.createBuiltin(db, "ndv", : Lists.newArrayList(t, Type.INT), Type.BIGINT, hllIntermediateType, : prefix + "7HllInitEPN10impala_udf15FunctionContextEPNS1_9StringValE", : prefix + HLL_UPDATE_SYMBOL_TWO_ARGS.get(t), : prefix + "8HllMergeEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_", : null, : prefix + "11HllFinalizeEPN10impala_udf15FunctionContextERKNS1_9StringValE", : true, false, true)); > it looks like this is pretty much a duplicate of the code on lines 998-1005 Done. Good point. http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1022 PS13, Line 1022: builtinNDVs_.put(t, ndvList); > isn't of adding all of these into a HashMap, what about just creating the A According to Aman, this init() method is called only once during the init of the FE. Once it is done, the builtInDB is immutable and used throughout the entire life span of the compiler. I would think it probably is better to pre-create these small objects in advance and reuse them for any number of NDV() in any queries processed. There is also the benefit of saving memory in JVM. Would it be OK to keep the current logic? http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1365 PS13, Line 1365: ArrayList > nit: should be List Done -- To view, visit http://gerrit.cloudera.org:8080/15997 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58 Gerrit-Change-Number: 15997 Gerrit-PatchSet: 13 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Comment-Date: Mon, 08 Jun 2020 22:41:04 +0000 Gerrit-HasComments: Yes
