Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15997 )
Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision ...................................................................... Patch Set 13: (16 comments) some more comments http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1452 PS13, Line 1452: if (ctx->GetNumArgs() == 2) { it seems this whole code block is only used to validate the input parameters via DCHECKS. so it will only be run in debug builds, in which case it should be surrounded by a #ifdef NDEBUG ... #endif block. http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1457 PS13, Line 1457: if (int_val) is this ever possible? can int_val be null if GetNumArgs() == 2? http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1466 PS13, Line 1466: int hll_len = dst->len; this copies the int value from dst->len to hll_len, is this necessary? if not, then can the int just be a const reference? http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1481 PS13, Line 1481: // Two input argument version, which calls the single input argument version same comment as elsewhere http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1510 PS13, Line 1510: // Two input argument version which calls the single input argument version. should be more descriptive that just "Two input argument version" http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1523 PS13, Line 1523: DCHECK_EQ(dst->len, src.len); is there a reason the order of these got switched? http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1557 PS13, Line 1557: actual_harmonic_mean it doesn't look like this variable exists anywhere, did you mean 'harmonic_mean' http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1575 PS13, Line 1575: nit: extra new line http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@2467 PS13, Line 2467: // Two input argument version. this could be more descriptive. what does the two input argument version do? saying something like "HLL version that accepts a precision value" would be more informative. http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@192 PS13, Line 192: /// 2) HyperLogLog in Practice (paper from google with some improvements) probably worth mentioning either here or elsewhere that these functions take in an optional precision. there should be some description of what the precision means and the range of acceptable values. http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65 PS13, Line 65: HashMap should be Map instead of HashMap http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@999 PS13, Line 999: fakedHllIntermediateType it might be cleaner to just use the defaultHllIntermediateType here instead of creating a fake one http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1009 PS13, Line 1009: ArrayList should be declared as a "List". http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1013 PS13, Line 1013: ndvList.add(AggregateFunction.createBuiltin(db, "ndv", : Lists.newArrayList(t, Type.INT), Type.BIGINT, hllIntermediateType, : prefix + "7HllInitEPN10impala_udf15FunctionContextEPNS1_9StringValE", : prefix + HLL_UPDATE_SYMBOL_TWO_ARGS.get(t), : prefix + "8HllMergeEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_", : null, : prefix + "11HllFinalizeEPN10impala_udf15FunctionContextERKNS1_9StringValE", : true, false, true)); it looks like this is pretty much a duplicate of the code on lines 998-1005, its probably work abstracting this into a private method that takes the intermediate type as a parameter. even better would be to replace the call on lines 983-990 with this common method, you would have to make the update symbol a parameter as well. http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1022 PS13, Line 1022: builtinNDVs_.put(t, ndvList); isn't of adding all of these into a HashMap, what about just creating the AggregateFunction at runtime? e.g. instead of pre-populating all the possible AggregateFunctions for all possible intermediate types and all possible data types, just add a method here that creates it on-demand? FunctionCallExpr can then just call that method? I think its simpler than using this Map<Type, List<AggregateFunction>> object; there shouldn't be much overhead in just re-creating the object time since its small. http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1365 PS13, Line 1365: ArrayList nit: should be List -- To view, visit http://gerrit.cloudera.org:8080/15997 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58 Gerrit-Change-Number: 15997 Gerrit-PatchSet: 13 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Comment-Date: Fri, 05 Jun 2020 21:41:30 +0000 Gerrit-HasComments: Yes
