Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )
Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG@30
PS3, Line 30: In terms of correctness DataSketches HLL is most of the time in 2%
           : range from the correct result but there are occasional spikes where
           : the difference is bigger but never goes out of the range of 5%.
A link to https://datasketches.apache.org/docs/HLL/HLL.html could be useful if someone is curious about the details. It could also be mentioned that while the HLL implementation can be parametrized, we are using it in a fixed way (HLL_4 with k=11).

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1593
PS3, Line 1593: // DataSketches doesn't accept 128bit integers so here we can sacrifice precision for
We don't need to use the convenient typed update function; we can simply hash this as a 16-byte string. As far as I know, the underlying Murmur hash implementation treats all types as a byte array + length anyway.

Do you know how Hive handles this? Even if it doesn't support it, we should still mention in a comment that we are emulating Hive here.

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1595
PS3, Line 1595: // Raise this warning only once per query.
I think this can be raised a few times per query if more than one instance is working on the table.

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1625
PS3, Line 1625: double val;
             : const Timezone* tz = ctx->impl()->state()->time_zone_for_unix_time_conversions();
             : if (tm_src.ToSubsecondUnixTime(tz, &val)) {
Does Hive also convert to double (as seconds from the Unix epoch)?
Unless we are emulating Hive, I think this is not the best way to do it, as the conversion loses precision for timestamps far from 1970-01-01 (the same issue exists for AVG too).

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1650
PS3, Line 1650: std::string serialized_src_sketch_str(reinterpret_cast<char*>(src.ptr), src.len);
             : std::stringstream serialized_src_sketch(serialized_src_sketch_str);
optional: I was looking for ways to avoid the two copies that occur here and at similar places. The standard library has no istream implementation that works off a simple byte array + length, but it is actually very simple to create one: https://tuttlem.github.io/2014/08/18/getting-istream-to-work-off-a-byte-array.html
Something similar should be possible for serialization too.

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1684
PS3, Line 1684: ctx->Free(src.ptr);
It would be nice to add a comment somewhere that describes the way hll_sketch works - e.g. unlike NDV, the StringVal does not contain the HLL array directly, but a wrapper class that uses the heap for the objects it owns, including the HLL array itself.

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@18
PS3, Line 18: ds_hll_estimate(ds_hll_sketch(timestamp_col))
whitespace

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@21
PS3, Line 21: 100
Are you sure that this won't be flaky? The value is the same as with count distinct, but NDV is not accurate here (101). Adding a filter that limits the query to a single partition (and thus a single file) could ensure that the value is the same regardless of the order of merges.
http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@90
PS3, Line 90: UDF ERROR: 128bit Decimal precision not supported.
Do you know if this is also the case in Hive?

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@115
PS3, Line 115: # Write sketches to a table as string and get an estimate from the written sketch.
Can you mention that this will change in the future once BINARY is implemented (IMPALA-9482), as Hive writes BINARY instead of STRING?

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142
PS3, Line 142: # Check that ds_hll_estimate returns null for strings that are not serialized sketches.
Do you know how Hive handles this? I would consider returning an error instead if we don't support ds_hll for a type.

--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Jun 2020 13:16:23 +0000
Gerrit-HasComments: Yes