Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG@30
PS3, Line 30: In terms of correctness DataSketches HLL is most of the time in 2%
            : range from the correct result but there are occasional spikes 
where
            : the difference is bigger but never goes out of the range of 5%.
A link to https://datasketches.apache.org/docs/HLL/HLL.html could be useful if 
someone is curious about the details.

I could be also mentioned that while the HLL implementation can be 
parametrized, we are using it in a fixed way (HLL_4 with k=11).


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1593
PS3, Line 1593:     // DataSketches doesn't accept 128bit integers so here we 
can sacrifice precision for
We don't need to use the convenient typed update function, we can simply hash 
this as a 16 byte string.
The underlying Murmur hash implementation treats all types as byte array + 
length as far as I know.

Do you know how Hive handles this? Even if it doesn't support it we should 
still mention in a comment that we are emulating Hive here.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1595
PS3, Line 1595:     // Raise this warning only once per query.
I think that this can be raised a few time in a query if more than one 
instances are working on the table.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1625
PS3, Line 1625:   double val;
              :   const Timezone* tz = 
ctx->impl()->state()->time_zone_for_unix_time_conversions();
              :   if (tm_src.ToSubsecondUnixTime(tz, &val)) {
Does Hive also convert to double (as seconds from Unix epoch)? Unless we are 
emulating Hive, I think that this is a not the best way to do it, as this 
conversion loses precision for timestamps far from 1970-01-01 (the same issue 
exists for AVG too).


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1650
PS3, Line 1650:   std::string 
serialized_src_sketch_str(reinterpret_cast<char*>(src.ptr), src.len);
              :   std::stringstream 
serialized_src_sketch(serialized_src_sketch_str);
optional: I was looking for ways to avoid the 2 copies that occur here and at 
similar places. There is no istream implementation that uses a simple byte 
array + length, but it is actually very simple to create one: 
https://tuttlem.github.io/2014/08/18/getting-istream-to-work-off-a-byte-array.html
 Something similar should be possible for serialization too.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1684
PS3, Line 1684:   ctx->Free(src.ptr);
It would be nice to add a comment somewhere that describes the way hll_sketch 
works - e.g. unlike NDV, the StringVal does not contain the HLL array directly, 
but a wrapper classes that uses heap for the objects it uses, including the HLL 
array itself.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@18
PS3, Line 18:     ds_hll_estimate(ds_hll_sketch(timestamp_col))
whitespace


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@21
PS3, Line 21: 100
Are you sure that this won't be flaky? The value is the same as with count 
distinct, but NDV is not accurate here (101). Adding a filter that limits the 
query to a single partition (and thus single file) could ensure that the value 
is the same regardless of the order of merges.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@90
PS3, Line 90: UDF ERROR: 128bit Decimal precision not supported.
Do you know if this is also the case in Hive?


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@115
PS3, Line 115: # Write sketches to a table as string and get an estimate from 
the written sketch.
Can you mention that this will change in the future when BINARY will be 
implemented (IMPALA-9482), as Hive writes BINARY instead of STRING?


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142
PS3, Line 142: # Check that ds_hll_estimate returns null for strings that are 
not serialized sketches.
Do you know how Hive handles this? I would consider returning an error instead 
if don't support ds_hll for a type.



--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Jun 2020 13:16:23 +0000
Gerrit-HasComments: Yes

Reply via email to