Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 4:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG@30
PS3, Line 30: In terms of correctness DataSketches HLL is most of the time in 2%
            : range from the correct result but there are occasional spikes 
where
            : the difference is bigger but never goes out of the range of 5%.
> A link to https://datasketches.apache.org/docs/HLL/HLL.html could be useful
Sure, done.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1593
PS3, Line 1593:     sketch_ptr->update((void*)(&src.val16), 
sizeof(impala::int128_t));
> We don't need to use the convenient typed update function, we can simply ha
Good point! I don't have to loose precision here for int128 as casting it to 
void* and giving the length to update() would also work.
I can also get rid of this warning/error functionality then.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1595
PS3, Line 1595: }
> I think that this can be raised a few time in a query if more than one inst
Indeed, I just tried this locally on a bigger dataset and apparently there is 
one warning per instance. In my opinion we can leave it as it is as the purpose 
here was not to flood the screen with one error per row.
Modified the comment, though.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1625
PS3, Line 1625:
              : StringVal AggregateFunctions::DsHllSerialize(FunctionContext* 
ctx,
              :     const StringVal& src) {
> Does Hive also convert to double (as seconds from Unix epoch)? Unless we ar
Leave this question open until we find out how Hive tackles the same if it does.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1650
PS3, Line 1650:   union_sketch.update(*dst_sketch_ptr);
              :
> optional: I was looking for ways to avoid the 2 copies that occur here and
I found that there is another interface for deserialization:
deserialize(const void* bytes, size_t len)
I can use this to get rid of the 2 copies.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1684
PS3, Line 1684:   static const uint32
> It would be nice to add a comment somewhere that describes the way hll_sket
Sure, added a comment about this to DsHllInit where I first allocate memory for 
hll_sketch.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@18
PS3, Line 18:     ds_hll_estimate(ds_hll_sketch(timestamp_col))
> whitespace
Done


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@21
PS3, Line 21: 100
> Are you sure that this won't be flaky? The value is the same as with count
I've run this a few times and appeared stable. In theory until a certain size 
of data HLL will pre-populate its cache and during this phase it will give 
correct results. It will start to approximate only after this certain size.

Apparently, alltypessmall is small enough to always give the correct result;


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@90
PS3, Line 90: insert into sketch_store
> Do you know if this is also the case in Hive?
This error scenario could be dropped using the update(void* bytes, size_t 
length) interface.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@115
PS3, Line 115: where date_string_col = "03/01/09";
> Can you mention that this will change in the future when BINARY will be imp
Done


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142
PS3, Line 142:
> Do you know how Hive handles this? I would consider returning an error inst
We currently give an error for types not supported for ds_hll_functions. E.g. 
ds_hll_estimate() only accepts strings and for other types it gives an 
AnalysisException.
For strings we can decide if it's a serialized sketch only when we read that 
particular row so we can't return earlier. What I can do here is to write a 
warning or give an error depending on how ABORT_ON_ERROR is set.
What do you think?



--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Jun 2020 11:53:09 +0000
Gerrit-HasComments: Yes

Reply via email to