Hi Impala development community,
I am a member of Apache DataSketches team. I looked at the DataSketches
functions in Impala and I have a few questions and suggestions.

Regarding the dependency. I see that the current approach is to copy all
files from Datasketches into a single pile. Is there a better way?

I see that you are using version 3.0.0 currently. We released version 3.1.0
about a month ago. It has some important bug fixes, and a Theta sketch
feature to avoid some cost of deserialization. This feature can speed up
Theta union operations.

Regarding your approach to serializing sketches. We have two ways of
serializing and deserializing sketches: stream and bytes. You are using
bytes for deserializing, but a temporary stringstream for serializing
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624
just to copy the internals of this stream as bytes (using memcopy) in the
StringStreamToStringVal function
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80
I would think it should be faster to use serialization to bytes, and the
StringStreamToStringVal function can be eliminated.

Regarding deserialization. I see in some cases that a sketch constructor is
called just to replace this instance with a deserialized one. This extra
construction seems unnecessary.
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819

Looking at this DsHllMerge function
https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759
it seems that the merge is done pairwise. Is it possible to arrange this
process as init, multiple merges and finalize (serialize) at the end? It is
quite costly to initialize a union, update it with two sketches and then
call get_result(). If many such merges happen, the overhead of initializing
a fresh union and finalizing it for each pair can be substantial.

Regards,
Alexander.

Reply via email to