Hi Impala development community, I am a member of Apache DataSketches team. I looked at the DataSketches functions in Impala and I have a few questions and suggestions.
Regarding the dependency. I see that the current approach is to copy all files from Datasketches into a single pile. Is there a better way? I see that you are using version 3.0.0 currently. We released version 3.1.0 about a month ago. It has some important bug fixes, and a Theta sketch feature to avoid some cost of deserialization. This feature can speed up Theta union operations. Regarding your approach to serializing sketches. We have two ways of serializing and deserializing sketches: stream and bytes. You are using bytes for deserializing, but a temporary stringstream for serializing https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1624 just to copy the internals of this stream as bytes (using memcopy) in the StringStreamToStringVal function https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/datasketches-common.cc#L80 I would think it should be faster to use serialization to bytes, and the StringStreamToStringVal function can be eliminated. Regarding deserialization. I see in some cases that a sketch constructor is called just to replace this instance with a deserialized one. This extra construction seems unnecessary. https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1819 Looking at this DsHllMerge function https://github.com/apache/impala/blob/3d365276ea00f349df3629944b731eb4408d2c4f/be/src/exprs/aggregate-functions-ir.cc#L1759 it seems that the merge is done pairwise. Is it possible to arrange this process as init, multiple merges and finalize (serialize) at the end? It is quite costly to initialize a union, update it with two sketches and then call get_result(). If many such merges happen, the overhead of initializing a fresh union and finalizing it for each pair can be substantial. Regards, Alexander.